bd-iaas-us / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai

[Feature]: Support flashdecoding++ async unified softmax #20

Open chizhang118 opened 1 month ago

chizhang118 commented 1 month ago

flashdecoding++ paper: https://arxiv.org/abs/2311.01282
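For context, the paper's core idea is an "asynchronized softmax with unified max value": instead of computing the true per-row max (which forces a synchronization/reduction across partial results), every partition subtracts a shared, statically chosen constant, so partial softmax results can be produced and summed without any later rescaling; inputs outside the range the constant was chosen for fall back to a recomputation path. A minimal CUDA sketch of the partition-side computation (the kernel name, `PHI`, and the fallback note are illustrative assumptions, not vLLM code):

```cuda
#define PHI 10.0f  // unified max value; the paper picks it from logit statistics

// Each block computes the partial softmax for one partition of the
// sequence, relative to the shared constant PHI instead of the true
// row max, so no cross-partition max reduction (and no later rescale)
// is needed. Assumes blockDim.x == 256.
__global__ void partial_softmax_unified_max(const float* logits,
                                            float* exp_vals,   // exp(logit - PHI)
                                            float* exp_sums,   // one sum per partition
                                            int seq_len) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    float e = 0.0f;
    if (idx < seq_len) {
        // (the paper adds a fallback: if a logit exceeds the range PHI
        // was chosen for, recompute with a standard safe softmax)
        e = __expf(logits[idx] - PHI);
        exp_vals[idx] = e;
    }

    // Per-partition sum of exponentials via a shared-memory tree reduction.
    __shared__ float smem[256];
    smem[threadIdx.x] = e;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) smem[threadIdx.x] += smem[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) exp_sums[blockIdx.x] = smem[0];
}
```

Because every partition uses the same PHI, the final reduce step only needs the plain sum of `exp_sums` and a division; no per-partition max bookkeeping survives.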

chizhang118 commented 1 month ago

Primarily related method paged_attention_v2_reduce_kernel: https://github.com/vllm-project/vllm/blob/main/csrc/attention/attention_kernels.cu#L567
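For reference, the v2 reduce pass has to merge per-partition partial results that were each normalized against their own local max m_i: find the global max, rescale each partition's exp-sum and partial output by exp(m_i - m_global), then divide. A simplified single-head sketch of that structure (names and the flat layout are assumptions; see the actual kernel at the link above):

```cuda
#include <math.h>

// Simplified single-head version of the structure in
// paged_attention_v2_reduce_kernel: merge per-partition partial results
// that were each normalized against their own local max m_i.
__global__ void reduce_partitions(const float* part_max,   // m_i, one per partition
                                  const float* part_sums,  // s_i = sum exp(x - m_i)
                                  const float* part_out,   // [num_parts, head_dim]
                                  float* out,              // [head_dim]
                                  int num_parts,
                                  int head_dim) {
    // 1) global max over partitions (recomputed per thread for brevity).
    float m_global = -INFINITY;
    for (int i = 0; i < num_parts; ++i)
        m_global = fmaxf(m_global, part_max[i]);

    // 2) global exp-sum: each partition's sum is rescaled by exp(m_i - m_global).
    float s_global = 0.0f;
    for (int i = 0; i < num_parts; ++i)
        s_global += part_sums[i] * __expf(part_max[i] - m_global);

    // 3) rescale and accumulate the partial outputs, then normalize.
    int d = blockIdx.x * blockDim.x + threadIdx.x;
    if (d < head_dim) {
        float acc = 0.0f;
        for (int i = 0; i < num_parts; ++i)
            acc += part_out[i * head_dim + d] * __expf(part_max[i] - m_global);
        out[d] = acc / s_global;
    }
}
```

Under flashdecoding++'s unified max, `part_max[i]` is the same constant for every partition, the `exp(m_i - m_global)` factors collapse to 1, and steps 1 and 2 reduce to a plain sum, which is what lets the partitions run fully asynchronously.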

chizhang118 commented 1 month ago

[screenshot attached: 20240802-144852]

chizhang118 commented 1 month ago

[screenshot attached: 20240802-145033]

JackChuang commented 1 month ago

I'm taking this issue!! Please assign this to me. Thanks.

JackChuang commented 3 weeks ago

Checked different block sizes and max qk sizes for a 7B model. Now trying to change the implementation of the qk calculation; see the sketch below for the step I mean.
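For concreteness, here is a deliberately generic baseline of the qk step: one logit per key token from a query-key dot product. `HEAD_DIM = 128` matches a 7B Llama-style model; everything else (names, one-query decode shape, contiguous key layout) is an assumption for illustration, not vLLM's actual thread-group layout:

```cuda
// One logit per key token: logits[t] = scale * dot(q, k[t]).
#define HEAD_DIM 128  // typical 7B-model head size

__global__ void qk_logits(const float* __restrict__ q,      // [HEAD_DIM]
                          const float* __restrict__ k,      // [num_tokens, HEAD_DIM]
                          float* __restrict__ logits,       // [num_tokens]
                          int num_tokens,
                          float scale) {                    // 1/sqrt(HEAD_DIM)
    int token = blockIdx.x * blockDim.x + threadIdx.x;
    if (token >= num_tokens) return;
    float acc = 0.0f;
    for (int d = 0; d < HEAD_DIM; ++d)
        acc += q[d] * k[token * HEAD_DIM + d];
    logits[token] = acc * scale;
}
```

In the real paged-attention kernel the keys are fetched through the block table in fixed-size pages and the dot product is split across a thread group; that is where the block-size choice interacts with this step.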

JackChuang commented 1 week ago

Confirm some implementation-related details: