chizhang118 opened 1 month ago
The primarily related method is `paged_attention_v2_reduce_kernel`: https://github.com/vllm-project/vllm/blob/main/csrc/attention/attention_kernels.cu#L567
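For context on what that reduce kernel does: the v2 paged-attention path splits the sequence into partitions, each producing a partial output normalized by its own local max logit and exp-sum, and the reduce kernel rescales and combines them. Below is a minimal NumPy sketch of that combine step (the function and argument names are mine, not the kernel's, and this omits the per-thread layout details of the real CUDA code):

```python
import numpy as np

def reduce_partitions(partial_outs, max_logits, exp_sums):
    """Combine per-partition attention outputs into the final output.

    partial_outs: (num_partitions, head_size) outputs, each already
                  normalized by its own local exp-sum.
    max_logits:   (num_partitions,) local max of the qk logits.
    exp_sums:     (num_partitions,) local sum of exp(logit - local max).
    """
    global_max = max_logits.max()
    # Rescale each partition's exp-sum from its local max to the global max.
    rescaled = exp_sums * np.exp(max_logits - global_max)
    weights = rescaled / rescaled.sum()
    # Weighted sum of partial outputs == full softmax-weighted output.
    return (weights[:, None] * partial_outs).sum(axis=0)
```

One can check algebraically that the rescaled weighted sum reproduces the exact softmax over the whole sequence, which is why the partitions can be computed independently.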
I'm taking this issue!! Please assign this to me. Thanks.
Check different block sizes and maximum qk sizes for a 7B model. Try changing the implementation of the qk calculation.
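As a reference point for the qk experiments, here is a NumPy sketch of how the qk logits are computed over a paged KV layout, where the block size sets the granularity of the block-table indirection. The names (`paged_qk`, `block_table`, etc.) are illustrative, not the kernel's actual identifiers:

```python
import numpy as np

def paged_qk(q, key_cache, block_table, seq_len, block_size):
    """Compute scaled qk logits for one head over paged key blocks.

    q:           (head_size,) query vector.
    key_cache:   (num_blocks, block_size, head_size) physical key blocks.
    block_table: logical-to-physical block indices for this sequence.
    """
    scale = 1.0 / np.sqrt(q.shape[0])
    logits = np.empty(seq_len)
    for pos in range(seq_len):
        # Translate the logical token position into a physical block slot.
        block = block_table[pos // block_size]
        offset = pos % block_size
        logits[pos] = scale * (q @ key_cache[block, offset])
    return logits
```

Sweeping `block_size` in a harness like this (and in the real kernel) changes how many keys each thread group touches per block, which is presumably what the benchmarking above is meant to measure.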
Confirm some implementation-related details:
flashdecoding++ paper: https://arxiv.org/abs/2311.01282
Q3 Collaboration Plan of Infra and IaaS Labs: https://bytedance.us.larkoffice.com/docx/HKXfdRh1noMrbAxcgL2ureGasdQ
FlashDecoding++ Summary: https://bytedance.larkoffice.com/wiki/WbqXwRL3qi0x18kJVkAcU0HZnte, including Asynchronized Softmax, Double Buffering, Heuristic Dataflow, and more profiling-related optimizations.
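On the first of those techniques: FlashDecoding++'s asynchronized softmax replaces the running (or cross-partition) max with a fixed unified max value chosen from profiled logit statistics, so partial exp-sums can be accumulated without synchronizing on the true max, with a fallback to standard safe softmax when a logit strays too far from that value. A minimal sketch, assuming a scalar unified max `phi` and an overflow `threshold` (both hypothetical parameters here):

```python
import numpy as np

def unified_max_softmax(logits, phi, threshold):
    """Softmax using a fixed unified max `phi` instead of the true max.

    exp(l - phi) / sum(exp(l - phi)) equals softmax(l) exactly; the only
    risk is numerical overflow, so fall back to the standard safe softmax
    when any logit deviates from phi by more than `threshold`.
    """
    if np.any(np.abs(logits - phi) > threshold):
        m = logits.max()           # fallback: standard safe softmax
        e = np.exp(logits - m)
        return e / e.sum()
    e = np.exp(logits - phi)       # no max exchange between partitions
    return e / e.sum()
```

In the real kernel the payoff is that partitions never need to exchange their local maxima before exponentiating, which removes a synchronization from the decode path; the fallback recomputation is expected to be rare when `phi` is chosen well.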
The new KV cache memory layout and performance benchmarks are also needed.
Inference Improvement Weekly Status and Progress: https://bytedance.us.larkoffice.com/docx/RGnPdj5gfoBN3YxuY3yuLtlQsjb