chizhang118 opened this issue 3 months ago
The primarily related method is `paged_attention_v2_reduce_kernel`: https://github.com/vllm-project/vllm/blob/main/csrc/attention/attention_kernels.cu#L567
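For anyone picking this up, here is a rough numpy sketch of the math I understand the v2 reduce step to perform (variable names are mine, not the kernel's): each partition of the sequence carries a local qk_max and exp_sum, and the reduce step rescales them to a shared max before combining the partial outputs.

```python
import numpy as np

def reduce_partitions(partial_out, partial_max, partial_exp_sum):
    """Combine per-partition partial attention outputs into one result.

    partial_out:     (num_partitions, head_size) partition outputs, each already
                     normalized by its own partition's exp_sum
    partial_max:     (num_partitions,) local qk_max of each partition
    partial_exp_sum: (num_partitions,) sum of exp(qk - local qk_max) per partition
    """
    global_max = partial_max.max()
    # Rescale each partition's exp_sum so all partitions share one reference max.
    rescaled = partial_exp_sum * np.exp(partial_max - global_max)
    total = rescaled.sum()
    # Weighted combination of the locally normalized partition outputs.
    return (partial_out * (rescaled / total)[:, None]).sum(axis=0)
```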
I'm taking this issue!! Please assign this to me. Thanks.
Check different block sizes and maximum qk values (qk_max) for a 7B model. Try changing the implementation of the qk calculation.
Confirm some implementation-related details:
-- Tried different qk_max values to understand how they affect inference precision.
-- Collected profiling data on qk_max, wrote a script to find the 99% range (sketch below), and determined an appropriate value to use.
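Roughly what that profiling script does, as a hedged sketch (the dump file name below is a placeholder, not the actual path): collect the logged qk_max samples and take the central 99% interval.

```python
import numpy as np

def qk_max_range(samples, coverage=0.99):
    """Return the central `coverage` interval of observed qk_max values."""
    samples = np.asarray(samples, dtype=np.float64)
    lower = np.percentile(samples, (1.0 - coverage) / 2 * 100)
    upper = np.percentile(samples, (1.0 + coverage) / 2 * 100)
    return lower, upper

# Placeholder path; the real samples come from instrumenting the attention kernel.
# lo, hi = qk_max_range(np.load("qk_max_samples.npy"))
# unified_qk_max = hi  # pick the upper bound so overflow/rollback stays rare
```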
Goal:
[-] Comparing perf results (tokens/s): vanilla dAttn vs. unified qk_max dAttn
[O] Implement per-head shifting-window dAttention code (for ongoing patent writing)
[-] Comparing rollback counts: unified qk_max dAttn vs. shifting-window (phi) dAttn (for ongoing patent writing; see the sketch below)
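For the rollback comparison, a toy sketch of the unified-qk_max idea (in the spirit of FlashDecoding++'s asynchronized softmax with a unified max value) and how rollbacks could be counted. This is illustrative only, not the dAttention implementation, and the shifting-window variant is intentionally not sketched here.

```python
import numpy as np

def softmax_unified_max(scores, unified_max, margin=0.0):
    """Softmax that subtracts a fixed, precomputed max instead of the row max.

    Returns (probs, rolled_back). Because softmax is shift-invariant the result
    is still exact; the payoff in a kernel is skipping the max reduction. If a
    score exceeds unified_max + margin (risking exp overflow), roll back to the
    exact per-row max. (A real kernel would detect the overflow during
    accumulation rather than taking this explicit max.)
    """
    if scores.max() > unified_max + margin:
        exps = np.exp(scores - scores.max())   # rollback: recompute exactly
        return exps / exps.sum(), True
    exps = np.exp(scores - unified_max)
    return exps / exps.sum(), False

# Toy rollback counting over a workload:
# rollbacks = sum(softmax_unified_max(row, unified_qk_max)[1] for row in score_rows)
```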
FlashDecoding++ paper: https://arxiv.org/abs/2311.01282
Q3 Collaboration Plan of Infra and IaaS Labs: https://bytedance.us.larkoffice.com/docx/HKXfdRh1noMrbAxcgL2ureGasdQ
FlashDecoding++ Summary: https://bytedance.larkoffice.com/wiki/WbqXwRL3qi0x18kJVkAcU0HZnte, including Asynchronized Softmax, Double Buffering, Heuristic Dataflow, and more profiling-related optimizations.
A new KV cache memory layout and performance benchmarks are also needed.
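As a baseline for that layout discussion, here is my understanding of the current paged KV cache shapes (treat the exact dims and the packing factor as assumptions to verify against the vLLM source); a new layout would reorder or repack these dims and be benchmarked against this baseline.

```python
import numpy as np

# Illustrative sizes only.
num_blocks, num_kv_heads, head_size, block_size = 1024, 32, 128, 16
x = 8  # assumed innermost packing factor for coalesced key loads

# Key cache splits head_size into (head_size // x, x); value cache keeps head_size whole.
key_cache = np.zeros((num_blocks, num_kv_heads, head_size // x, block_size, x), dtype=np.float16)
value_cache = np.zeros((num_blocks, num_kv_heads, head_size, block_size), dtype=np.float16)
```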
Inference Improvement Weekly Status and Progress: https://bytedance.us.larkoffice.com/docx/RGnPdj5gfoBN3YxuY3yuLtlQsjb