Closed by chizhang118 2 weeks ago
Profiling heuristic dataflow: https://github.com/vllm-project/vllm/blob/df845b2b46c3e30f5bd3e3be286285ed148323fc/vllm/worker/worker.py#L248
Implemented flat GEMM and added heuristic dataflow support for small-batch decoding using flat GEMM.
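
To illustrate the idea, here is a minimal sketch of a profile-driven dispatch table for the heuristic dataflow. It is not the code in this change: the implementation names (`gemv`, `flat_gemm`, `gemm`), the helper functions, and the latency numbers in `EXAMPLE_PROFILE` are all hypothetical placeholders, not measurements.

```python
from typing import Dict

# Placeholder latencies (microseconds) per implementation and batch size M.
# These numbers are made up for illustration only; they are NOT measurements.
EXAMPLE_PROFILE: Dict[str, Dict[int, float]] = {
    "gemv":      {1: 1.0, 8: 6.0, 64: 40.0},
    "flat_gemm": {1: 2.0, 8: 3.0, 64: 20.0},
    "gemm":      {1: 5.0, 8: 5.0, 64: 10.0},
}


def build_dispatch_table(latency_us: Dict[str, Dict[int, float]]) -> Dict[int, str]:
    """For each profiled batch size, record the implementation with the lowest latency."""
    batch_sizes = sorted(next(iter(latency_us.values())).keys())
    return {
        m: min(latency_us, key=lambda impl: latency_us[impl][m])
        for m in batch_sizes
    }


def choose_impl(m: int, table: Dict[int, str]) -> str:
    """Use the entry for the largest profiled batch size <= m.

    Falls back to the smallest profiled batch size when m is below all of them.
    """
    candidates = [b for b in table if b <= m]
    key = max(candidates) if candidates else min(table)
    return table[key]


if __name__ == "__main__":
    table = build_dispatch_table(EXAMPLE_PROFILE)
    for m in (1, 4, 8, 16, 128):
        print(m, choose_impl(m, table))
```

The table is built once offline from profiling results, so the per-request cost at decode time is a single lookup on the batch size.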
Q3 Collaboration Plan of Infra and IaaS Labs: https://bytedance.us.larkoffice.com/docx/HKXfdRh1noMrbAxcgL2ureGasdQ
FlashDecoding++ Summary: https://bytedance.larkoffice.com/wiki/WbqXwRL3qi0x18kJVkAcU0HZnte. It covers Asynchronized Softmax, Double Buffering, Heuristic Dataflow, and further profiling-related optimizations.
The new KV cache memory layout and performance benchmarks are also needed.
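
As a rough illustration of the asynchronized softmax idea, the sketch below computes attention over chunks using one fixed reference value `phi` instead of a per-chunk running max, so partial results can simply be summed without rescaling. It uses scalar values and pure Python for readability; the function names and the choice of `phi` are assumptions for the example, and the overflow fallback (recomputing with the true max when scores are far from `phi`) is omitted.

```python
import math
from typing import Sequence, Tuple


def partial_softmax(
    scores: Sequence[float], values: Sequence[float], phi: float
) -> Tuple[float, float]:
    """Return (weighted sum, weight sum) for one chunk, using the fixed offset phi."""
    num = 0.0
    den = 0.0
    for s, v in zip(scores, values):
        w = math.exp(s - phi)  # same offset in every chunk, so no rescaling later
        num += w * v
        den += w
    return num, den


def attention_unified_max(
    scores: Sequence[float], values: Sequence[float], phi: float, chunk: int
) -> float:
    """Accumulate independent per-chunk partials, then normalize once at the end."""
    num = 0.0
    den = 0.0
    for start in range(0, len(scores), chunk):
        n, d = partial_softmax(
            scores[start:start + chunk], values[start:start + chunk], phi
        )
        num += n
        den += d
    return num / den


def attention_reference(scores: Sequence[float], values: Sequence[float]) -> float:
    """Standard softmax-weighted average using the true max for stability."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


if __name__ == "__main__":
    scores = [0.5, -1.0, 2.0, 0.0, 1.5]
    values = [1.0, 2.0, 3.0, 4.0, 5.0]
    print(attention_unified_max(scores, values, phi=1.0, chunk=2))
    print(attention_reference(scores, values))
```

Because the offset cancels between numerator and denominator, both functions agree up to floating-point error as long as `exp(s - phi)` neither overflows nor underflows.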
Inference Improvement Weekly Status and Progress: https://bytedance.us.larkoffice.com/docx/RGnPdj5gfoBN3YxuY3yuLtlQsjb