[Feature]: Support FlashDecoding++ heuristic dataflow and profiling optimization

chizhang118 commented 2 months ago

Q3 Collaboration Plan of Infra and IaaS Labs: https://bytedance.us.larkoffice.com/docx/HKXfdRh1noMrbAxcgL2ureGasdQ
FlashDecoding++ Summary: https://bytedance.larkoffice.com/wiki/WbqXwRL3qi0x18kJVkAcU0HZnte. It covers Asynchronized Softmax, Double Buffering, Heuristic Dataflow, and further profiling-related optimizations.
The new KV cache memory layout and performance benchmarks are also needed.
Inference Improvement Weekly Status and Progress: https://bytedance.us.larkoffice.com/docx/RGnPdj5gfoBN3YxuY3yuLtlQsjb
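
For readers who cannot open the summary link, here is a rough sketch of the idea behind the asynchronized softmax item listed above. It assumes the technique works by replacing the per-row running max with one fixed scaling value, so that chunks of attention scores can be reduced independently and summed without rescaling. The names `phi`, `bound`, and both function names are hypothetical, values are scalars for brevity, and the fallback simply reruns the standard max-subtracted softmax. This is not the planned kernel.

```python
import math

def softmax_weighted_sum_reference(scores, values):
    """Standard softmax-weighted sum using the true max for stability."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def softmax_weighted_sum_unified_max(scores, values, phi, chunk, bound=20.0):
    """Softmax-weighted sum using one shared scaling value `phi`.

    Each chunk computes its partial numerator and denominator on its own;
    the partials are simply added, with no rescaling between chunks.
    `phi` and `bound` are hypothetical tuning values (assumptions).
    """
    numerator = 0.0
    denominator = 0.0
    for start in range(0, len(scores), chunk):
        s_chunk = scores[start:start + chunk]
        v_chunk = values[start:start + chunk]
        # If any score is too far from phi, exp() could overflow or
        # underflow, so fall back to the max-subtracted reference path.
        if any(abs(s - phi) > bound for s in s_chunk):
            return softmax_weighted_sum_reference(scores, values)
        weights = [math.exp(s - phi) for s in s_chunk]
        numerator += sum(w * v for w, v in zip(weights, v_chunk))
        denominator += sum(weights)
    return numerator / denominator

scores = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
fast = softmax_weighted_sum_unified_max(scores, values, phi=1.0, chunk=2)
slow = softmax_weighted_sum_reference(scores, values)
```

Because the common factor `exp(-phi)` cancels between numerator and denominator, both paths give the same result whenever no fallback is triggered.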

chizhang118 commented 1 month ago

Implemented flat GEMM and added heuristic dataflow support for small-batch decoding using flat GEMM.
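
A minimal sketch of what a heuristic dataflow dispatch could look like, assuming it is a lookup of profiled inflection points on the batch dimension M for each weight shape (N, K). The kernel names, the table `INFLECTION_POINTS`, and all threshold and latency values below are hypothetical placeholders, not measurements or the actual implementation.

```python
# Hypothetical inflection points keyed by weight shape (N, K).
# Each value is (M1, M2): M < M1 uses a GEMV-style kernel,
# M1 <= M < M2 uses flat GEMM, and M >= M2 uses a conventional GEMM.
INFLECTION_POINTS = {
    (4096, 4096): (2, 16),
    (11008, 4096): (4, 32),
}
DEFAULT_POINTS = (2, 16)  # assumed fallback for shapes that were not profiled

def select_kernel(m, n, k):
    """Pick a kernel family for an (M x K) @ (K x N) product."""
    m1, m2 = INFLECTION_POINTS.get((n, k), DEFAULT_POINTS)
    if m < m1:
        return "gemv"
    if m < m2:
        return "flat_gemm"
    return "conventional_gemm"

def find_inflection_points(latency_us):
    """Derive (M1, M2) from profiled latencies.

    `latency_us` maps a kernel name to {M: latency}; all kernels are
    assumed to be profiled at the same set of M values.
    """
    ms = sorted(latency_us["gemv"])
    past_end = ms[-1] + 1  # returned when no crossover is observed
    m1 = next(
        (m for m in ms if latency_us["flat_gemm"][m] < latency_us["gemv"][m]),
        past_end,
    )
    m2 = next(
        (m for m in ms
         if m >= m1
         and latency_us["conventional_gemm"][m] < latency_us["flat_gemm"][m]),
        past_end,
    )
    return m1, m2

# Made-up latencies, only to exercise the helper above.
profile = {
    "gemv": {1: 10, 2: 20, 4: 40, 8: 80, 16: 160},
    "flat_gemm": {1: 15, 2: 18, 4: 22, 8: 30, 16: 50},
    "conventional_gemm": {1: 40, 2: 40, 4: 40, 8: 40, 16: 42},
}
points = find_inflection_points(profile)
```

With a table like this, the profiling step runs once offline per model and GPU, and the decode path only pays for a dictionary lookup and two comparisons per linear layer.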

bd-iaas-us / vllm