bd-iaas-us / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: Support FlashDecoding++ heuristic dataflow and profiling optimization #22

Closed: chizhang118 closed this issue 2 weeks ago

chizhang118 commented 2 months ago

Profiling heuristic dataflow: https://github.com/vllm-project/vllm/blob/df845b2b46c3e30f5bd3e3be286285ed148323fc/vllm/worker/worker.py#L248
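For context, the heuristic dataflow in FlashDecoding++ selects a GEMM implementation per problem shape based on profiling. A minimal sketch of how such a profiling pass could be wired into a warm-up step follows; `time_kernel`, `profile_heuristic_dataflow`, the candidate names, and the batch-size buckets are illustrative assumptions, not the actual vLLM worker or dAttention code:

```python
import time
import torch

def time_kernel(kernel, m, n=4096, k=4096, iters=20):
    """Time one candidate decode-GEMM kernel on an (m, k) x (k, n) problem."""
    a = torch.randn(m, k, device="cuda", dtype=torch.float16)
    b = torch.randn(k, n, device="cuda", dtype=torch.float16)
    kernel(a, b)                      # warm up / trigger any lazy compilation
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        kernel(a, b)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

def profile_heuristic_dataflow(candidate_kernels, batch_sizes=(1, 2, 4, 8, 16, 32, 64)):
    """Pick the fastest kernel per decode batch size; the result is the
    lookup table the runtime dispatch consults."""
    table = {}
    for m in batch_sizes:
        timings = {name: time_kernel(fn, m) for name, fn in candidate_kernels.items()}
        table[m] = min(timings, key=timings.get)
    return table

# Illustrative candidates: plain cuBLAS matmul vs. a hypothetical flat-GEMM kernel.
# candidates = {"cublas": torch.matmul, "flat_gemm": flat_gemm_fp16}
# dispatch_table = profile_heuristic_dataflow(candidates)
```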

chizhang118 commented 1 month ago

Implemented flat GEMM and added heuristic-dataflow support that uses flat GEMM for small-batch decoding (a rough dispatch sketch follows the link below).

https://code.byted.org/inf/dAttention/tree/feat/flat-gemm
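A rough sketch of what that dispatch could look like for the decode path, assuming a `flat_gemm` kernel and a batch-size cutoff taken from the profiling table above; the function name and the threshold of 64 are assumptions for illustration, not the dAttention implementation:

```python
import torch

FLAT_GEMM_MAX_BATCH = 64  # assumed cutoff; in practice chosen from the profiling table

def decode_gemm(x: torch.Tensor, w: torch.Tensor, flat_gemm=None) -> torch.Tensor:
    """Heuristic dataflow for a decode-phase projection: use flat GEMM when the
    batch (M) dimension is small, otherwise fall back to cuBLAS via torch.matmul."""
    m = x.shape[0]
    if flat_gemm is not None and m <= FLAT_GEMM_MAX_BATCH:
        # Flat GEMM targets the "flat" shape of decoding (tiny M, large N and K),
        # tiling along N so the GPU stays occupied despite the small batch.
        return flat_gemm(x, w)
    return torch.matmul(x, w)
```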

chizhang118 commented 1 month ago

https://code.byted.org/inf/dAttention/merge_requests/13

chizhang118 commented 1 month ago

Code merged in https://code.byted.org/inf/dAttention/merge_requests/18.