InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Feature] long context inference optimization #1879

Open zhyncs opened 4 days ago

zhyncs commented 4 days ago

Motivation

This is an interesting blog post, FireAttention V2: 12x faster to make Long Contexts practical for Online Inference, though it reveals hardly any technical details. Judging from the benchmark results, in the long-context scenario with Qwen 2 72B on H100 with FP8 enabled, its performance is far ahead of vLLM. The load-testing tool https://github.com/fw-ai/benchmark provided by Fireworks AI is also a useful reference for us. At the moment, because of the ban on H100 sales in mainland China, developers here lack a suitable environment for development and benchmarking. Still, judging from the results in that blog, supporting FP8 seems very necessary, and so does optimizing long-context inference. @lzhangzz @grimoire @lvhan028

Related resources

No response

Additional context

No response

lvhan028 commented 3 days ago

You can't imagine how much we want to experiment with FP8. 🥺

lzhangzz commented 3 days ago

The gap is likely due to Hopper's async WGMMA capability, which is required to achieve peak MMA throughput on H100.
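
For context, `wgmma.mma_async` (warpgroup-level async MMA) only exists on sm_90a, so any kernel that wants it has to branch between a Hopper path and a pre-Hopper `mma.sync` path. Below is a minimal sketch of that gating only; it is not lmdeploy code, and the real wgmma issue sequence (shared-memory matrix descriptors plus `wgmma.fence` / `wgmma.commit_group` / `wgmma.wait_group`) is deliberately omitted.

```cuda
// Sketch: compile-time and runtime gating of a Hopper-only MMA path.
// The kernel body is placeholder comments, not a working GEMM.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void gemm_kernel_dispatch()
{
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 900
    // sm_90a path: warpgroup async MMA (wgmma.mma_async) would be issued
    // here; this is what is needed to reach peak tensor-core throughput
    // on H100.
#else
    // Pre-Hopper path: classic Ampere-style mma.sync tensor-core MMA.
#endif
}

int main()
{
    cudaDeviceProp prop{};
    cudaGetDeviceProperties(&prop, 0);
    const bool is_hopper = prop.major >= 9;
    std::printf("SM %d.%d -> %s MMA path\n", prop.major, prop.minor,
                is_hopper ? "async WGMMA (Hopper)" : "mma.sync (fallback)");
    gemm_kernel_dispatch<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
```

Note that the Hopper branch is only compiled in when targeting the architecture-specific feature set, e.g. `nvcc -arch=sm_90a`, which is part of why this path cannot simply be backported to A100-class hardware.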