InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Feature] long context inference optimization #1879

Open · zhyncs opened 5 months ago

zhyncs commented 5 months ago

Motivation

There is an interesting blog post, "FireAttention V2: 12x faster to make Long Contexts practical for Online Inference", which reveals hardly any technical details. Judging from the benchmark results, in the long-context scenario with Qwen2 72B on H100 with FP8 enabled, its performance is far ahead of vLLM's. The load-testing tool https://github.com/fw-ai/benchmark provided by Fireworks AI is also a useful reference for us. Currently, because H100 sales are banned in mainland China, developers here lack a suitable environment for development and benchmarking. Still, judging from the results in that blog post, supporting FP8 seems very necessary, and so does optimizing long-context inference. @lzhangzz @grimoire @lvhan028
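As a rough illustration of what "enabling FP8" means at the weight level, here is a minimal per-tensor E4M3 quantization sketch. It assumes PyTorch >= 2.1 (which exposes `torch.float8_e4m3fn`); the scaling scheme and helper names are my own for illustration, not FireAttention's or LMDeploy's implementation.

```python
import torch

FP8_E4M3_MAX = 448.0  # largest finite value representable in the E4M3 format

def quantize_fp8(w: torch.Tensor):
    """Per-tensor symmetric quantization of weights to FP8 E4M3 (sketch)."""
    # Scale so the largest-magnitude weight maps onto the FP8 dynamic range.
    scale = w.abs().amax().float() / FP8_E4M3_MAX
    w_fp8 = (w.float() / scale).to(torch.float8_e4m3fn)
    return w_fp8, scale

def dequantize_fp8(w_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an FP16 approximation of the original weights."""
    return (w_fp8.float() * scale).half()

w = torch.randn(1024, 1024, dtype=torch.float16)
w_fp8, scale = quantize_fp8(w)
err = (dequantize_fp8(w_fp8, scale) - w).abs().max()
print(f"max abs quantization error: {err.item():.4f}")
```

In practice the speedup on H100 comes from running the matmuls natively in FP8 on Hopper tensor cores rather than dequantizing first, but the storage and scaling idea is the same.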

Related resources

No response

Additional context

No response

lvhan028 commented 5 months ago

You can't imagine how much we want to experiment with FP8. 🥺

lzhangzz commented 5 months ago

The gap is likely to be caused by the Hopper async WGMMA capability, which is necessary to achieve peak MMA throughput on H100.

zhyncs commented 4 months ago

Ref: https://github.com/OpenBMB/InfiniteBench/blob/main/data/construct_synthetic_dataset.py — a useful tool for generating long-context datasets, inspired by https://zhuanlan.zhihu.com/p/708441783. cc @hijkzzz
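For readers who have not looked at that script: the core idea of such synthetic long-context benchmarks is to bury a retrievable fact inside a very long filler context. A minimal sketch along those lines follows; the filler text, field names, and passkey task here are illustrative, not the exact format the InfiniteBench script emits.

```python
import json
import random

# Illustrative sketch of a synthetic passkey-retrieval sample for long-context
# evaluation; the real InfiniteBench generator has its own tasks and format.
FILLER = "The grass is green. The sky is blue. The sun is yellow. "

def make_passkey_sample(target_words: int = 100_000) -> dict:
    key = str(random.randint(10_000, 99_999))
    needle = f"The pass key is {key}. Remember it. "
    n_chunks = max(1, target_words // len(FILLER.split()))
    chunks = [FILLER] * n_chunks
    # Hide the needle at a random position inside the filler text.
    chunks.insert(random.randint(0, n_chunks), needle)
    return {
        "context": "".join(chunks),
        "question": "What is the pass key?",
        "answer": key,
    }

if __name__ == "__main__":
    sample = make_passkey_sample(target_words=200)
    print(json.dumps(sample, indent=2)[:300])
```

Scaling `target_words` up lets you probe how retrieval accuracy degrades as the context approaches the model's limit, which is exactly the regime the long-context optimizations in this issue target.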