You can't imagine how much we want to experiment with FP8. 🥺
The gap is likely caused by Hopper's async WGMMA capability, which is required to reach peak MMA throughput on H100.
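As a starting point for the FP8 experiments, here is a minimal micro-benchmark sketch comparing FP16 and FP8 GEMM throughput on Hopper. It assumes PyTorch >= 2.2 on an H100; `torch._scaled_mm` is a private API whose signature has changed across releases, so treat this as illustrative only, not a stable recipe.

```python
import torch

# Illustrative micro-benchmark: FP16 vs. FP8 GEMM throughput on Hopper.
# torch._scaled_mm is a private PyTorch API (>= 2.2); its signature has
# varied across releases, so this is a sketch rather than a fixed recipe.

def bench_ms(fn, iters=50):
    fn()  # warmup
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per GEMM

M = N = K = 8192
a16 = torch.randn(M, K, device="cuda", dtype=torch.float16)
b16 = torch.randn(K, N, device="cuda", dtype=torch.float16)

# _scaled_mm expects a row-major A and a column-major B in float8_e4m3fn.
a8 = a16.to(torch.float8_e4m3fn)
b8 = b16.t().contiguous().t().to(torch.float8_e4m3fn)
scale = torch.ones((), device="cuda")  # per-tensor scale of 1.0 for simplicity

t16 = bench_ms(lambda: a16 @ b16)
t8 = bench_ms(lambda: torch._scaled_mm(a8, b8, scale_a=scale, scale_b=scale,
                                       out_dtype=torch.float16))

flops = 2 * M * N * K
print(f"fp16: {flops / (t16 * 1e9):.1f} TFLOPS")
print(f"fp8 : {flops / (t8 * 1e9):.1f} TFLOPS")
```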
ref https://github.com/OpenBMB/InfiniteBench/blob/main/data/construct_synthetic_dataset.py It's a useful script for generating long-context datasets, inspired by https://zhuanlan.zhihu.com/p/708441783. cc @hijkzzz
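For anyone without that script handy, here is a rough sketch of the same idea: hide a "needle" fact inside long filler text and ask the model to retrieve it. This is my own simplification, not the InfiniteBench implementation, and it approximates lengths in words rather than tokens.

```python
import json
import random

# Rough sketch of synthetic long-context data generation (not the
# InfiniteBench script): hide a "needle" fact inside filler text and
# ask the model to retrieve it. Lengths are in words, not tokens.
FILLER = "The grass is green. The sky is blue. The sun is bright. "

def make_sample(context_words: int = 100_000) -> dict:
    needle = f"The magic number is {random.randint(0, 999_999)}."
    repeats = context_words // len(FILLER.split())
    haystack = FILLER * repeats
    insert_at = random.randrange(len(haystack))
    context = haystack[:insert_at] + needle + " " + haystack[insert_at:]
    return {
        "context": context,
        "question": "What is the magic number mentioned in the text?",
        "answer": needle.rstrip(".").split()[-1],
    }

if __name__ == "__main__":
    with open("synthetic_longctx.jsonl", "w") as f:
        for _ in range(10):
            f.write(json.dumps(make_sample(50_000)) + "\n")
```

Varying `context_words` across samples makes it easy to probe where serving throughput falls off as the context grows.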
Motivation
There is an interesting blog post, FireAttention V2: 12x faster to make Long Contexts practical for Online Inference, which reveals hardly any technical details. Judging from its benchmark results, in the long-context scenario with Qwen2 72B on H100 with FP8 enabled, its performance is far ahead of vLLM. The load-testing tool https://github.com/fw-ai/benchmark provided by Fireworks AI is also a meaningful reference for us. Currently, due to the ban on H100 sales in mainland China, developers there lack a suitable environment for development and benchmarking. But judging from the results in that blog, supporting FP8 seems very necessary, and optimizing long-context inference is equally important. @lzhangzz @grimoire @lvhan028
Related resources
No response
Additional context
No response