Motivation
This is an interesting blog post: FireAttention V2: 12x faster to make Long Contexts practical for Online Inference. It reveals few technical details, but judging from the benchmark results, in the long-context scenario with Qwen 2 72B on H100 with fp8 enabled, its performance is far ahead of vLLM. The load-testing tool provided by Fireworks AI, https://github.com/fw-ai/benchmark, is also a useful reference for us. At the moment, because H100 sales are banned in mainland China, developers lack a suitable environment for development and benchmarking. Still, based on the results in that blog post, supporting fp8 and optimizing long-context inference both look very necessary. @lzhangzz @grimoire @lvhan028
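For context on what "fp8 support" implies numerically, here is a minimal sketch (not LMDeploy's or FireAttention's actual implementation, just an assumption-level illustration) of per-tensor fp8 e4m3 weight quantization with PyTorch's `torch.float8_e4m3fn` dtype:

```python
import torch

# Minimal sketch, assuming per-tensor scaling into fp8 e4m3.
# This is only an illustration of the numeric format, not a kernel design.
def quantize_fp8_e4m3(w: torch.Tensor):
    finfo = torch.finfo(torch.float8_e4m3fn)
    scale = w.abs().max().clamp(min=1e-12) / finfo.max   # map the max |w| to the fp8 range
    w_fp8 = (w / scale).clamp(finfo.min, finfo.max).to(torch.float8_e4m3fn)
    return w_fp8, scale

def dequantize_fp8(w_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Upcast back to fp16 and undo the scale for a reference comparison.
    return w_fp8.to(torch.float16) * scale

w = torch.randn(4096, 4096, dtype=torch.float16)
w_fp8, scale = quantize_fp8_e4m3(w)
err = (dequantize_fp8(w_fp8, scale) - w).abs().mean()
print(f"mean abs quantization error: {err:.5f}")
```

The point is that weights (and potentially KV cache) stored in fp8 halve memory traffic versus fp16, which is exactly where long-context decoding is bottlenecked; the actual speedup depends on fp8 GEMM/attention kernels on H100-class hardware.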
Related resources
No response
Additional context
No response