Here is a summary of some relevant links.
blog https://www.bentoml.com/blog/benchmarking-llm-inference-backends
benchmark script (vLLM) https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_serving.py
benchmark script (fw-ai) https://github.com/fw-ai/benchmark/blob/main/llm_bench/load_test.py
SOTA AWQ Kernel https://github.com/InternLM/lmdeploy/pull/202
Automatic Prefix Caching https://github.com/InternLM/lmdeploy/pull/1450
Online KV Cache Quant https://github.com/InternLM/lmdeploy/pull/1377
TurboMind Attention https://github.com/InternLM/lmdeploy/pull/1116
GLM-4-9B-Chat support https://github.com/InternLM/lmdeploy/pull/1724
Optimizations to come (GEMM optimization, MoE support in TurboMind) https://github.com/InternLM/lmdeploy/issues/1970#issuecomment-2217138900
Is there a more comprehensive review comparing models like qwen2-7b, glm4-9b, yi1.5-9b, and internlm2.5-7b?
Sorry, as this is an internal test, I cannot provide specific data on model performance comparisons. You can conduct your own validation based on your specific scenario.
I benchmarked the SOTA model under 10B parameters, GLM-4-9B-Chat, on a single A100 80G. The benchmark script is the one from vLLM, run with 1000 prompts. Both LMDeploy and vLLM use their latest versions.
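For readers unfamiliar with the vLLM serving benchmark: it essentially replays a set of prompts against an OpenAI-compatible endpoint at a controlled request rate and records per-request latency. Below is a minimal Python sketch of that idea, not the actual script; it assumes a server (vLLM or LMDeploy) is already listening at http://localhost:8000 with an OpenAI-compatible /v1/chat/completions route, and the URL, model name, prompt list, and request rate are placeholders rather than the real benchmark settings.

```python
import asyncio
import random
import time

import aiohttp

API_URL = "http://localhost:8000/v1/chat/completions"  # assumed OpenAI-compatible endpoint
MODEL = "glm-4-9b-chat"                                 # placeholder model name
PROMPTS = ["Hello, who are you?"] * 32                  # stand-in for the 1000-prompt dataset
REQUEST_RATE = 4.0                                      # requests per second (Poisson arrivals)


async def send_one(session: aiohttp.ClientSession, prompt: str) -> float:
    """Send one chat request and return its end-to-end latency in seconds."""
    payload = {"model": MODEL, "messages": [{"role": "user", "content": prompt}]}
    start = time.perf_counter()
    async with session.post(API_URL, json=payload) as resp:
        await resp.json()
    return time.perf_counter() - start


async def main() -> None:
    async with aiohttp.ClientSession() as session:
        tasks = []
        for prompt in PROMPTS:
            tasks.append(asyncio.create_task(send_one(session, prompt)))
            # Exponential inter-arrival times approximate a Poisson request stream.
            await asyncio.sleep(random.expovariate(REQUEST_RATE))
        latencies = await asyncio.gather(*tasks)
    print(f"mean latency: {sum(latencies) / len(latencies):.3f}s over {len(latencies)} requests")


if __name__ == "__main__":
    asyncio.run(main())
```

Sweeping REQUEST_RATE over several values is what produces the "performance under various request rates" curves discussed below.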
Qwen2-72B-Instruct, TP 4
I don't plan to submit a PR with the images from this benchmark for now. I think we've met our goal of understanding how LMDeploy and vLLM perform under various request rates, and this should be a useful reference for other users interested in these two SOTA Chinese models. I'll close this issue for now; it can be reopened or discussed further if necessary.
Motivation
As of July 2024, among open-source LLMs with native CJK support, GLM-4-9B-Chat performs best among models with fewer than 10B parameters, while Qwen2-72B-Instruct stands out among larger models. These conclusions are based on my current research, and both models are widely used by several large internet companies in China.
Both models are currently well supported by TurboMind, LMDeploy's excellent inference backend. TurboMind also composes its features well, supporting simultaneous use of AWQ, KV Cache Quant, and Automatic Prefix Caching (see the sketch below), and its compatibility with NVIDIA GPU drivers is relatively good, covering R470 through R535.
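To illustrate what "simultaneous use" looks like in practice, here is a sketch of enabling AWQ weights, online KV cache quantization, and prefix caching together through LMDeploy's Python API. The model path is a placeholder and exact parameter names may differ between LMDeploy versions, so treat this as an assumption rather than a canonical configuration.

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# TurboMind engine settings combining AWQ weights, online KV cache quantization,
# and automatic prefix caching (parameter names as of recent LMDeploy releases).
engine_config = TurbomindEngineConfig(
    model_format="awq",          # load 4-bit AWQ-quantized weights
    quant_policy=8,              # online int8 KV cache quantization
    enable_prefix_caching=True,  # automatic prefix caching
    tp=1,                        # tensor parallelism; e.g. tp=4 for Qwen2-72B-Instruct
)

# Placeholder model path; point it at an AWQ checkpoint of GLM-4-9B-Chat.
pipe = pipeline("path/to/glm-4-9b-chat-awq", backend_config=engine_config)
print(pipe(["Hello, introduce yourself briefly."]))
```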
vLLM is currently the most popular open-source inference framework and a very good point of comparison. I plan to use the vLLM benchmark script to benchmark both vLLM and LMDeploy on the two models mentioned above.
I believe this performance data will be very meaningful, because many teams at large Chinese companies run SFT on open-source SOTA models and deploy them online. They often have very high performance requirements and usually deploy at a considerably larger scale than small companies or teams. GLM-4-9B-Chat is better suited to lower-cost scenarios, while Qwen2-72B-Instruct is better suited to scenarios with higher quality requirements.
This comparison will be a valuable reference for their technology selection and technical roadmap, and it will also help more teams learn about LMDeploy.
I plan to complete the relevant benchmarks by the middle of this month and submit a corresponding PR. Do you have any suggestions? @lvhan028 @lzhangzz @irexyc @AllentDan @grimoire @RunningLeon
Please stay tuned. Cheers.