InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Benchmark] TurboMind benchmark with GLM-4-9B-Chat and Qwen2-72B-Instruct vs vLLM #1974

Closed by zhyncs 1 month ago

zhyncs commented 1 month ago

Motivation

As of July 2024, among open-source LLMs with native CJK support, GLM-4-9B-Chat performs best among models with fewer than 10B parameters, and Qwen2-72B-Instruct stands out among larger models. These conclusions are based on my current research, and both models are widely used by several large internet companies in China.

Both models are well supported by TurboMind, LMDeploy's inference backend. TurboMind has excellent feature compatibility, supporting simultaneous use of AWQ, KV Cache Quant, and Automatic Prefix Cache, and it works with a wide range of NVIDIA GPU drivers, from R470 to R535.
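
As a quick illustration (not taken from this issue), here is a minimal sketch of enabling those three features together through LMDeploy's Python API. The parameter names follow the lmdeploy docs as I understand them, and the model path is a placeholder:

```python
# Minimal sketch: AWQ weights + online KV cache quant + automatic prefix cache,
# all on the TurboMind backend. Verify parameter names against the lmdeploy
# version you install; the model path below is a placeholder.
from lmdeploy import pipeline, TurbomindEngineConfig

engine_config = TurbomindEngineConfig(
    model_format='awq',          # load 4-bit AWQ weights
    quant_policy=8,              # online KV cache quantization (8 = int8, 4 = int4)
    enable_prefix_caching=True,  # automatic prefix cache
    tp=1,                        # single GPU
)

pipe = pipeline('path/to/your-awq-quantized-model', backend_config=engine_config)
print(pipe(['Hello, who are you?']))
```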

vLLM is currently the most popular open-source inference framework and makes a very good point of comparison. I plan to use the vLLM benchmark script to benchmark both vLLM and LMDeploy separately on the two models mentioned above.

I believe this performance data will be very meaningful, because many teams at large domestic companies currently do SFT on top of open-source SOTA models and deploy them online. These teams often have very high performance requirements and usually deploy at a scale one level above small companies or teams. GLM-4-9B-Chat is better suited to lower-cost scenarios, while Qwen2-72B-Instruct is better suited to scenarios with higher quality requirements.

This comparison will be a valuable reference for their technology selection, and at the same time it will help more teams learn about LMDeploy.

I plan to complete the relevant benchmarks by the middle of this month and to submit a corresponding PR. Do you have any suggestions? @lvhan028 @lzhangzz @irexyc @AllentDan @grimoire @RunningLeon

Please stay tuned. Cheers.


zhyncs commented 1 month ago

Here is a summary of some relevant links.

blog https://www.bentoml.com/blog/benchmarking-llm-inference-backends

benchmark scripts
https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_serving.py
https://github.com/fw-ai/benchmark/blob/main/llm_bench/load_test.py

SOTA AWQ Kernel https://github.com/InternLM/lmdeploy/pull/202

Automatic Prefix Caching https://github.com/InternLM/lmdeploy/pull/1450

Online KV Cache Quant https://github.com/InternLM/lmdeploy/pull/1377

TurboMind Attention https://github.com/InternLM/lmdeploy/pull/1116

GLM-4-9B-Chat support https://github.com/InternLM/lmdeploy/pull/1724

Optimizations to come (GEMM optimization, MoE support in TurboMind) https://github.com/InternLM/lmdeploy/issues/1970#issuecomment-2217138900

Tendo33 commented 1 month ago

Is there a more comprehensive review comparing models like Qwen2-7B, GLM-4-9B, Yi-1.5-9B, and InternLM2.5-7B?

zhyncs commented 1 month ago

Is there a more comprehensive review comparing models like Qwen2-7B, GLM-4-9B, Yi-1.5-9B, and InternLM2.5-7B?

Sorry, as this is an internal test, I cannot provide specific data on model performance comparisons. You can also conduct your own validation based on your specific scenario.

zhyncs commented 1 month ago

Testing the SOTA model under 10B parameters, GLM-4-9B-Chat, on a single A100 80G. The benchmark script is the one from vLLM, run with 1000 prompts. Both LMDeploy and vLLM are on their latest versions.
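
For reference, this is roughly how the script can be driven across request rates against an already-running server; the exact flag names depend on the benchmark_serving.py revision, so treat them as assumptions and check --help first:

```python
# Rough sketch of sweeping request rates with vLLM's benchmarks/benchmark_serving.py
# against a server that is already running. Flag names vary between revisions of
# the script, so confirm them with `python benchmark_serving.py --help`.
import subprocess

# requests per second; "inf" sends all prompts at once
REQUEST_RATES = ["1", "2", "4", "8", "16", "inf"]

for rate in REQUEST_RATES:
    subprocess.run(
        [
            "python", "benchmark_serving.py",
            "--backend", "lmdeploy",       # or "vllm" when benchmarking the vLLM server
            "--host", "127.0.0.1",
            "--port", "23333",             # lmdeploy api_server's default port
            "--model", "THUDM/glm-4-9b-chat",
            "--dataset-name", "sharegpt",
            "--dataset-path", "ShareGPT_V3_unfiltered_cleaned_split.json",
            "--num-prompts", "1000",
            "--request-rate", rate,
        ],
        check=True,
    )
```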

zhyncs commented 1 month ago

Qwen2-72B-Instruct, TP 4
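
For completeness, here is a sketch of how the two servers for this TP=4 run might be launched; the flags are from the lmdeploy and vLLM docs as I recall them, so double-check against the versions you install:

```python
# Sketch: launch the OpenAI-compatible servers compared in the TP=4 run.
# Each server needs all four GPUs, so start only one backend at a time.
import subprocess

MODEL = "Qwen/Qwen2-72B-Instruct"

SERVER_CMDS = {
    # LMDeploy (TurboMind) server, default port 23333
    "lmdeploy": ["lmdeploy", "serve", "api_server", MODEL,
                 "--tp", "4", "--server-port", "23333"],
    # vLLM OpenAI-compatible server on port 8000
    "vllm": ["python", "-m", "vllm.entrypoints.openai.api_server",
             "--model", MODEL, "--tensor-parallel-size", "4", "--port", "8000"],
}

backend = "lmdeploy"  # or "vllm"
subprocess.run(SERVER_CMDS[backend], check=True)
```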

zhyncs commented 1 month ago

I don't intend to submit a PR with the images from this benchmark for now. I think we've met our goal of understanding how LMDeploy and vLLM perform under various request rates, and this should be a useful reference for other users interested in these two SOTA Chinese models. I'll close this issue for now; it can be reopened or discussed further if necessary.