InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0
4.34k stars · 390 forks

[Feature] A series of optimization points #1647

Open · zhyncs opened this issue 4 months ago

zhyncs commented 4 months ago

Motivation

When we use LMDeploy for serving, throughput matters, but we place even more emphasis on throughput under latency constraints at different QPS levels. This is a performance metric close to real online scenarios, because serving is an online system. For example, when QPS is 2, what are the first token latency (FTL) and per-token latency (PTL)? When QPS is 4, what are they? And similarly for 2->4->8->16->32.
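To make the two metrics concrete, here is a minimal sketch of measuring FTL and PTL for a single streamed request. It assumes an OpenAI-compatible streaming endpoint at a placeholder URL and roughly one token per streamed chunk; the URL, model name, and payload fields are illustrative, not lmdeploy's exact interface.

```python
# Sketch: measure first token latency (FTL) and per-token latency (PTL)
# for one streamed completion request. Endpoint, model name, and payload
# fields are placeholders.
import time

import requests


def measure_one_request(prompt: str, max_tokens: int = 128) -> dict:
    url = "http://localhost:23333/v1/completions"  # placeholder endpoint
    payload = {
        "model": "placeholder-model",
        "prompt": prompt,
        "max_tokens": max_tokens,
        "stream": True,
    }
    token_times = []
    start = time.perf_counter()
    with requests.post(url, json=payload, stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line or not line.startswith(b"data:"):
                continue
            if line.strip() == b"data: [DONE]":
                break
            # assume each SSE chunk carries roughly one generated token
            token_times.append(time.perf_counter())
    ftl = token_times[0] - start  # time to first token
    # PTL: mean gap between consecutive streamed tokens
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    ptl = sum(gaps) / len(gaps) if gaps else 0.0
    return {"ftl_s": ftl, "ptl_s": ptl, "tokens": len(token_times)}
```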

Currently, both https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_serving.py and https://github.com/InternLM/lmdeploy/blob/main/benchmark/profile_restful_api.py can measure latency and throughput at a given QPS, but covering several QPS levels means rerunning the benchmark many times. Two follow-ups are worth doing: first, in subsequent statistics and comparisons, attach the latency and throughput under different QPS; second, automate this testing process with a reporting approach similar to https://github.com/vllm-project/dashboard.
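A minimal sketch of such an automated sweep follows: run the same benchmark once per target QPS and collect the results into one table. The command line is modeled loosely on vLLM's benchmark_serving.py; the script path, flags, and output schema should be treated as assumptions that may differ between versions and between the two benchmark scripts.

```python
# Sketch: sweep request rates and aggregate per-QPS results.
# The benchmark command and its flags are placeholders.
import json
import subprocess

QPS_LEVELS = [2, 4, 8, 16, 32]
results = []

for qps in QPS_LEVELS:
    out_file = f"bench_qps{qps}.json"
    cmd = [
        "python", "benchmarks/benchmark_serving.py",  # placeholder script path
        "--backend", "openai",
        "--request-rate", str(qps),                   # target arrival rate (QPS)
        "--num-prompts", "1000",
        "--save-result", "--result-filename", out_file,
    ]
    subprocess.run(cmd, check=True)
    with open(out_file) as f:
        metrics = json.load(f)                        # schema depends on the script
    results.append({"qps": qps, **metrics})

# One row per QPS: throughput plus latency percentiles under that load.
for row in results:
    print(row)
```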

In addition, observability of engine performance is also very important; see, for example, https://github.com/friendliai/LLMServingPerfEvaluator. This was also mentioned earlier in https://github.com/InternLM/lmdeploy/issues/1510#issue-2267061847.
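As an illustration of the kind of engine-level observability meant here, the sketch below exports serving metrics in Prometheus format so they can be scraped and dashboarded. Metric names and the hook that updates them are hypothetical, not lmdeploy's actual instrumentation.

```python
# Sketch: expose serving metrics via prometheus_client.
# Metric names and the record_request hook are illustrative.
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

FIRST_TOKEN_LATENCY = Histogram(
    "llm_first_token_latency_seconds",
    "Time from request arrival to first generated token",
    buckets=(0.05, 0.1, 0.25, 0.5, 1.0, 2.0, 5.0),
)
PER_TOKEN_LATENCY = Histogram(
    "llm_per_token_latency_seconds",
    "Mean gap between consecutive generated tokens for a request",
)
RUNNING_REQUESTS = Gauge(
    "llm_running_requests", "Requests currently being decoded"  # scheduler would set this (not shown)
)
GENERATED_TOKENS = Counter(
    "llm_generated_tokens_total", "Total number of generated tokens"
)


def record_request(ftl_s: float, ptl_s: float, num_tokens: int) -> None:
    """Hypothetical hook called once per finished request by the serving loop."""
    FIRST_TOKEN_LATENCY.observe(ftl_s)
    PER_TOKEN_LATENCY.observe(ptl_s)
    GENERATED_TOKENS.inc(num_tokens)


if __name__ == "__main__":
    start_http_server(8000)  # metrics served at http://localhost:8000/metrics
    while True:
        time.sleep(1)
```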

As a high-performance LLM inference framework, LMDeploy performs exceptionally well on Llama 2, Llama 3, and similar dense models. However, there are still significant shortcomings in monitoring, performance observability, MoE optimization (https://docs.google.com/presentation/d/1A--47JAK4BJ39t954HyTkvtfwn0fkqtsL8NGFuslReM/edit#slide=id.g2c91788b9bf_90_605), and model support. All of these are essential for a mature, stable, industrial-grade production LLM serving system.

Due to a shift in the team's focus and the departure of some members, we may only be able to contribute in our spare time. We hope the community will seriously consider the optimization points mentioned above. Thanks.

@lvhan028 @lzhangzz @grimoire @irexyc @AllentDan

Related resources

No response

Additional context

No response

lvhan028 commented 4 months ago

Hi @zhyncs, thank you very much for your team's great contributions to LMDeploy. We really appreciate your kind sharing and insightful suggestions.

Regarding the benchmark metrics and the observability of engine performance, we'll learn from the outstanding projects you mentioned.

The project's performance is our top concern, perhaps above everything else. MoE optimization will start next month. We'll try our best.

zhyncs commented 4 months ago

Thanks for your reply! Due to personal plans, I intend to work and live abroad and plan to resign in July. Several of my colleagues have also recently left Beijing for family reasons, so manpower for LLM inference has become scarce. At the same time, the team is exploring scaling laws in the CTR (click-through rate) domain, aiming to bring LLM-related techniques into the recommendation system to improve the effectiveness of the CTR model; Meta has already achieved some phased results in similar work, so manpower will also be invested in the CTR scaling-law project. I have high hopes for the future development of LMDeploy. You are an excellent team of engineers and this is an outstanding project. Cheers!

zhyncs commented 4 months ago

Before I resign, I will sync the latest updates on the Medusa TreeMask implementation. The basic functionality is complete, and some stability issues are being fixed. Please stay tuned.
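For readers unfamiliar with the term, here is a minimal sketch of the tree attention mask idea behind Medusa-style speculative decoding in general, not lmdeploy's implementation. Candidate tokens are flattened into one sequence, and each candidate may attend only to itself and its ancestors in the candidate tree, so sibling branches cannot see each other during the single verification forward pass.

```python
# Sketch: build a tree attention mask from parent indices of flattened
# candidate tokens (general Medusa-style idea, names are illustrative).
import torch


def build_tree_attention_mask(parents: list[int]) -> torch.Tensor:
    """parents[i] is the index of node i's parent in the flattened tree,
    or -1 if node i hangs directly off the last verified token."""
    n = len(parents)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        mask[i, i] = True      # a node always attends to itself
        p = parents[i]
        while p != -1:         # walk up to the root, marking ancestors
            mask[i, p] = True
            p = parents[p]
    return mask


# Example: nodes 0 and 1 are depth-1 candidates; nodes 2 and 3 are their
# respective depth-2 children, so node 2 sees node 0 and node 3 sees node 1.
print(build_tree_attention_mask([-1, -1, 0, 1]).int())
```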

vody-am commented 4 months ago

Perhaps this belongs in a separate issue, but I'd like to chime in that I could help with measurement and performance work if some issues are created and/or a discussion is available publicly. I am very interested in using lmdeploy to serve vision-language models, and throughput is my top concern (latency has been good, but there may be room to improve throughput).

lvhan028 commented 4 months ago

@vody-am that's warmly welcomed

lvhan028 commented 4 months ago

@vody-am could you help review https://github.com/InternLM/lmdeploy/pull/1662? It is about the VL serving benchmark. Any ideas you have can be raised and discussed in that PR.