Hi @zhyncs, thanks for your suggestion. We discussed some ideas the other day. Our plan is to develop a kernel package that can be shared by the TurboMind and PyTorch engines. After that, we might work on its integration into the upstream repository.
Hi @lvhan028, thank you for your reply. Does this mean that, in the future, the PyTorch Engine will be able to use kernels developed with Triton as well as kernels from TurboMind? If so, does it imply that the PyTorch Engine will also provide a similar abstraction for choosing among different kernel implementations? If that's the case, it should be easy to integrate other backends such as FlashInfer in the future. I'm not sure whether I understand this correctly.
We haven't discussed the details yet. Let us do more investigation, and we'll let you know if there is any update.
Hi @lvhan028, we have an internal plan to implement prefix caching in LMDeploy. The SOTA implementation we are currently aware of is FlashInfer's cascade inference (https://flashinfer.ai/2024/02/02/cascade-inference.html), so if we can integrate it, I think it will save a lot of work.
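To make the prefix-caching idea concrete, here is a minimal, framework-agnostic sketch of block-level prefix reuse. The class name, the fixed block size, and the hash-keyed index are assumptions for illustration only, not LMDeploy's or FlashInfer's actual API.

```python
# Conceptual sketch of block-level prefix caching (illustrative only; names and
# the block size are assumptions, not an existing LMDeploy/FlashInfer API).
from typing import Dict, List, Tuple

BLOCK_SIZE = 64  # tokens per KV-cache block (assumed value)


class PrefixCache:
    """Map token-id prefixes (at block granularity) to KV-cache block ids so a
    new request can reuse the KV cache of a shared prefix instead of
    recomputing it."""

    def __init__(self) -> None:
        # key: tuple of all token ids up to and including a block boundary
        self._blocks: Dict[Tuple[int, ...], int] = {}

    def match(self, token_ids: List[int]) -> Tuple[List[int], int]:
        """Return (reused_block_ids, num_matched_tokens) for the longest cached prefix."""
        reused: List[int] = []
        matched = 0
        for i in range(len(token_ids) // BLOCK_SIZE):
            key = tuple(token_ids[: (i + 1) * BLOCK_SIZE])
            block_id = self._blocks.get(key)
            if block_id is None:
                break
            reused.append(block_id)
            matched += BLOCK_SIZE
        return reused, matched

    def insert(self, token_ids: List[int], block_ids: List[int]) -> None:
        """Register the KV blocks that now hold this request's prompt tokens."""
        full_blocks = min(len(block_ids), len(token_ids) // BLOCK_SIZE)
        for i in range(full_blocks):
            self._blocks[tuple(token_ids[: (i + 1) * BLOCK_SIZE])] = block_ids[i]
```

On top of block reuse like this, the cascade-inference approach described in the FlashInfer post above computes attention over the shared-prefix blocks and over each request's unique suffix separately and then merges the two, which is where the kernel-level speedup comes from.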
The kernel Python package is planned to be released in the next three months.
Based on my previous communication with @lzhangzz, exposing TurboMind's attention implementation as a library was planned, but the specific timing cannot be confirmed right now. Please stay tuned. If SGLang wants to add support for a TurboMind backend in the future, it could presumably be integrated in a way similar to how FlashInfer is inherited. cc @merrymercy @Ying1123
Based on the benchmarking results from https://bentoml.com/blog/benchmarking-llm-inference-backends and ScaleLLM, both MLC-LLM and ScaleLLM use FlashInfer as their backend, and TurboMind still holds some performance advantages over them. Beyond that, LMDeploy offers excellent feature coverage and CUDA driver compatibility. ref https://github.com/InternLM/lmdeploy/pull/1946
The commands and results below compare TurboMind and ScaleLLM serving Llama-2-13b-chat-hf under the same workload.
# TurboMind
python3 -m lmdeploy serve api_server /workdir/Llama-2-13b-chat-hf
# ScaleLLM
python3 -m scalellm.serve.api_server --model /workdir/Llama-2-13b-chat-hf
# TurboMind
python3 benchmark_serving.py --backend openai --host 127.0.0.1 --port 23333 --dataset /workdir/ShareGPT_V3_unfiltered_cleaned_split.json --model /workdir/Llama-2-13b-chat-hf --tokenizer /workdir/Llama-2-13b-chat-hf --num-prompts 1000 --request-rate 128
# ScaleLLM
python3 benchmark_serving.py --backend openai --host 127.0.0.1 --port 8080 --dataset /workdir/ShareGPT_V3_unfiltered_cleaned_split.json --model Llama-2-13b-chat-hf --tokenizer /workdir/Llama-2-13b-chat-hf --num-prompts 1000 --request-rate 128
# TurboMind
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 108.19
Total input tokens: 245995
Total generated tokens: 196273
Request throughput (req/s): 9.24
Input token throughput (tok/s): 2273.65
Output token throughput (tok/s): 1814.09
---------------Time to First Token----------------
Mean TTFT (ms): 29953.87
Median TTFT (ms): 27039.17
P99 TTFT (ms): 80858.61
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 65.79
Median TPOT (ms): 59.59
P99 TPOT (ms): 278.76
---------------Inter-token Latency----------------
Mean ITL (ms): 215.63
Median ITL (ms): 44.80
P99 ITL (ms): 290.98
==================================================
# ScaleLLM
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 115.64
Total input tokens: 245995
Total generated tokens: 195334
Request throughput (req/s): 8.65
Input token throughput (tok/s): 2127.27
Output token throughput (tok/s): 1689.17
---------------Time to First Token----------------
Mean TTFT (ms): 33849.20
Median TTFT (ms): 33777.98
P99 TTFT (ms): 84234.96
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 65.95
Median TPOT (ms): 66.42
P99 TPOT (ms): 86.23
---------------Inter-token Latency----------------
Mean ITL (ms): 237.65
Median ITL (ms): 102.24
P99 ITL (ms): 114.65
==================================================
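From the numbers above, TurboMind achieves roughly 7% higher request throughput (9.24 vs. 8.65 req/s) and output token throughput (1814 vs. 1689 tok/s) than ScaleLLM, along with a lower mean TTFT (about 30.0 s vs. 33.8 s) at this heavily saturated request rate.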
Motivation
As we know, the LMDeploy PyTorch Engine currently outperforms vLLM slightly in throughput. vLLM is being refactored and aims to integrate FlashInfer, which has outstanding performance and is expected to bring throughput gains to vLLM. Many other LLM serving frameworks, such as mlc-llm and sglang, also use FlashInfer. I am wondering whether the LMDeploy PyTorch Engine, which emphasizes flexibility and ease of use, could provide interfaces that make it convenient to integrate other backends, similar to the abstraction in https://github.com/vllm-project/vllm/pull/3462. If it is worth the time, we would also be happy to assist with or take the lead on this work.
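As a rough illustration of what such a pluggable backend interface could look like, here is a minimal sketch. The names (AttentionBackend, paged_attention_forward, register_backend) and the tensor layout comments are hypothetical assumptions to show the shape of the abstraction, not LMDeploy's or vLLM's actual API.

```python
# Hypothetical sketch of a pluggable attention-backend interface (names are
# illustrative assumptions, not an existing LMDeploy/vLLM API).
from abc import ABC, abstractmethod
from typing import Dict, Type

import torch


class AttentionBackend(ABC):
    """Common interface the engine calls; each kernel provider implements it."""

    @abstractmethod
    def paged_attention_forward(
        self,
        query: torch.Tensor,         # [num_tokens, num_heads, head_dim]
        key_cache: torch.Tensor,     # paged KV cache, layout defined by the backend
        value_cache: torch.Tensor,
        block_tables: torch.Tensor,  # [batch, max_blocks_per_seq]
        seq_lens: torch.Tensor,      # [batch]
    ) -> torch.Tensor:
        ...


_BACKENDS: Dict[str, Type[AttentionBackend]] = {}


def register_backend(name: str):
    """Decorator so Triton/FlashInfer/TurboMind kernels can plug in by name."""
    def _wrap(cls: Type[AttentionBackend]) -> Type[AttentionBackend]:
        _BACKENDS[name] = cls
        return cls
    return _wrap


@register_backend("triton")
class TritonAttentionBackend(AttentionBackend):
    def paged_attention_forward(self, query, key_cache, value_cache,
                                block_tables, seq_lens) -> torch.Tensor:
        # Placeholder: the actual Triton paged-attention kernel would be called here.
        raise NotImplementedError


def get_backend(name: str) -> AttentionBackend:
    return _BACKENDS[name]()
```

With this shape, a FlashInfer- or TurboMind-based backend would only need to register its own implementation, and the engine could select one through a config option.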
Do you have any suggestions? Thanks. @lvhan028 @grimoire @RunningLeon