Glad to see the result. @zhyncs We know LMDeploy well. With LMDeploy already available, what motivated starting SGLang? How do the two projects differ in their positioning?
Hi @bojiang LMDeploy TurboMind and SGLang are both excellent projects, and I am a committer on both. Both offer outstanding performance and are very easy to use. SGLang uses the attention kernels from FlashInfer, while LMDeploy TurboMind uses its own TurboMind attention kernels; they are among the strongest implementations I have seen in open-source projects. SGLang is a pure Python project with excellent extensibility. It delivers far better performance than vLLM while matching vLLM's ease of use and extensibility.
Just curious. It seems that vLLM also supports FlashInfer. How does SGLang outperform it?
We have done a lot of engineering optimization in SGLang, including efficient scheduling, memory management, CUDA Graph, torch.compile, multi-node support, and more. Moreover, our benchmark conclusions and data have been verified by well-known large companies, and we provide a fully reproducible setup. You're welcome to try it. Thanks.
Sure. Actually, we are facing an issue where vLLM is not that good with AWQ models (it is fast, but not the fastest). Would SGLang help with AWQ LLMs? If it does, we can use SGLang for all AWQ models and collaborate on a social post to share the result. @zhyncs
https://github.com/bentoml/openllm-models/tree/main/src/vllm-chat Would you like to contribute an sglang-chat like this? I believe most of the code can be shared.
Hi @bojiang The best AWQ implementation should be LMDeploy's. vLLM's AWQ was recently upgraded to a Marlin-based kernel (AWQ Marlin), which performs better than before but is still not as good as LMDeploy. Moreover, LMDeploy has recently made further quantization optimizations; that PR is still under review.
As for whether SGLang helps with AWQ LLMs, the answer is yes. We currently use vLLM as the kernel library for quantization. Incidentally, the FP8 kernel was written by the CUTLASS team; vLLM adapted and integrated it. Thanks to SGLang's scheduling and other engineering optimizations, its performance significantly exceeds vLLM's even when using the same AWQ kernel. You can verify this with your own benchmarks.
If you encounter any problems along the way and want to discuss or need help, please feel free to contact us at any time. We are more than happy to offer support. Thanks.
Hi @bojiang Thank you for your suggestion. Our team is small and primarily focused on developing new features and optimizing performance, aiming to remain top-tier. Consequently, we may not have the resources to integrate third-party projects at this time. We encourage these projects to be self-supporting, but should any issues arise, we're more than willing to assist. Also, you're welcome to join our Slack channel. Thank you!
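That said, most of the vllm-chat client code should carry over, since sglang.launch_server exposes an OpenAI-compatible HTTP API. Below is a minimal sketch, assuming a locally launched server like the one in the commands that follow; the port 30000 (SGLang's usual default) and the exact endpoint path are assumptions here, so adjust them to your deployment:

# Minimal sketch: query an SGLang server through its OpenAI-compatible API.
# Assumptions (not from this thread): server on localhost:30000 serving
# /v1/chat/completions for casperhansen/llama-3-8b-instruct-awq.
import requests

BASE_URL = "http://localhost:30000/v1"  # match the host/port of your launch command

def chat(prompt: str, model: str = "casperhansen/llama-3-8b-instruct-awq") -> str:
    """Send one chat-completion request and return the reply text."""
    resp = requests.post(
        f"{BASE_URL}/chat/completions",
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 128,
            "temperature": 0,
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("Say hello in one sentence."))

The same request shape works against the vLLM OpenAI-compatible server started below, so a shared sglang-chat/vllm-chat code path would mostly differ in the launch command and the base URL.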
python -m sglang.launch_server --model-path casperhansen/llama-3-8b-instruct-awq --enable-torch-compile --disable-radix-cache --disable-cuda-graph
python -m vllm.entrypoints.openai.api_server --model casperhansen/llama-3-8b-instruct-awq --disable-log-requests
python3 -m sglang.bench_serving --backend sglang
python3 -m sglang.bench_serving --backend vllm
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Successful requests: 1000
Benchmark duration (s): 52.28
Total input tokens: 213987
Total generated tokens: 199779
Total generated tokens (retokenized): 199010
Request throughput (req/s): 19.13
Input token throughput (tok/s): 4093.40
Output token throughput (tok/s): 3821.61
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 31247.52
Median E2E Latency (ms): 30287.03
---------------Time to First Token----------------
Mean TTFT (ms): 11067.20
Median TTFT (ms): 11152.41
P99 TTFT (ms): 18850.60
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 331.46
Median TPOT (ms): 116.50
P99 TPOT (ms): 3110.57
---------------Inter-token Latency----------------
Mean ITL (ms): 161.30
Median ITL (ms): 66.04
P99 ITL (ms): 342.74
==================================================
============ Serving Benchmark Result ============
Backend: vllm
Traffic request rate: inf
Successful requests: 1000
Benchmark duration (s): 123.50
Total input tokens: 213987
Total generated tokens: 199779
Total generated tokens (retokenized): 199477
Request throughput (req/s): 8.10
Input token throughput (tok/s): 1732.64
Output token throughput (tok/s): 1617.60
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 64360.23
Median E2E Latency (ms): 65782.80
---------------Time to First Token----------------
Mean TTFT (ms): 38066.82
Median TTFT (ms): 34407.91
P99 TTFT (ms): 91629.89
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 152.29
Median TPOT (ms): 149.71
P99 TPOT (ms): 373.40
---------------Inter-token Latency----------------
Mean ITL (ms): 322.74
Median ITL (ms): 109.27
P99 ITL (ms): 550.46
==================================================
Hi @bojiang I briefly ran a benchmark of AWQ Marlin and hope it can serve as a reference for you. Thanks.
My previous description was not entirely accurate: the performance of AWQ Marlin is actually quite good. Neural Magic has implemented AWQ support based on Marlin, and they have done a fantastic job. SGLang's benchmark results are close to those of LMDeploy AWQ.
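For a quick side-by-side read of the two result blocks above, here is a small sketch that recomputes the relative gaps from the reported numbers (the values are copied from the tables; the ratios are simple arithmetic, not new measurements):

# Recompute relative gaps from the two benchmark tables above
# (same workload: 1000 requests, casperhansen/llama-3-8b-instruct-awq on each backend).
sglang = {"req_per_s": 19.13, "out_tok_per_s": 3821.61,
          "median_e2e_ms": 30287.03, "median_ttft_ms": 11152.41}
vllm = {"req_per_s": 8.10, "out_tok_per_s": 1617.60,
        "median_e2e_ms": 65782.80, "median_ttft_ms": 34407.91}

print(f"Request throughput:      {sglang['req_per_s'] / vllm['req_per_s']:.2f}x higher")          # ~2.36x
print(f"Output token throughput: {sglang['out_tok_per_s'] / vllm['out_tok_per_s']:.2f}x higher")  # ~2.36x
print(f"Median E2E latency:      {vllm['median_e2e_ms'] / sglang['median_e2e_ms']:.2f}x lower")   # ~2.17x
print(f"Median TTFT:             {vllm['median_ttft_ms'] / sglang['median_ttft_ms']:.2f}x lower") # ~3.09x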
Hi @bojiang I'll close this issue for now, since I believe I've shared all the relevant information. I'm looking forward to your next steps, including adoption and future collaboration. If necessary, we can reopen this issue. Thank you.
Hi all @parano @ssheng @larme @bojiang
We (the SGLang team) recently released the exciting v0.2 and wrote a technical blog post, Achieving Faster Open-Source Llama3 Serving with SGLang Runtime (vs. TensorRT-LLM, vLLM), on performance benchmarks. On Llama 3.1, from 8B to 405B and from BF16 to FP8, SGLang leads vLLM comprehensively in both throughput and latency.
Would you consider adding support for SGLang? Thank you very much; looking forward to your reply. cc @merrymercy @Ying1123 @hnyls2002