Closed: lucasjinreal closed this issue 9 months ago.
Mark, I tested vLLM; it gives about 2x the performance of DeepSpeed.
@lucasjinreal why close this ...?
I'm already getting reasonable performance with vLLM, so there is no need to adopt another framework. Furthermore, one can expect TensorRTLM next month, which could be a killer for fast LLM inference.
Hi @KimmiShi,
Can you share some information about your test setup? In our setting, vLLM is slower than DeepSpeed.
It's impossible for DeepSpeed to be faster than vLLM; they are essentially different things. Just try PagedAttention, it's way faster than you think.
How much throughput do you achieve with vLLM (say Llama 7B on a single A100)?
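For anyone who wants a concrete number, a minimal sketch using vLLM's offline API looks roughly like this; the checkpoint name, prompt set, and generation length are placeholder assumptions, not part of anyone's reported setup:

# Rough throughput measurement with vLLM's offline API.
# The model id and prompts below are placeholders; swap in the
# checkpoint and workload you actually care about.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="huggyllama/llama-7b")  # any 7B Llama checkpoint you have access to
prompts = ["Explain paged attention in one paragraph."] * 256
params = SamplingParams(temperature=0.8, max_tokens=256)

start = time.time()
outputs = llm.generate(prompts, params)
elapsed = time.time() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} generated tokens/s over {elapsed:.1f}s")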
@lucasjinreal I think continuous batching contributes much more to throughput than paged attention, just like lmdeploy's persistent batching did.
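To illustrate what continuous (in-flight) batching means, here is a toy scheduler sketch, purely illustrative and not vLLM's or lmdeploy's actual code: finished sequences leave the running batch immediately and waiting requests are admitted at every decode step, so slots never sit idle the way they do with static batching.

# Toy illustration of continuous batching (not any framework's internals):
# requests join and leave the running batch per step, not per batch.
from collections import deque
from dataclasses import dataclass
import random

@dataclass
class Request:
    rid: int
    remaining: int          # tokens still to generate
    generated: int = 0

MAX_BATCH = 8
waiting = deque(Request(i, random.randint(4, 32)) for i in range(20))
running = []
step = 0

while waiting or running:
    # Admit new requests whenever a slot frees up (the "continuous" part).
    while waiting and len(running) < MAX_BATCH:
        running.append(waiting.popleft())

    # One decode step: every running request emits one token.
    for req in running:
        req.generated += 1
        req.remaining -= 1

    step += 1
    # Retire finished requests immediately, freeing slots for the next step.
    for req in [r for r in running if r.remaining == 0]:
        print(f"step {step}: request {req.rid} finished after {req.generated} tokens")
    running = [r for r in running if r.remaining > 0]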
Hello, could you please share some information about the "TensorRTLM" project mentioned above? I haven't found any info about it: no articles, no announcements, nothing :(
+1
killer
It's a secret project that has not been announced yet. They are giving some companies a tryout, maybe this month. But I don't think it's any better than FT.
vLLM does not support quantization yet. How's the comparison performed? Are we comparing both under float16 precision?
vLLM has little optimization for single-request speed; it mainly targets throughput.
Its advantages are paged attention, which uses less GPU memory, and continuous batching, which gives higher GPU occupancy.
But since the implementation is almost entirely in Python, GPU utilization is not that high, maybe 80%.
Just test it yourself.
lmdeploy uses more GPU memory, but its kernels are more performant.
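For intuition on the memory point, here is a toy block-table allocator in the spirit of paged attention; it is illustrative only, not vLLM internals, and the block and pool sizes are made up. The KV cache is carved into fixed-size blocks handed out on demand, so a sequence wastes at most one partially filled block instead of a whole max-length reservation.

# Toy paged KV-cache allocator, for intuition only (not vLLM's code).
BLOCK_SIZE = 16          # tokens per KV block (assumed)
NUM_BLOCKS = 1024        # total blocks in the cache pool (assumed)

free_blocks = list(range(NUM_BLOCKS))
block_tables: dict[int, list[int]] = {}   # seq_id -> physical block ids

def append_token(seq_id: int, seq_len: int) -> None:
    """Allocate a new block only when the sequence crosses a block boundary."""
    table = block_tables.setdefault(seq_id, [])
    if seq_len % BLOCK_SIZE == 1 or not table:   # need a fresh block
        if not free_blocks:
            raise MemoryError("KV cache exhausted; preempt or swap a sequence")
        table.append(free_blocks.pop())

def free_sequence(seq_id: int) -> None:
    """Return all of a finished sequence's blocks to the pool."""
    free_blocks.extend(block_tables.pop(seq_id, []))

# Example: a 40-token sequence occupies ceil(40 / 16) = 3 blocks,
# not a max-context-sized slab.
for pos in range(1, 41):
    append_token(seq_id=0, seq_len=pos)
print(len(block_tables[0]), "blocks used")   # -> 3
free_sequence(0)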
Hi, I'm the maintainer of LiteLLM. It lets you maximize throughput by load balancing across multiple LLM endpoints. I thought it would be useful for people on this thread; I'd love feedback if not.
Here's the quick start for the LiteLLM load balancer (works with 100+ LLMs). Docs: https://docs.litellm.ai/docs/simple_proxy#model-alias
model_list:
  - model_name: openhermes
    litellm_params:
      model: openhermes
      temperature: 0.6
      max_tokens: 400
      custom_llm_provider: "openai"
      api_base: http://192.168.1.23:8000/v1
  - model_name: openhermes
    litellm_params:
      model: openhermes
      custom_llm_provider: "openai"
      api_base: http://192.168.1.23:8001/v1
  - model_name: openhermes
    litellm_params:
      model: openhermes
      custom_llm_provider: "openai"
      frequency_penalty: 0.6
      api_base: http://192.168.1.23:8010/v1
litellm --config /path/to/config.yaml
curl --location 'http://0.0.0.0:8000/chat/completions' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "openhermes",
    "messages": [
      {
        "role": "user",
        "content": "what llm are you"
      }
    ]
  }'
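Since the proxy exposes an OpenAI-compatible endpoint, the same request can also be made from Python with the openai client; the api_key below is a dummy placeholder unless you have configured keys on the proxy.

# Calling the LiteLLM proxy via the OpenAI-compatible Python client.
# Assumes the proxy from the config above is running on 0.0.0.0:8000;
# the api_key is a dummy value unless the proxy requires one.
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000", api_key="sk-anything")
resp = client.chat.completions.create(
    model="openhermes",
    messages=[{"role": "user", "content": "what llm are you"}],
)
print(resp.choices[0].message.content)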
Hello, but is LMDeploy still faster than vLLM at this time, in fp16?
Here is the benchmark result on an A100 (80G) with Llama-3-8B.
vLLM can boost throughput up to 24x compared with the vanilla Llama implementation; does lmdeploy have any speed test comparing against it?