InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

Comparison with vllm #99

Closed: lucasjinreal closed this issue 9 months ago

lucasjinreal commented 1 year ago

vllm claims a speedup of up to 24x compared with the vanilla llama implementation. Does lmdeploy have any speed test comparing against it?

KimmiShi commented 1 year ago

Marking this thread. I tested vllm and got about 2x the performance of deepspeed.

KimmiShi commented 1 year ago

@lucasjinreal why close this ...?

lucasjinreal commented 1 year ago

I am using vllm and getting reasonable performance, so there is no need to adopt another framework. Furthermore, one can expect TensorRTLM next month, which could be a killer for fast LLM inference.

wangruohui commented 1 year ago

Hi @KimmiShi,

Can you share some information about your test setting? In our setting, vllm is slower than deepspeed.

  1. vllm's default benchmark uses ShareGPT data and includes input tokens in the throughput computation.
  2. For deepspeed and vllm, we feed a batch of fake data, let the model generate to its maximum length (2048 new tokens for llama-7b), and compute the throughput of newly generated tokens only (a rough sketch of this measurement is shown below). With that setting, vllm achieves only about half the throughput of deepspeed.
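
For reference, a rough sketch of that measurement with a plain HF transformers baseline might look like this (the checkpoint name, batch size, and prompt length are placeholders, not the exact benchmark script):

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "huggyllama/llama-7b"             # placeholder checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

batch_size, prompt_len, max_new = 16, 32, 2048
# Fake prompts: random token ids of fixed length, as described above.
fake_prompts = torch.randint(0, tokenizer.vocab_size, (batch_size, prompt_len), device="cuda")

torch.cuda.synchronize()
start = time.time()
out = model.generate(fake_prompts, max_new_tokens=max_new, min_new_tokens=max_new, do_sample=False)
torch.cuda.synchronize()
elapsed = time.time() - start

# Count only newly generated tokens, excluding the prompt tokens.
new_tokens = (out.shape[1] - prompt_len) * batch_size
print(f"generation throughput: {new_tokens / elapsed:.1f} tokens/s")
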
lucasjinreal commented 1 year ago

It's impossible that deepspeed is faster than vllm; they are essentially different things. Just try PagedAttention, it's way faster than you think.

wangruohui commented 1 year ago

It's impossible that deepspeed is faster than vllm; they are essentially different things. Just try PagedAttention, it's way faster than you think.

How much throughput do you achieve with vllm (say llama 7b on single A100)?
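
For anyone who wants to put a number on this, a minimal sketch using vLLM's offline LLM API could look like the following (the checkpoint name, prompt count, and lengths are placeholders, not a reference setup):

import time
from vllm import LLM, SamplingParams

llm = LLM(model="huggyllama/llama-7b", dtype="float16")        # placeholder checkpoint
params = SamplingParams(temperature=0.0, max_tokens=2048, ignore_eos=True)
prompts = ["Hello, my name is"] * 64                           # placeholder batch

start = time.time()
outputs = llm.generate(prompts, params)
elapsed = time.time() - start

# Count only generated tokens, matching the methodology described above.
new_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"generation throughput: {new_tokens / elapsed:.1f} tokens/s")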

lvhan028 commented 1 year ago

@lucasjinreal I think continuous batching contributes much more to throughput than paged attention, just like lmdeploy's persistent batch does.
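
As a rough illustration of why iteration-level (continuous) batching raises throughput, here is a toy scheduler loop; the Request class and function names are invented for this sketch and do not reflect lmdeploy's persistent batch or vllm's scheduler:

from collections import deque
from dataclasses import dataclass
import random

@dataclass
class Request:
    remaining: int          # tokens left to generate for this sequence
    finished: bool = False

def decode_step(batch):
    # Stand-in for one forward pass that emits one token per running sequence.
    for r in batch:
        r.remaining -= 1
        r.finished = r.remaining <= 0

def run_continuous_batching(waiting, max_batch=4):
    running, steps = [], 0
    while waiting or running:
        # Admit new requests whenever slots free up, instead of waiting for the
        # whole batch to finish as static batching would.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        decode_step(running)
        steps += 1
        running = [r for r in running if not r.finished]
    return steps

requests = deque(Request(remaining=random.randint(5, 50)) for _ in range(16))
print("decode steps needed:", run_continuous_batching(requests))

The key point is that finished sequences leave the batch and waiting requests join at every decoding step, so the GPU batch stays full instead of draining down to the slowest sequence.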

MikhaelSorkin commented 1 year ago

I am using vllm and getting reasonable performance, so there is no need to adopt another framework. Furthermore, one can expect TensorRTLM next month, which could be a killer for fast LLM inference.

Hello, could you please share some information about the “TensorRTLM” project? I haven't found any info about it: no articles, no announcements, nothing :(

Kevinddddddd commented 1 year ago

I am using vllm and getting reasonable performance, so there is no need to adopt another framework. Furthermore, one can expect TensorRTLM next month, which could be a killer for fast LLM inference.

Hello, could you please share some information about the “TensorRTLM” project? I haven't found any info about it: no articles, no announcements, nothing :(

+1

sleepwalker2017 commented 1 year ago

killer

It's a secret project that has not been announced yet. They are giving some companies early access, maybe this month. But I don't think it will be any better than FT.

AIApprentice101 commented 1 year ago

vLLM does not support quantization yet. How's the comparison performed? Are we comparing both under float16 precision?

sleepwalker2017 commented 1 year ago

vllm has little optimization for per-request speed; it is mainly optimized for throughput.

Its advantages are paged attention, which uses less GPU memory, and continuous batching, which gives higher GPU occupancy.

But since the implementation is almost entirely in Python, GPU utilization is not high, maybe around 80%.

Just test it yourself.

lmdeploy uses more GPU memory, but its kernels are more performant.
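
To make the "paged attention uses less GPU memory" point concrete, here is a toy sketch of a block-table KV cache; the class and numbers are invented for illustration and are not vLLM's or lmdeploy's actual data structures:

BLOCK_SIZE = 16          # tokens per KV block (illustrative value)

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}               # seq_id -> list of physical block ids

    def append_token(self, seq_id, pos):
        """Reserve cache space for the KV of token `pos` of sequence `seq_id`."""
        table = self.block_tables.setdefault(seq_id, [])
        if pos % BLOCK_SIZE == 0:            # current block is full, grab a new one
            table.append(self.free_blocks.pop())
        return table[-1] * BLOCK_SIZE + pos % BLOCK_SIZE   # physical slot index

    def free(self, seq_id):
        """Return a finished sequence's blocks to the pool immediately."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

cache = PagedKVCache(num_blocks=8)
for pos in range(40):                        # a 40-token sequence occupies only 3 blocks
    cache.append_token("seq-0", pos)
print("blocks in use:", len(cache.block_tables["seq-0"]))
cache.free("seq-0")

Because blocks are allocated on demand and returned as soon as a sequence finishes, many more sequences fit in the same KV-cache budget than with per-sequence max-length preallocation.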

ishaan-jaff commented 11 months ago

Hi, I'm the maintainer of LiteLLM, and we let you maximize throughput by load balancing between multiple LLM endpoints. I thought it would be useful for people on this thread; I'd love feedback if not.

Here's the quick start for the LiteLLM load balancer (works with 100+ LLMs). Docs: https://docs.litellm.ai/docs/simple_proxy#model-alias

Step 1: Create a config.yaml

model_list:
  - model_name: openhermes
    litellm_params:
      model: openhermes
      temperature: 0.6
      max_tokens: 400
      custom_llm_provider: "openai"
      api_base: http://192.168.1.23:8000/v1
  - model_name: openhermes
    litellm_params:
      model: openhermes
      custom_llm_provider: "openai"
      api_base: http://192.168.1.23:8001/v1
  - model_name: openhermes
    litellm_params:
      model: openhermes
      custom_llm_provider: "openai"
      frequency_penalty: 0.6
      api_base: http://192.168.1.23:8010/v1

Step 2: Start the litellm proxy:

litellm --config /path/to/config.yaml

Step 3: Make a request to the LiteLLM proxy:

curl --location 'http://0.0.0.0:8000/chat/completions' \
--header 'Content-Type: application/json' \
--data ' {
      "model": "openhermes",
      "messages": [
        {
          "role": "user",
          "content": "what llm are you"
        }
      ]
    }
'
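
Since the proxy exposes an OpenAI-compatible chat-completions endpoint, the same request can also be made from Python with the openai client. A small sketch (the address and model alias mirror the config above; the dummy API key assumes no proxy auth is configured):

from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000", api_key="anything")  # dummy key; set a real one if the proxy enforces auth
resp = client.chat.completions.create(
    model="openhermes",
    messages=[{"role": "user", "content": "what llm are you"}],
)
print(resp.choices[0].message.content)
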
AlexBlack2202 commented 1 month ago

Hello, is LMDeploy still faster than vllm at this point, in fp16?

lvhan028 commented 1 month ago

Here is the benchmark result on A100 (80G) with llama3-8b. [benchmark charts attached as images]