InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

Poor performance of serving vision-language models using batching #1357

Closed: wanzhenchn closed this issue 6 months ago

wanzhenchn commented 7 months ago

Motivation

I have benchmarked the performance of llava-v1.6-vicuna-13b with the api server on 2*A30 GPUs. The detailed data is as follows.

When batching is used, the improvement in token QPS becomes less significant as the batch size increases, and the speedup ratio shows a decreasing trend.

Compared to serving plain LLMs, the speedup from batching is not significant.

[image: benchmark results for llava-v1.6-vicuna-13b at different batch sizes]

@irexyc @lvhan028

Follow-up to https://github.com/InternLM/lmdeploy/issues/1316


irexyc commented 7 months ago

Can you share the scripts and data you used?

wanzhenchn commented 7 months ago

Can you share the scripts and data you used?

python3 profile_vl_restful_api.py \
  ${http_server_address} \
  /path/to/llava_vqav2_mscoco_test2015.jsonl \
  --concurrency ${bs} \
  --out_len ${request_out_len} \
  --samples 200 \
  --top_k 3 \
  --top_p 0.95 \
  --temperature 0.0 \
  --repetition_penalty 1.15 \
  --device_id ${monitor_device_id} \
  --log_path 'perf.log'

wanzhenchn commented 7 months ago

Can you share the scripts and data you used?

Have you finished double-checking the performance data? @irexyc

irexyc commented 7 months ago

Hi, I am working on this issue and will give feedback this week.

wanzhenchn commented 7 months ago

Hi, I am working on this issue and will give feedback this week.

Any progress to share? @irexyc

irexyc commented 7 months ago

Sorry for the late reply. I tested with llava-v1.6-vicuna-7b and below are my test results.

[image: benchmark results for llava-v1.6-vicuna-7b]

There are some places in your script that need to be modified. If you want to output a specific length like 256, you should add 'ignore_eos': True to the payload, so that the output will not stop until it reaches that length. Moreover, you shouldn't use a local file path in the request; you should send the base64-encoded image data instead, to avoid reading from disk.
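
A minimal sketch of the two fixes (not the original profile_vl_restful_api.py), assuming the OpenAI-compatible /v1/chat/completions route and that the server accepts lmdeploy's ignore_eos extension; the URL, model name and image path are placeholders:

import base64
import requests

API_URL = "http://0.0.0.0:23333/v1/chat/completions"   # placeholder server address

# Encode the image once and send it inline instead of a local file path,
# so the server does not read from disk per request.
with open("/path/to/test.jpg", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

payload = {
    "model": "llava-v1.6-vicuna-13b",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}},
        ],
    }],
    "max_tokens": 256,
    "temperature": 0.0,
    # Force every request to generate exactly max_tokens so the benchmark
    # compares fixed output lengths, as suggested above.
    "ignore_eos": True,
}

resp = requests.post(API_URL, json=payload)
print(resp.json()["choices"][0]["message"]["content"])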

The proportion of input tokens (about 2200) is large compared with output tokens (257), which limits the speedup ratio. If you use llava-1.5 or Qwen, which use shorter image features, you will see a better speedup ratio.
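
To put a number on that ratio (simple arithmetic on the figures quoted above, not a new measurement):

# With llava-v1.6's long image features, roughly 90% of the tokens each request
# touches are prompt-side, i.e. most per-request work is prompt processing,
# which is why the speedup ratio from batching is limited here.
input_tokens = 2200   # approximate prompt + image-feature tokens (from above)
output_tokens = 257   # fixed generation length used in the benchmark

prompt_share = input_tokens / (input_tokens + output_tokens)
print(f"prompt share of total tokens: {prompt_share:.1%}")   # ~89.5%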

[image: benchmark results]
wanzhenchn commented 7 months ago

Sorry for the late reply. I tested with llava-v1.6-vicuna-7b and below are my test results.

[image: benchmark results for llava-v1.6-vicuna-7b]

There are some places in your script that need to be modified. If you want to output a specific length like 256, you should add 'ignore_eos': True to the payload, so that the output will not stop until it reaches that length. Moreover, you shouldn't use a local file path in the request; you should send the base64-encoded image data instead, to avoid reading from disk.

The proportion of input tokens (about 2200) is large compared with output tokens (257), which limits the speedup ratio. If you use llava-1.5 or Qwen, which use shorter image features, you will see a better speedup ratio.

Thanks for checking.

However, the re-measured token QPS is almost the same as shown in the table above, based on lmdeploy 0.3.0.

What could be causing this? @irexyc

irexyc commented 7 months ago

You can set log_level to INFO and check the server-side log.

[image: server-side log showing the decoding batch size]

The red line means the current decoding step has a batch size of 128.

The default max_batch_size of TurbomindEngineConfig is 128, but due to cache_max_entry_count and session_len, the actual running batch size may be smaller.
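
For reference, a minimal sketch of where these knobs live, using the Python pipeline API (the values and the model path are illustrative, not a tuned recommendation); when serving with lmdeploy serve api_server, the same engine options are exposed as CLI flags:

# Engine options mentioned above; the k/v cache budget and session_len decide
# how many requests can actually run together, so with long vision prompts the
# effective batch size can stay well below max_batch_size.
from lmdeploy import pipeline, TurbomindEngineConfig

backend_config = TurbomindEngineConfig(
    max_batch_size=128,         # upper bound on the concurrent decoding batch
    cache_max_entry_count=0.8,  # fraction of free GPU memory given to the k/v cache
    session_len=8192,           # max tokens (prompt + generation) per session
)

pipe = pipeline("liuhaotian/llava-v1.6-vicuna-7b", backend_config=backend_config)
print(pipe("Describe the weather today."))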