Can you share the used scripts and data ?
```bash
# Download the MSCOCO test2015 images used by the VQAv2 eval set
wget http://images.cocodataset.org/zips/test2015.zip

# Benchmark the VL RESTful API server with the VQAv2 prompts
python3 profile_vl_restful_api.py \
    ${http_server_address} \
    /path/to/llava_vqav2_mscoco_test2015.jsonl \
    --concurrency ${bs} \
    --out_len ${request_out_len} \
    --samples 200 \
    --top_k 3 \
    --top_p 0.95 \
    --temperature 0.0 \
    --repetition_penalty 1.15 \
    --device_id ${monitor_device_id} \
    --log_path 'perf.log'
```
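For reference, a minimal sketch of how a driver script might sample prompts from that jsonl (the record keys `image` and `text` are assumptions about the file layout, not confirmed here, and may need adjusting):

```python
import json
import random

def load_samples(jsonl_path: str, n: int, seed: int = 0) -> list:
    """Read the eval jsonl and draw `n` records at random."""
    with open(jsonl_path) as f:
        records = [json.loads(line) for line in f if line.strip()]
    random.seed(seed)
    return random.sample(records, min(n, len(records)))

# e.g. 200 samples, matching the --samples flag above; each record is assumed
# to carry an image filename and a question text.
samples = load_samples('/path/to/llava_vqav2_mscoco_test2015.jsonl', 200)
print(samples[0].get('image'), samples[0].get('text'))
```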
Have you finished double-checking the performance data? @irexyc
Hi, I am working on this issue and will give feedback this week.
Any progress to share? @irexyc
Sorry for the late reply. I tested with llava-v1.6-vicuna-7b and below is my test result.
There are some places in your script that need to be modified. If you want the output to be a specific length like 256, you should add `'ignore_eos': True` to the payload (`pload`), so generation will not stop before that length is reached. Moreover, you shouldn't put a local file path in the request; send the image as base64 data instead, to avoid reading from disk.
The input tokens (about 2200) far outnumber the output tokens (257), which limits the speedup ratio. If you use llava-1.5 or Qwen, which use shorter image features, you will see a better speedup ratio.
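To make the two changes concrete, here is a minimal sketch of a request that inlines the image as base64 and sets `ignore_eos`. The `/v1/chat/completions` route and message schema are assumptions based on lmdeploy's OpenAI-compatible server; if `profile_vl_restful_api.py` targets a different route, keep its own schema and only apply the two changes. The server address and image filename are placeholders.

```python
import base64
import requests

def build_pload(image_path: str, prompt: str, out_len: int) -> dict:
    """Build a request body with an inlined base64 image and ignore_eos."""
    with open(image_path, 'rb') as f:
        image_b64 = base64.b64encode(f.read()).decode('utf-8')
    return {
        'model': 'llava-v1.6-vicuna-7b',
        'messages': [{
            'role': 'user',
            'content': [
                {'type': 'text', 'text': prompt},
                {'type': 'image_url',
                 'image_url': {'url': f'data:image/jpeg;base64,{image_b64}'}},
            ],
        }],
        'max_tokens': out_len,
        'temperature': 0.0,
        'ignore_eos': True,  # generate exactly out_len tokens, no early stop
    }

# Placeholder endpoint and image; substitute your api_server address and a
# test2015 image file.
resp = requests.post('http://0.0.0.0:23333/v1/chat/completions',
                     json=build_pload('COCO_test2015_000000000001.jpg',
                                      'Describe the image.', 256))
print(resp.json())
```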
Thanks for checking. `'ignore_eos': True` is deliberately not passed in `pload`, because in the actual scenario the number of generated tokens may not match the requested length. However, the token QPS re-measured with lmdeploy 0.3.0 is almost the same as shown in the table above.
What does this mean? @irexyc
You can set `log_level` to `INFO` and check the server-side log.
The red line means the current decoding step has a batch size of 128. The default `max_batch_size` of `TurbomindEngineConfig` is 128, but due to `cache_max_entry_count` and `session_len`, the actual running batch size may be smaller.
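For reference, a rough sketch of how those engine knobs fit together. The values and model path below are placeholders, not a tuned recommendation for 2*A30; the same options are also exposed as CLI flags by `lmdeploy serve api_server`.

```python
from lmdeploy import TurbomindEngineConfig, pipeline

# Placeholder values: the real running batch size is bounded by max_batch_size
# and by how many sessions fit into the k/v cache given cache_max_entry_count
# and session_len.
engine_cfg = TurbomindEngineConfig(
    max_batch_size=128,          # upper bound on the per-step decoding batch
    cache_max_entry_count=0.8,   # fraction of free GPU memory for the k/v cache
    session_len=4096,            # max prompt + output tokens per session
    tp=2,                        # tensor parallelism across the two A30s
)
pipe = pipeline('/path/to/llava-v1.6-vicuna-13b', backend_config=engine_cfg)
```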
Motivation
I have benchmarked the performance of llava-v1.6-vicuna-13b with the api server on 2*A30. The detailed data is as follows.
When batching, the improvement in token QPS becomes less significant as the batch size increases, and the speedup ratio shows a decreasing trend.
Compared to serving plain LLMs, the speedup from batching is not significant.
`Input_Token_Len` denotes the prompt token length, which consists of the text tokens plus the image features embedded by the VL model. @irexyc @lvhan028
Follow-up to https://github.com/InternLM/lmdeploy/issues/1316
Related resources
Additional context
No response