Can you share the used scripts and data ?
```bash
# Download the MSCOCO test2015 images used by the VQAv2 eval set
wget http://images.cocodataset.org/zips/test2015.zip

# Benchmark the VL RESTful API server with the VQAv2 prompts
python3 profile_vl_restful_api.py \
    ${http_server_address} \
    /path/to/llava_vqav2_mscoco_test2015.jsonl \
    --concurrency ${bs} \
    --out_len ${request_out_len} \
    --samples 200 \
    --top_k 3 \
    --top_p 0.95 \
    --temperature 0.0 \
    --repetition_penalty 1.15 \
    --device_id ${monitor_device_id} \
    --log_path 'perf.log'
```
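For reference, a minimal sketch of how a driver script might sample prompts from that jsonl (the record keys `image` and `text` are assumptions about the file layout, not confirmed here, and may need adjusting):

```python
import json
import random

def load_samples(jsonl_path: str, n: int, seed: int = 0) -> list:
    """Read the eval jsonl and draw `n` records at random."""
    with open(jsonl_path) as f:
        records = [json.loads(line) for line in f if line.strip()]
    random.seed(seed)
    return random.sample(records, min(n, len(records)))

# e.g. 200 samples, matching the --samples flag above; each record is assumed
# to carry an image filename and a question text.
samples = load_samples('/path/to/llava_vqav2_mscoco_test2015.jsonl', 200)
print(samples[0].get('image'), samples[0].get('text'))
```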
Have you finished double-checking the performance data? @irexyc
Hi, I am working on this issue and will give feedback this week.
Any progress to share? @irexyc
Sorry for the late reply. I tested with llava-v1.6-vicuna-7b and below is my test result.
There are some places in your script that need to be modified. If you want the output to be a specific length like 256, you should add `'ignore_eos': True` to the payload (`pload`), so generation will not stop before that length is reached. Moreover, you shouldn't put a local file path in the request; send the image as base64 data instead, to avoid reading from disk.
The input tokens (about 2200) far outnumber the output tokens (257), which limits the speedup ratio. If you use llava-1.5 or Qwen, which use shorter image features, you will see a better speedup ratio.
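To make the two changes concrete, here is a minimal sketch of a request that inlines the image as base64 and sets `ignore_eos`. The `/v1/chat/completions` route and message schema are assumptions based on lmdeploy's OpenAI-compatible server; if `profile_vl_restful_api.py` targets a different route, keep its own schema and only apply the two changes. The server address and image filename are placeholders.

```python
import base64
import requests

def build_pload(image_path: str, prompt: str, out_len: int) -> dict:
    """Build a request body with an inlined base64 image and ignore_eos."""
    with open(image_path, 'rb') as f:
        image_b64 = base64.b64encode(f.read()).decode('utf-8')
    return {
        'model': 'llava-v1.6-vicuna-7b',
        'messages': [{
            'role': 'user',
            'content': [
                {'type': 'text', 'text': prompt},
                {'type': 'image_url',
                 'image_url': {'url': f'data:image/jpeg;base64,{image_b64}'}},
            ],
        }],
        'max_tokens': out_len,
        'temperature': 0.0,
        'ignore_eos': True,  # generate exactly out_len tokens, no early stop
    }

# Placeholder endpoint and image; substitute your api_server address and a
# test2015 image file.
resp = requests.post('http://0.0.0.0:23333/v1/chat/completions',
                     json=build_pload('COCO_test2015_000000000001.jpg',
                                      'Describe the image.', 256))
print(resp.json())
```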
Thanks for checking. `'ignore_eos': True` is deliberately not passed in `pload`, because in the actual scenario the number of generated tokens may not match the requested length. However, the token QPS re-measured with lmdeploy 0.3.0 is almost the same as shown in the table above.
What does this mean? @irexyc
You can set `log_level` to `INFO` and check the server-side log.
The red line means the current decoding step has a batch size of 128. The default `max_batch_size` of `TurbomindEngineConfig` is 128, but due to `cache_max_entry_count` and `session_len`, the actual running batch size may be smaller.
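For reference, a rough sketch of how those engine knobs fit together. The values and model path below are placeholders, not a tuned recommendation for 2*A30; the same options are also exposed as CLI flags by `lmdeploy serve api_server`.

```python
from lmdeploy import TurbomindEngineConfig, pipeline

# Placeholder values: the real running batch size is bounded by max_batch_size
# and by how many sessions fit into the k/v cache given cache_max_entry_count
# and session_len.
engine_cfg = TurbomindEngineConfig(
    max_batch_size=128,          # upper bound on the per-step decoding batch
    cache_max_entry_count=0.8,   # fraction of free GPU memory for the k/v cache
    session_len=4096,            # max prompt + output tokens per session
    tp=2,                        # tensor parallelism across the two A30s
)
pipe = pipeline('/path/to/llava-v1.6-vicuna-13b', backend_config=engine_cfg)
```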
Motivation
I have benchmarked the performance of llava-v1.6-vicuna-13b with the api server on 2*A30. The detailed data is as follows.
When batching, the improvement in token QPS becomes less significant as the batch size increases, and the speedup ratio shows a decreasing trend.
Compared to serving plain LLMs, the speedup from batching is not significant.
`Input_Token_Len` denotes the prompt token length, which consists of the text tokens plus the image features embedded by the VL model. @irexyc @lvhan028
Follow-up to https://github.com/InternLM/lmdeploy/issues/1316
Related resources
Additional context
No response