OpenGVLab / InternVL

[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal dialogue model with performance approaching GPT-4o.
https://internvl.readthedocs.io/en/latest/
MIT License

InternVL-Chat-V1.5-Int8 inference is slow #250

Closed: tairen99 closed this issue 1 month ago

tairen99 commented 2 months ago

Hi all, thank you for your good work!

I notice that when using the InternVL-Chat-V1.5-Int8 model, the inference time is very slow, as mentioned in the link.

Do you support using lmdeploy to improve the inference speed, or is there any other way to speed up inference?

Thanks in advance.

BIGBALLON commented 2 months ago

Using the 4-bit model quantized by AWQ is recommended; it is very fast and occupies less GPU memory than the int8 model.

from lmdeploy import pipeline
from lmdeploy.messages import TurbomindEngineConfig
from lmdeploy.vl import load_image

# 4-bit AWQ checkpoint served with the TurboMind backend
model = 'OpenGVLab/InternVL-Chat-V1-5-AWQ'
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
# Tell TurboMind that the weights are in AWQ format
backend_config = TurbomindEngineConfig(model_format='awq')
pipe = pipeline(model, backend_config=backend_config, log_level='INFO')
# A (prompt, image) tuple runs a single vision-language query
response = pipe(('describe this image', image))
print(response)
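
If you care about throughput rather than single-request latency, the same pipeline also accepts a batch of prompts. The snippet below is a minimal sketch that reuses pipe and load_image from the example above; the duplicated tiger image URL is only a placeholder for your own inputs.

# Reuses `pipe` and `load_image` from the snippet above; the image URLs are placeholders.
images = [
    load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg'),
    load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg'),
]
prompts = [('describe this image', img) for img in images]
# Passing a list of (prompt, image) tuples lets TurboMind batch the requests internally
responses = pipe(prompts)
for r in responses:
    print(r.text)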

Or launch it as an API service:

lmdeploy serve api_server OpenGVLab/InternVL-Chat-V1-5-AWQ --backend turbomind --model-format awq
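
Once the server is running, it exposes an OpenAI-compatible API. The client below is a minimal sketch, assuming the server is reachable on lmdeploy's default port 23333 and that the openai Python package is installed; adjust base_url and the placeholder API key for your deployment.

from openai import OpenAI

# Assumes the api_server started above is listening locally on the default port 23333
client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')
# Use whatever model name the server registered
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
    model=model_name,
    messages=[{
        'role': 'user',
        'content': [
            {'type': 'text', 'text': 'describe this image'},
            {'type': 'image_url',
             'image_url': {'url': 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg'}},
        ],
    }],
)
print(response.choices[0].message.content)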