OpenGVLab / InternVL

[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal dialogue model with performance approaching GPT-4o.
https://internvl.readthedocs.io/en/latest/
MIT License

InternVL-Chat-V1.5-Int8 inference is slow #250

Closed: tairen99 closed this issue 1 month ago

tairen99 commented 2 months ago

Hi all, thank you for your good work!

I notice that when using the InternVL-Chat-V1.5-Int8 model, the inference time is very slow, as mentioned in the link.

Do you support using lmdeploy to improve the inference speed, or is there any other way to speed up inference?

Thanks in advance.

BIGBALLON commented 2 months ago

Using the 4-bit model quantized by AWQ is recommended; it is very fast and occupies less GPU memory than the int8 model.

from lmdeploy import pipeline
from lmdeploy.messages import TurbomindEngineConfig
from lmdeploy.vl import load_image

# 4-bit AWQ checkpoint served with the TurboMind backend
model = 'OpenGVLab/InternVL-Chat-V1-5-AWQ'
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
# Tell TurboMind that the weights are in AWQ format
backend_config = TurbomindEngineConfig(model_format='awq')
pipe = pipeline(model, backend_config=backend_config, log_level='INFO')
# A (prompt, image) tuple runs a single vision-language query
response = pipe(('describe this image', image))
print(response)
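
If you care about throughput rather than single-request latency, the same pipeline also accepts a batch of prompts. The snippet below is a minimal sketch that reuses pipe and load_image from the example above; the duplicated tiger image URL is only a placeholder for your own inputs.

# Reuses `pipe` and `load_image` from the snippet above; the image URLs are placeholders.
images = [
    load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg'),
    load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg'),
]
prompts = [('describe this image', img) for img in images]
# Passing a list of (prompt, image) tuples lets TurboMind batch the requests internally
responses = pipe(prompts)
for r in responses:
    print(r.text)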

Or launch it as an API service:

lmdeploy serve api_server OpenGVLab/InternVL-Chat-V1-5-AWQ --backend turbomind --model-format awq
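
Once the server is running, it exposes an OpenAI-compatible API. The client below is a minimal sketch, assuming the server is reachable on lmdeploy's default port 23333 and that the openai Python package is installed; adjust base_url and the placeholder API key for your deployment.

from openai import OpenAI

# Assumes the api_server started above is listening locally on the default port 23333
client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')
# Use whatever model name the server registered
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
    model=model_name,
    messages=[{
        'role': 'user',
        'content': [
            {'type': 'text', 'text': 'describe this image'},
            {'type': 'image_url',
             'image_url': {'url': 'https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg'}},
        ],
    }],
)
print(response.choices[0].message.content)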