Closed: Iven2132 closed this issue 5 days ago
LMDeploy v0.4.1 can help deploying InternVL. This is a guide https://github.com/OpenGVLab/InternVL/pull/152
Hi @lvhan028, can you give us some tips on how to use it on V100 GPUs?
When deploying VLMs on GPUs with limited memory, such as the V100 and T4, it is typically essential to utilize multiple GPUs because of the large size of the model.
Although LMDeploy provides robust support for the LLM part of the VLM model on multiple GPUs, it allocates the entire vision part of the VLM model on the 0-th GPU by default. This allocation can lead to insufficient memory for the LLM part on the 0-th GPU, potentially impacting the model's functionality.
So, to deploy VLMs on GPUs with constrained memory capacities, we have to figure out a way to split the vision model into small parts and dispatch them to multiple GPUs.
We are working on this feature and will release it by the end of this month. Stay tuned.
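The splitting described above can be sketched as an even partition of the vision transformer's blocks across devices. The block count (45, roughly that of InternViT-6B) and the partitioning scheme below are illustrative assumptions, not LMDeploy's actual dispatch logic.

```python
from collections import Counter

def partition_blocks(num_blocks: int, num_gpus: int) -> list[int]:
    """Return the device id assigned to each block, keeping consecutive
    blocks together and the per-GPU counts as even as possible."""
    base, extra = divmod(num_blocks, num_gpus)
    assignment = []
    for gpu in range(num_gpus):
        # The first `extra` GPUs each take one additional block.
        count = base + (1 if gpu < extra else 0)
        assignment.extend([gpu] * count)
    return assignment

devices = partition_blocks(num_blocks=45, num_gpus=4)
print(Counter(devices))  # blocks per GPU: 12, 11, 11, 11
```

Keeping consecutive blocks on the same device means activations only cross a GPU boundary once per split, which keeps the inter-GPU transfer cost low.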
@lvhan028 If we only consider model inference, which one is faster, LMDeploy or swift?
I didn't find an inference performance benchmark in the swift repo.
If only considering the LLM part, LMDeploy can achieve 25 RPS on the ShareGPT dataset, which is nearly 2x faster than vLLM.
But LMDeploy doesn't optimize the inference of the vision model. Optimizing the vision model is beyond the scope of LMDeploy, and we don't have plans to do that.
So is it possible to use a single V100 (32 GB) GPU to deploy InternVL-Chat-V1-5-Int8? If so, how do we set the tp and k/v cache?
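A rough back-of-the-envelope check (all sizes below are assumptions, not measured numbers): InternVL-Chat-V1-5 has about 26B parameters, so int8 weights alone take roughly 26 GB, leaving only a few GB on a 32 GB V100 for activations and the k/v cache. That is why LMDeploy's `cache_max_entry_count` (the fraction of free GPU memory given to the k/v cache) would likely need to be reduced well below its default.

```python
# Back-of-the-envelope memory budget for InternVL-Chat-V1-5-Int8
# on a single 32 GB V100. All figures are rough assumptions.

GPU_MEM_GB = 32
PARAMS_B = 26        # ~26B parameters (~6B ViT + ~20B LLM)
BYTES_PER_PARAM = 1  # int8-quantized weights

weights_gb = PARAMS_B * BYTES_PER_PARAM   # ~26 GB of weights
overhead_gb = 2                           # assumed activations/runtime overhead
free_for_kv_gb = GPU_MEM_GB - weights_gb - overhead_gb

print(f"weights ~{weights_gb} GB, headroom for k/v cache ~{free_for_kv_gb} GB")
```

With a single GPU, tp stays at 1; the main knob is the k/v cache ratio, which with this little headroom would need a small value (e.g. 0.1-0.2) rather than the default.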
Great work building InternVL! I'm looking to deploy its inference as an endpoint, and I wonder if anyone could help me with that. vLLM and TGI don't support it. What's your suggestion? Please let me know.