Closed: Iven2132 closed this issue 5 days ago
LMDeploy v0.4.1 can help deploying InternVL. This is a guide https://github.com/OpenGVLab/InternVL/pull/152
Hi @lvhan028, can you give us some tips on how to use it on V100 GPUs?
When deploying VLMs on GPUs with limited memory, such as the V100 and T4, it is typically essential to utilize multiple GPUs because of the large size of the model.
Although LMDeploy provides robust support for the LLM part of the VLM model on multiple GPUs, it allocates the entire vision part of the VLM model on the 0-th GPU by default. This allocation can lead to insufficient memory for the LLM part on the 0-th GPU, potentially impacting the model's functionality.
So, to deploy VLMs on GPUs with constrained memory capacities, we have to figure out a way to split the vision model into small parts and dispatch them to multiple GPUs.
We are working on this feature and will release it by the end of this month. Stay tuned.
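The splitting described above can be sketched as an even partition of the vision transformer's blocks across devices. The block count (45, roughly that of InternViT-6B) and the partitioning scheme below are illustrative assumptions, not LMDeploy's actual dispatch logic.

```python
from collections import Counter

def partition_blocks(num_blocks: int, num_gpus: int) -> list[int]:
    """Return the device id assigned to each block, keeping consecutive
    blocks together and the per-GPU counts as even as possible."""
    base, extra = divmod(num_blocks, num_gpus)
    assignment = []
    for gpu in range(num_gpus):
        # The first `extra` GPUs each take one additional block.
        count = base + (1 if gpu < extra else 0)
        assignment.extend([gpu] * count)
    return assignment

devices = partition_blocks(num_blocks=45, num_gpus=4)
print(Counter(devices))  # blocks per GPU: 12, 11, 11, 11
```

Keeping consecutive blocks on the same device means activations only cross a GPU boundary once per split, which keeps the inter-GPU transfer cost low.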
@lvhan028 If we only consider model inference, which one is faster, LMDeploy or swift?
I didn't find an inference performance benchmark in the swift repo.
If only considering the LLM part, LMDeploy can achieve 25 RPS on the ShareGPT dataset, which is nearly 2x faster than vLLM.
But LMDeploy doesn't optimize the inference of the vision model. Optimizing the vision model is beyond the scope of LMDeploy, and we don't have plans to do that.
So is it possible to use a single V100 (32 GB) GPU to deploy InternVL-Chat-V1-5-Int8? If so, how do we set the tp and k/v cache?
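A rough back-of-the-envelope check (all sizes below are assumptions, not measured numbers): InternVL-Chat-V1-5 has about 26B parameters, so int8 weights alone take roughly 26 GB, leaving only a few GB on a 32 GB V100 for activations and the k/v cache. That is why LMDeploy's `cache_max_entry_count` (the fraction of free GPU memory given to the k/v cache) would likely need to be reduced well below its default.

```python
# Back-of-the-envelope memory budget for InternVL-Chat-V1-5-Int8
# on a single 32 GB V100. All figures are rough assumptions.

GPU_MEM_GB = 32
PARAMS_B = 26        # ~26B parameters (~6B ViT + ~20B LLM)
BYTES_PER_PARAM = 1  # int8-quantized weights

weights_gb = PARAMS_B * BYTES_PER_PARAM   # ~26 GB of weights
overhead_gb = 2                           # assumed activations/runtime overhead
free_for_kv_gb = GPU_MEM_GB - weights_gb - overhead_gb

print(f"weights ~{weights_gb} GB, headroom for k/v cache ~{free_for_kv_gb} GB")
```

With a single GPU, tp stays at 1; the main knob is the k/v cache ratio, which with this little headroom would need a small value (e.g. 0.1-0.2) rather than the default.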
Great work building InternVL! I'm looking to deploy its inference as an endpoint, and I wonder if anyone could help me with that. vLLM and TGI don't support it. What's your suggestion? Please let me know.