vody-am opened this issue 3 months ago
Hi, @vody-am. Contributors are always warmly welcomed. The small, compact vision-language models tend to build on LLMs such as Phi and Gemma, which are not supported by the turbomind engine but are supported by the pytorch engine. Although expanding the turbomind engine to cover a broader range of models is a key task, that expansion must be preceded by establishing a solid foundation, and the turbomind engine is clearly not ready for it yet.
We plan to refactor the turbomind engine within the next two months, aiming to accomplish this task by the end of July.
So, I think it's better to support the compact vision-language models in the pytorch engine for now.
@lvhan028 :saluting_face: gotcha! I'll take a shot at this, as I'm pretty interested. I'll experiment with the PyTorch side for now and send a PR in a few weeks in case others find it useful as well. If you have any general guidance on how to approach it, please share. As far as I can tell, this document describes how to add new models, and I will follow along with https://github.com/InternLM/lmdeploy/pull/1502, which recently integrated CogVLM.
Yes. The documentation and the CogVLM PR you mentioned can serve as the guidance. Looking forward to your PR.
Motivation
Hi friends,
I'm opening this issue as a place to discuss small vision-language models, please share your thoughts below!
There's recently been great success in research with smaller, more compact vision-language models. A few that come to mind are:
Generally speaking, they sacrifice some language capability but require much less VRAM and money to run. I think users (myself included) would be interested in having at least one of these model families supported in lmdeploy: many use cases do not require advanced language capability, so trading some parameters for faster, smaller models would be a great option. This would also pair well with some of the efforts in xtuner, as that project contains recipes for producing models of this type.
As far as I can see, most of these models follow a variation of the same recipe: a vision encoder (CLIP or SigLIP), followed by a linear layer or MLP projector, followed by a compact LLM such as Phi (1.3-4B), Gemma (2B), etc. Is there interest from folks in supporting something like this? I'm always happy to contribute as much as I can. It would be interesting to add support for some of the aforementioned LLMs in the turbomind layer, although the PyTorch backend is probably the easiest place to start.
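For anyone skimming, the recipe above can be sketched as a toy dataflow. This is only an illustration of the shapes involved, not code from any real model: the dimensions, weights, and the single-linear projector are all made-up placeholders (many models use a two-layer MLP instead).

```python
import numpy as np

# Toy sketch of the common small-VLM pipeline:
# vision encoder -> projector (linear/MLP) -> compact LLM.
# All sizes below are illustrative placeholders.
rng = np.random.default_rng(0)

d_vis, d_llm = 768, 2048      # e.g. a SigLIP-like hidden size -> a 2B-LLM hidden size
n_patches, n_text = 256, 16   # image patch tokens, text prompt tokens

# 1. Vision encoder output: one embedding per image patch.
patch_embeds = rng.standard_normal((n_patches, d_vis))

# 2. Projector: here a single linear layer mapping visual features
#    into the LLM's embedding space.
W_proj = rng.standard_normal((d_vis, d_llm)) * 0.02
visual_tokens = patch_embeds @ W_proj          # shape (n_patches, d_llm)

# 3. Splice the projected visual tokens in front of the text embeddings;
#    the combined sequence is what the language model consumes.
text_embeds = rng.standard_normal((n_text, d_llm))
llm_input = np.concatenate([visual_tokens, text_embeds], axis=0)

print(llm_input.shape)  # (272, 2048)
```

The point is that the only model-specific glue is step 2, which is why these compact VLMs look so similar to integrate once the underlying LLM is supported by an engine.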
Please let me know your thoughts!