InternLM / InternLM-XComposer

InternLM-XComposer2 is a groundbreaking vision-language large model (VLLM) excelling in free-form text-image composition and comprehension.

Support for using another image encoder? #252

Closed XKCUW closed 2 months ago

XKCUW commented 2 months ago

InternLM-XComposer2-vl-7b uses 'clip-vit-large-patch14-336' as its image encoder. If I want the model to support higher-resolution images, can I use a different image encoder? As far as I know, DeepSeek-VL supports 1024 × 1024 input images.

LightDXY commented 2 months ago

Hi, please try our new model with 4KHD resolution: https://huggingface.co/internlm/internlm-xcomposer2-4khd-7b

XKCUW commented 2 months ago

> Hi, please try our new model with 4KHD resolution: https://huggingface.co/internlm/internlm-xcomposer2-4khd-7b

Thanks for your reply. I did find the new 4KHD model, but my workspace cannot support its execution environment. So, back to the original question: can I use a different image encoder instead?

LightDXY commented 2 months ago

XComposer is a general framework, so you can train the model with a new vision encoder. BTW, the XComposer2-4KHD model uses the same environment as XComposer2; why is it not supported?

XKCUW commented 2 months ago

> XComposer is a general framework, so you can train the model with a new vision encoder. BTW, the XComposer2-4KHD model uses the same environment as XComposer2; why is it not supported?

4KHD requires flash-attention, which needs CUDA 11.6 or higher.

LightDXY commented 2 months ago

OK, I see. You can use 4KHD without flash-attn at a lower resolution, for example HD16 with 1344×1344 input.

XKCUW commented 2 months ago

> OK, I see. You can use 4KHD without flash-attn at a lower resolution, for example HD16 with 1344×1344 input.

Really? That is good news for me. Could you please show me some examples of how to adjust the parameters?

LightDXY commented 2 months ago

For the .chat() function, you can change hd_num=55 to a smaller number for smaller inputs.

For advanced usage, you can define the text, the image, and the image resolution freely with this function.

Our model is flexible with respect to input resolution. model.hd_num is the maximum number of image patches: for example, setting it to 9 allows a 1008×1008 image, setting it to 16 allows 1344×1344, and 55 patches allow extremely large images such as 4K HD. For most cases, 25 patches is good enough. Please refer to our paper https://arxiv.org/abs/2404.06512 for more details.
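The patch arithmetic above can be sketched as follows. Assuming each patch is the 336×336 tile used by clip-vit-large-patch14-336, a square image of side n × 336 needs n² patches, so hd_num bounds the largest square resolution (this helper is purely illustrative and not part of the XComposer API; the model itself also supports non-square aspect ratios):

```python
import math

PATCH_SIDE = 336  # tile size of clip-vit-large-patch14-336

def max_square_resolution(hd_num: int) -> int:
    """Largest square image side (in pixels) whose 336x336 tiling
    fits within hd_num patches."""
    n = math.isqrt(hd_num)  # largest n with n * n <= hd_num
    return n * PATCH_SIDE

# Values quoted in the thread:
print(max_square_resolution(9))   # 1008 (3x3 grid)
print(max_square_resolution(16))  # 1344 (4x4 grid)
print(max_square_resolution(55))  # 2352 (a 7x7 grid fits within 55 patches)
```

In the .chat() call this would presumably be passed as something like `model.chat(tokenizer, query, image=..., hd_num=16, ...)`, per the comment above; the exact keyword arguments are an assumption here, so check the model card for the authoritative signature.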

XKCUW commented 2 months ago

> For the .chat() function, you can change hd_num=55 to a smaller number for smaller inputs.
>
> For advanced usage, you can define the text, the image, and the image resolution freely with this function.
>
> Our model is flexible with respect to input resolution. model.hd_num is the maximum number of image patches: for example, setting it to 9 allows a 1008×1008 image, setting it to 16 allows 1344×1344, and 55 patches allow extremely large images such as 4K HD. For most cases, 25 patches is good enough. Please refer to our paper https://arxiv.org/abs/2404.06512 for more details.

Thanks a lot! I'll try this approach with the model and report back soon!