Closed — XKCUW closed this issue 2 months ago
Hi, please try our new model with 4KHD resolution: https://huggingface.co/internlm/internlm-xcomposer2-4khd-7b
Thanks for your reply. I did find the new 4KHD model, but my workspace cannot support its execution environment. So, back to the original question: can I use another image encoder instead?
XComposer is a general framework, so you could train the model with a new vision encoder. By the way, the XComposer2-4KHD model uses the same environment as XComposer2, so why is it not supported?
4KHD requires flash-attention, which needs CUDA 11.6 or higher.
OK, I see. You can use 4KHD without flash-attn at a lower resolution, such as HD16 with 1344×1344 input.
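A minimal sketch of running the 4KHD model at a reduced resolution, assuming the standard Hugging Face loading pattern from the model card; the checkpoint name and the `.chat()` keyword arguments follow that card, and `./example.jpg` is a placeholder path:

```python
import torch
from transformers import AutoModel, AutoTokenizer

ckpt = 'internlm/internlm-xcomposer2-4khd-7b'
# trust_remote_code pulls in the custom InternLM-XComposer2 modeling code
model = AutoModel.from_pretrained(
    ckpt, torch_dtype=torch.bfloat16, trust_remote_code=True
).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained(ckpt, trust_remote_code=True)

query = '<ImageHere>Describe this image in detail.'
# hd_num=16 (HD16) caps the input around 1344x1344, sidestepping the
# flash-attention requirement of the full 4K setting (hd_num=55)
response, _ = model.chat(
    tokenizer, query=query, image='./example.jpg',
    hd_num=16, history=[], do_sample=False
)
print(response)
```

This needs a CUDA GPU and downloads the 7B checkpoint on first run.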
Really? I think it is good news for me. Could you please shown me some examples to instruct how to adjust the parameters or something?
For the .chat() function, you can change hd_num=55 to a smaller number for smaller inputs.
For advanced usage, you can freely define the text, the image, and the image resolution with this function.
Our model is flexible with respect to input resolution: model.hd_num sets the maximum number of image patches. For example, setting it to 9 allows a 1008×1008 image, 16 allows 1344×1344, and 55 patches allow extremely large images such as 4K HD. For most cases, 25 patches is good enough. Please refer to our paper https://arxiv.org/abs/2404.06512 for more details.
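The patch arithmetic above can be sketched with a small helper. The 336-pixel patch size comes from the clip-vit-large-patch14-336 encoder mentioned in the original question; `max_square_side` is a hypothetical name for illustration, not part of the XComposer API:

```python
import math

PATCH = 336  # input size of clip-vit-large-patch14-336, the base vision encoder

def max_square_side(hd_num: int) -> int:
    """Largest square image side (in pixels) that fits in hd_num 336x336 patches."""
    return PATCH * math.isqrt(hd_num)

# hd_num=9  -> 3x3 patches -> 1008x1008
# hd_num=16 -> 4x4 patches -> 1344x1344
```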
Thanks a lot! I'll try this approach with the model and add feedback soon!
InternLM-XComposer2-vl-7b uses 'clip-vit-large-patch14-336' as its image encoder. If I want the model to support higher-resolution images, can I use another image encoder? As far as I know, DeepSeek-VL supports 1024×1024 input images.