InternLM / InternLM-XComposer

InternLM-XComposer2 is a groundbreaking vision-language large model (VLLM) excelling in free-form text-image composition and comprehension.

Questions about img_size (224, 224) and (336, 336) #282

Closed · cocoshe closed this 2 months ago

cocoshe commented 2 months ago

Thanks for your great work!

I noticed the resize_pos function in internlm-xcomposer2-7b:

https://huggingface.co/internlm/internlm-xcomposer2-7b/blob/main/build_mlp.py#L68-L105

It seems that the code is built on the assumption that only 224 * 224 images are supported: if img_size is 224 in finetune.sh, resize_pos is executed in the __init__ of CLIPVisionTower, which also means that if img_size in finetune.sh is not set to 224, fine-tuning cannot run successfully (I noticed this PR).

I want to check whether my understanding above is correct.

If so, I am curious about the purpose of the resize operation. The ViT is set to openai/clip-vit-large-patch14-336 in the code, which means the pretrained positional embedding is actually for (336, 336) images, yet resize_pos resizes the positional embedding to support (224, 224) images instead of keeping the original (336, 336) resolution that openai/clip-vit-large-patch14-336 was trained at. Or, if you want to support (224, 224), why not use something like openai/clip-vit-large-patch14-**224** as the pretrained model?
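
For context, my mental model of what resize_pos does is: interpolate the pretrained 24 x 24 patch-position grid of clip-vit-large-patch14-336 down to the 16 x 16 grid needed for (224, 224) inputs. A minimal PyTorch sketch of that understanding (not the actual code in build_mlp.py; resize_pos_embed, the bicubic mode, and the shapes below are only illustrative):

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed: torch.Tensor, src_size: int, tgt_size: int) -> torch.Tensor:
    """Interpolate a CLIP ViT position embedding to a new patch grid size.

    pos_embed: (1, 1 + src_size**2, dim) -- the CLS position followed by patch
    positions laid out on a src_size x src_size grid.
    """
    cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    dim = patch_pos.shape[-1]
    # (1, src*src, dim) -> (1, dim, src, src) so 2D interpolation can be applied
    patch_pos = patch_pos.reshape(1, src_size, src_size, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(tgt_size, tgt_size),
                              mode='bicubic', align_corners=False)
    # back to (1, tgt*tgt, dim) and re-attach the CLS position
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, tgt_size * tgt_size, dim)
    return torch.cat([cls_pos, patch_pos], dim=1)

# clip-vit-large-patch14-336: 336 / 14 = 24x24 patch grid (+1 CLS token), dim 1024
pos_336 = torch.randn(1, 1 + 24 * 24, 1024)
# target img_size 224 with patch size 14 -> 16x16 grid
pos_224 = resize_pos_embed(pos_336, src_size=24, tgt_size=16)
print(pos_224.shape)  # torch.Size([1, 257, 1024])
```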

yuhangzang commented 2 months ago
  1. We select different image resolutions for different models: img_size is 224 for the -7b model, 490 for the -vl-7b model, and does not need to be fixed for the -4khd-7b model.

  2. We select openai/clip-vit-large-patch14-336 since it is a stronger visual encoder than other variants.

cocoshe commented 2 months ago
> 1. We select different image resolutions for different models: img_size is 224 for the -7b model, 490 for the -vl-7b model, and does not need to be fixed for the -4khd-7b model.
> 2. We select openai/clip-vit-large-patch14-336 since it is a stronger visual encoder than other variants.

Thanks for your reply! So, since the positional embedding layer is trainable during SFT, the img_size doesn't actually affect performance, even though the pretrained (336, 336) position embedding of clip-vit-large-patch14-336 is discarded?

yuhangzang commented 2 months ago

During our pre-training stage, both the vision encoder and Partial LoRA are fine-tuned, which will mitigate your concern about img_size misalignment.
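
Conceptually, the trainable-parameter selection looks like the sketch below (a simplified illustration rather than the actual training code; the `vit.` prefix and the `lora_` substring used to pick out parameters are only placeholder assumptions):

```python
import torch.nn as nn

def select_pretrain_trainable(model: nn.Module) -> None:
    # Placeholder sketch: keep gradients enabled for the vision encoder
    # (including its position embedding) and the Partial LoRA adapter weights,
    # while the rest of the language model stays frozen.
    for name, param in model.named_parameters():
        is_vision = name.startswith('vit.')   # assumed prefix for the vision tower
        is_lora = 'lora_' in name             # assumed naming for Partial LoRA weights
        param.requires_grad = is_vision or is_lora
```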

cocoshe commented 2 months ago

> During our pre-training stage, both the vision encoder and Partial LoRA are fine-tuned, which will mitigate your concern about img_size misalignment.

OK, thanks for your reply!