Closed: cocoshe closed this issue 2 months ago
We select different image resolutions for different models: the img_size is 224 for the -7b model, 490 for the -vl-7b model, and it is not necessary for the -4khd-7b model.
We select openai/clip-vit-large-patch14-336 since it is a stronger visual encoder than the other variants.
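For reference, a rough back-of-the-envelope sketch (not code from the repo, and assuming the same patch-14 CLIP encoder for each model) of how each img_size maps onto the ViT patch grid, which is why the position-embedding table has to be resized whenever img_size differs from the pretrained 336:

```python
# Rough sketch, not code from the repo: how img_size maps to the patch grid
# of a patch-14 CLIP ViT, and hence to the number of position embeddings.

def patch_grid(img_size: int, patch_size: int = 14) -> tuple[int, int]:
    """Return (grid side, number of image patches) for a square input image."""
    side = img_size // patch_size
    return side, side * side

for img_size in (224, 336, 490):
    side, n = patch_grid(img_size)
    # +1 accounts for the CLS token's position embedding
    print(f"img_size={img_size}: {side}x{side} patches, {n + 1} position embeddings")
# img_size=224: 16x16 patches, 257 position embeddings
# img_size=336: 24x24 patches, 577 position embeddings
# img_size=490: 35x35 patches, 1226 position embeddings
```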
Tks for your reply! So since the embedding layer is trainable during SFT, the img_size doesn't actually affect performance, even though the pretrained support for 336 * 336 in clip-vit-large-patch14-336 is abandoned?
During our pre-training stage, both the vision encoder and Partial LoRA are fine-tuned, which will mitigate your concern about img_size misalignment.
OK, Thanks for your reply!
Tks for your great work!
I noticed the resize_pos function in internlm-xcomposer2-7b:
https://huggingface.co/internlm/internlm-xcomposer2-7b/blob/main/build_mlp.py#L68-L105
It seems that the code builds on the assumption that "only 224 * 224 images are supported", which means that if the img_size in finetune.sh is 224, the resize_pos in the init func of CLIPVisionTower will be executed, and that if the img_size in finetune.sh is not set to 224, the finetune can not be executed successfully (I noticed this PR). I want to check whether my judgements above are right.
If so, I am curious about the purpose of the resize operation: the ViT model is set to openai/clip-vit-large-patch14-336 in the code, which indicates the pos_embedding is actually meant for (336, 336) images, yet the resize_pos function just resizes the pos_embedding layer to support (224, 224) images instead of supporting the original (336, 336) that openai/clip-vit-large-patch14-336 was pretrained on. Or, if you want to support (224, 224), why not use something like openai/clip-vit-large-patch14-**224** as the pretrained model?
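For context, what a position-embedding resize like resize_pos typically does is interpolate the pretrained patch-grid embeddings to the new grid. Here is a minimal sketch assuming the usual bicubic recipe; the actual resize_pos in build_mlp.py may differ in details such as dtype handling:

```python
# Minimal sketch of ViT position-embedding resizing (bicubic interpolation of
# the patch grid); hypothetical helper, not the repo's actual resize_pos.
import torch
import torch.nn.functional as F

def resize_pos_embed_sketch(pos_embed: torch.Tensor, new_img_size: int,
                            patch_size: int = 14) -> torch.Tensor:
    """pos_embed: (1, 1 + old_grid**2, dim) -> (1, 1 + new_grid**2, dim)."""
    cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    old_grid = int(patch_pos.shape[1] ** 0.5)   # e.g. 24 for 336 / 14
    new_grid = new_img_size // patch_size       # e.g. 16 for 224 / 14
    dim = patch_pos.shape[-1]

    # (1, N, dim) -> (1, dim, old_grid, old_grid) so we can interpolate spatially
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bicubic", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_pos, patch_pos], dim=1)

# e.g. shrinking the pretrained 336-px table (577 positions) to 224 px (257 positions)
pos = torch.randn(1, 577, 1024)
print(resize_pos_embed_sketch(pos, 224).shape)  # torch.Size([1, 257, 1024])
```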
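As for the naming question: the 224-px CLIP ViT-L/14 checkpoint is published as openai/clip-vit-large-patch14, with no resolution suffix. A quick way to confirm each checkpoint's native resolution from its config:

```python
# Check the native input resolution of each CLIP checkpoint from its config.
# Note: openai/clip-vit-large-patch14 (no suffix) is the 224-px variant;
# there is no checkpoint literally named "...-224".
from transformers import CLIPVisionConfig

for name in ("openai/clip-vit-large-patch14", "openai/clip-vit-large-patch14-336"):
    cfg = CLIPVisionConfig.from_pretrained(name)
    print(name, cfg.image_size, cfg.patch_size)
# expected: 224 / 14 for the base checkpoint, 336 / 14 for the -336 variant
```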