OpenGVLab / InternVL

[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal dialogue model approaching GPT-4o performance.
https://internvl.readthedocs.io/en/latest/
MIT License
5.48k stars, 425 forks

What is the pretrain_mm_mlp_adapter checkpoint and how do I obtain it? #83

Closed: dszpr closed this issue 4 months ago

dszpr commented 5 months ago

Hi! Thanks for the great work!

In the InternVL\internvl_chat_llava\scripts_internvl\finetune_internvit6b_224to336_vicuna7b.sh script, I noticed the argument '--pretrain_mm_mlp_adapter', and I don't know exactly what it means.

I have downloaded the pretrained checkpoints: InternVL-Chat-ViT-6B-Vicuna-7B, InternVL-Chat-ViT-6B-Vicuna-13B, intern_vit_6b_224px, vicuna-7b-v1.5, and vicuna-13b-v1.5.

I only want to run stage 3, which is the internvl_chat_llava finetune. I assumed '--pretrain_mm_mlp_adapter' should point to 'InternVL-Chat-ViT-6B-Vicuna-7B' or 'InternVL-Chat-ViT-6B-Vicuna-13B', so I passed the absolute path to the argument: '--pretrain_mm_mlp_adapter /workspace/code/InternVL/ckpts/InternVL-Chat-ViT-6B-Vicuna-7B/'. However, this raises an error: 'IsADirectoryError: [Errno 21] Is a directory: '/workspace/code/InternVL/ckpts/InternVL-Chat-ViT-6B-Vicuna-7B/''.

So, what is the pretrain_mm_mlp_adapter checkpoint, and how do I obtain it if I only want to run the stage 3 finetune? And what are the 'InternVL-Chat-ViT-6B-Vicuna-7B' and 'InternVL-Chat-ViT-6B-Vicuna-13B' checkpoints used for? @czczup @whai362

dszpr commented 5 months ago

Also, is InternVL-Chat-LLaVa suitable for the nuscenes-VQA task?

czczup commented 5 months ago

Hi, thanks for your attention.

I just uploaded the pre-trained mlp_adapter to Hugging Face:

OpenGVLab/InternVL-Chat-ViT-6B-Vicuna-7B: link

OpenGVLab/InternVL-Chat-ViT-6B-Vicuna-13B: link

OpenGVLab/InternVL-Chat-ViT-6B-Vicuna-13B-448px: link
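
The IsADirectoryError reported above suggests that '--pretrain_mm_mlp_adapter' is opened as a single weights file rather than as a checkpoint directory. Below is a minimal Python sketch of a pre-flight check that can be run before launching the finetune script. The helper name resolve_adapter_file and the default file name mm_projector.bin are assumptions for illustration (the name follows the LLaVA convention and is not confirmed by this thread), so adjust it to whatever file the download actually contains.

```python
from pathlib import Path


def resolve_adapter_file(path, filename="mm_projector.bin"):
    """Return the path of a single adapter weights file.

    `filename` is an assumed default (LLaVA-style projector name);
    change it to match the file present in the downloaded checkpoint.
    """
    p = Path(path)
    if p.is_file():
        # Already a file: this is what the flag appears to expect.
        return p
    if p.is_dir():
        # A directory was given: look for the adapter file inside it.
        candidate = p / filename
        if candidate.is_file():
            return candidate
        raise FileNotFoundError(
            f"{p} is a directory and contains no {filename}; "
            "pass the adapter weights file itself"
        )
    raise FileNotFoundError(f"{p} does not exist")
```

For example, resolve_adapter_file('/workspace/code/InternVL/ckpts/InternVL-Chat-ViT-6B-Vicuna-7B/') either returns a file path that can be passed to '--pretrain_mm_mlp_adapter' or fails early with a readable message instead of an error deep inside the training run.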

czczup commented 5 months ago

Does nuscenes-VQA need to process multiple images for each sample?

dszpr commented 5 months ago

Yes. Each sample, namely each key_frame, has 6 images to be processed. So do I need to modify the dataloader code? @czczup

czczup commented 5 months ago

Yes, supporting 6 images at a time requires you to make some changes to the code, regardless of which ViT (CLIP-ViT-L or InternViT-6B) is used.
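
One common way to handle several images per sample is to flatten all camera views into a single batch for the vision encoder and then regroup the resulting features per sample. The sketch below shows only that bookkeeping in plain Python. The function names flatten_views and regroup_views are hypothetical and are not part of the InternVL code; the camera names are the six nuScenes camera channels.

```python
# The six nuScenes camera channels, one image each per key frame.
NUSCENES_CAMERAS = [
    "CAM_FRONT", "CAM_FRONT_LEFT", "CAM_FRONT_RIGHT",
    "CAM_BACK", "CAM_BACK_LEFT", "CAM_BACK_RIGHT",
]


def flatten_views(batch):
    """Turn a list of samples (each a list of views) into one flat list.

    Also returns the number of views per sample so the flat list
    can be split back into samples later.
    """
    counts = [len(views) for views in batch]
    flat = [view for views in batch for view in views]
    return flat, counts


def regroup_views(flat_items, counts):
    """Split a flat list back into per-sample groups using `counts`."""
    grouped, start = [], 0
    for n in counts:
        grouped.append(flat_items[start:start + n])
        start += n
    return grouped


# Usage: two key frames with six views each (strings stand in for images).
batch = [[f"frame{i}/{cam}" for cam in NUSCENES_CAMERAS] for i in range(2)]
flat, counts = flatten_views(batch)
# A real pipeline would run the ViT on `flat` here; this stub keeps items as-is.
features = list(flat)
per_sample = regroup_views(features, counts)
```

In a real change, the dataloader would return six image tensors per sample and the model would need to place the features of every view into the language model input, which is the part of the code that has to be adapted.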