OpenGVLab / InternVL

[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal dialogue model approaching GPT-4o's performance.
https://internvl.readthedocs.io/en/latest/
MIT License
5.4k stars 421 forks

InternVL-det? #135

Closed · fyting closed this 1 month ago

fyting commented 4 months ago

Would it be possible to enhance the detection capability of InternVL by incorporating more data combined with grounding instructions during the fine-tuning stage?

whai362 commented 4 months ago

Interesting idea! We haven't tried it yet. You can test this idea by combining InternViT-1.5 and ViTDet (https://github.com/ViTAE-Transformer/ViTDet) or ViT-Adapter (https://github.com/czczup/ViT-Adapter).
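The suggestion above amounts to freezing the pretrained vision encoder and bolting a trainable detection head or adapter on top of its token outputs. A minimal PyTorch sketch of that pattern follows; `DetectionProbe` and `ToyViT` are illustrative stand-ins, not actual InternViT, ViTDet, or ViT-Adapter APIs:

```python
import torch
import torch.nn as nn

class DetectionProbe(nn.Module):
    """Hypothetical wrapper: a frozen ViT-style backbone feeding a small
    trainable detection head (a stand-in for InternViT + a ViT-Adapter)."""

    def __init__(self, backbone: nn.Module, embed_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():   # keep the encoder frozen
            p.requires_grad = False
        # Toy head: per-token class logits + box regression (cx, cy, w, h).
        self.cls_head = nn.Linear(embed_dim, num_classes)
        self.box_head = nn.Linear(embed_dim, 4)

    def forward(self, images: torch.Tensor):
        with torch.no_grad():                  # no gradients through the encoder
            tokens = self.backbone(images)     # (B, N, embed_dim)
        return self.cls_head(tokens), self.box_head(tokens)

class ToyViT(nn.Module):
    """Stand-in backbone: a linear patch embedding, just to make the sketch run."""
    def __init__(self, embed_dim: int = 32):
        super().__init__()
        self.proj = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x).flatten(2).transpose(1, 2)  # (B, N, C)

model = DetectionProbe(ToyViT(32), embed_dim=32, num_classes=80)
logits, boxes = model(torch.randn(2, 3, 64, 64))  # 64/16 = 4 → N = 16 tokens
```

Only the two head layers receive gradients here; in practice an adapter like ViT-Adapter injects multi-scale features back into the backbone rather than treating it as a pure feature extractor.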

fyting commented 4 months ago

Every time we adjust the visual encoder, we have to go through a long pre-training process plus SFT:

- Pre-training: freeze the LLM, unfreeze the MLP, using around 40M data from the v1.2 version.
- SFT: unfreeze the LLM, unfreeze the MLP, using around 1.2M data from the v1.2 version.

Could we instead train the model directly on the SFT data (unfreezing the LLM, unfreezing the MLP, using around 40M data from the v1.2 version) to speed up iterating on the visual encoder? @whai362 @czczup
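The two-stage recipe described above boils down to a freeze/unfreeze schedule over named submodules. A minimal sketch, assuming a model that exposes `llm` and `mlp` children (the names and helper are hypothetical, not InternVL's actual training code):

```python
import torch.nn as nn

def set_stage(model: nn.Module, stage: str) -> None:
    """Hypothetical freeze/unfreeze schedule mirroring the recipe above:
    'pretrain' = frozen LLM + trainable MLP projector; 'sft' = both trainable."""
    trainable = {
        "pretrain": {"mlp"},        # freeze the LLM, unfreeze the MLP
        "sft": {"mlp", "llm"},      # unfreeze both the LLM and the MLP
    }[stage]
    for name, module in model.named_children():
        flag = name in trainable
        for p in module.parameters():
            p.requires_grad = flag

# Toy model with the two named submodules the schedule expects.
model = nn.Module()
model.llm = nn.Linear(8, 8)
model.mlp = nn.Linear(8, 8)

set_stage(model, "pretrain")   # stage 1: only the MLP projector trains
```

Skipping stage 1 would mean the randomly initialized (or re-shaped) MLP projector back-propagates noisy gradients straight into the unfrozen LLM, which is the alignment risk the maintainers point to below.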

G-z-w commented 1 month ago

Hi, it is still necessary to pretrain and align the MLP first. A more efficient training solution is left for future work.