Closed: fyting closed this issue 1 month ago
Interesting idea! We haven't tried it yet. You can test this idea by combining InternViT-1.5 and ViTDet (https://github.com/ViTAE-Transformer/ViTDet) or ViT-Adapter (https://github.com/czczup/ViT-Adapter).
Every time we adjust the visual encoder, we have to go through the full pre-training stage (LLM frozen, MLP unfrozen, with around 40M samples from the v1.2 data) plus SFT (LLM unfrozen, MLP unfrozen, with around 1.2M samples from the v1.2 data). Could we skip pre-training and train directly on the SFT data (LLM unfrozen, MLP unfrozen, with around 40M samples from the v1.2 data) to speed up iterating on the visual encoder? @whai362 @czczup
Hi, it is still necessary to pretrain and align the MLP first. A more efficient training recipe is left for future work.
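For readers unfamiliar with the two-stage recipe discussed above, here is a minimal PyTorch sketch of the freeze/unfreeze pattern. This is not InternVL's actual code; `MiniVLM`, its submodule names, and `set_stage` are hypothetical stand-ins used only to illustrate which parameters receive gradients in each stage:

```python
import torch.nn as nn

class MiniVLM(nn.Module):
    """Toy stand-in for a ViT encoder + MLP projector + LLM pipeline."""
    def __init__(self, vit_dim=32, llm_dim=64):
        super().__init__()
        self.vision_encoder = nn.Linear(vit_dim, vit_dim)  # placeholder for the ViT
        self.projector = nn.Sequential(                    # the MLP being aligned
            nn.Linear(vit_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )
        self.llm = nn.Linear(llm_dim, llm_dim)             # placeholder for the LLM

def set_stage(model: MiniVLM, stage: int) -> None:
    """Stage 1 (pretrain/align): only the MLP projector is trainable.
    Stage 2 (SFT): the LLM is unfrozen as well."""
    for p in model.parameters():
        p.requires_grad = False
    for p in model.projector.parameters():
        p.requires_grad = True
    if stage == 2:
        for p in model.llm.parameters():
            p.requires_grad = True

model = MiniVLM()
set_stage(model, 1)
trainable = sorted({n.split(".")[0] for n, p in model.named_parameters() if p.requires_grad})
```

In stage 1 only `projector` parameters are trainable; calling `set_stage(model, 2)` additionally unfreezes `llm`, matching the "freeze the LLM, unfreeze the MLP" then "unfreeze both" schedule described in the thread.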
Would it be possible to enhance InternVL's detection capability by incorporating more data combined with grounding instructions during the fine-tuning stage?