Swap out Vision encoder?

QwenLM / Qwen2-VL

Qwen2-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud.

Apache License 2.0

Swap out Vision encoder? #114

Open Deltadahl opened 2 months ago

Deltadahl commented 2 months ago

Is it possible to swap out the vision encoder for a custom one (a custom DINOv2 model) in these models and still run the full training script?

logicwong commented 2 months ago

You can try modifying the code and checkpoint, but I would advise against it. We put a lot of effort into aligning a vision encoder (VE) that supports dynamic resolution with the LLM, and switching to another VE may cause a large drop in performance.

Deltadahl commented 2 months ago

The problem I'm having is that I need to fine-tune the full model (including the VE), since I'm using it on a specialized medical dataset of images paired with reports. In my experiments so far, the model completely ignores the image during inference (I trained the full model, including the VE, for 3 epochs). Do you have any thoughts on a good way to fine-tune the full model for this new domain? (I'm currently using a dataset of 10k paired image-report examples.)
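One way to put a number on "the model ignores the image" is to compare the report loss with the correct image against the loss with an image from a different case. The sketch below is a hypothetical sanity check, not something from the Qwen2-VL repository: `loss_fn` is an assumed user-supplied callable that runs the fine-tuned model and returns the mean token loss of a report given an image, and `image_sensitivity` is a name made up for illustration.

```python
from typing import Any, Callable, Sequence


def image_sensitivity(
    images: Sequence[Any],
    reports: Sequence[Any],
    loss_fn: Callable[[Any, Any], float],  # hypothetical: loss of `report` given `image`
) -> float:
    """Return mean(mismatched loss) - mean(matched loss).

    A gap close to zero suggests the model predicts the report
    without relying on the image.
    """
    n = len(images)
    if n < 2 or n != len(reports):
        raise ValueError("need at least two (image, report) pairs of equal length")

    # Loss when every report is paired with its own image.
    matched = sum(loss_fn(images[i], reports[i]) for i in range(n)) / n

    # Rotate the images by one so every report sees another case's image.
    mismatched = sum(loss_fn(images[(i + 1) % n], reports[i]) for i in range(n)) / n

    return mismatched - matched
```

Running this on a held-out subset before and after fine-tuning would show whether training made the model more or less dependent on the image. With only 10k pairs it may be worth checking this after each epoch.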

logicwong commented 2 months ago

@simonJJJ Any suggestions?

wjbmattingly commented 2 months ago

I am struggling with this too. I have 30,000 transcribed medieval manuscript pages that I am trying to fine-tune on to improve medieval handwritten text recognition (HTR). Is there a Python script available that shows how to fine-tune the whole model?
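This is not a full training script, but before a long run it can help to confirm that the whole model, including the vision encoder, is actually unfrozen. The sketch below is a hypothetical helper (`trainable_summary` is a made-up name) that accepts anything shaped like the output of PyTorch's `model.named_parameters()` and counts trainable versus total parameters per top-level component.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Tuple


def trainable_summary(
    named_params: Iterable[Tuple[str, Any]],
    depth: int = 1,
) -> Dict[str, Tuple[int, int]]:
    """Map each component prefix to (trainable_count, total_count).

    Each parameter is expected to expose `.numel()` and `.requires_grad`,
    as torch.nn.Parameter does.
    """
    totals = defaultdict(lambda: [0, 0])
    for name, param in named_params:
        # Group by the first `depth` dotted name segments, e.g. "visual".
        key = ".".join(name.split(".")[:depth])
        count = param.numel()
        totals[key][1] += count
        if param.requires_grad:
            totals[key][0] += count
    return {key: (t, n) for key, (t, n) in totals.items()}
```

Calling `trainable_summary(model.named_parameters())` and finding a component with zero trainable parameters would mean that part of the model is frozen and is not being fine-tuned.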

Deltadahl commented 2 months ago

Bump. Or is there another way to do this, such as weighting the loss on the vision encoder more heavily?
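Since training uses a single language-modeling loss, there is no separate vision-encoder loss to reweight, and with Adam-style optimizers rescaling a loss has little effect on the update size anyway. A rough analogue is to give the vision encoder its own learning rate through optimizer parameter groups. The sketch below is an illustration under assumptions, not the Qwen team's recipe: `build_param_groups` is a made-up helper, the learning rates are placeholders, and the `"visual."` marker is an assumption about how vision-encoder parameters are named in your checkpoint, so check it against `model.named_parameters()`.

```python
from typing import Any, Dict, Iterable, List, Tuple


def build_param_groups(
    named_params: Iterable[Tuple[str, Any]],
    base_lr: float = 1e-5,           # placeholder value
    vision_lr: float = 5e-5,         # placeholder value
    vision_marker: str = "visual.",  # assumption: verify against your model
) -> List[Dict[str, Any]]:
    """Split parameters into vision-encoder and other groups with separate LRs."""
    vision, rest = [], []
    for name, param in named_params:
        # Frozen parameters do not belong in the optimizer.
        if not getattr(param, "requires_grad", True):
            continue
        if vision_marker in name:
            vision.append(param)
        else:
            rest.append(param)

    groups: List[Dict[str, Any]] = []
    if vision:
        groups.append({"params": vision, "lr": vision_lr})
    if rest:
        groups.append({"params": rest, "lr": base_lr})
    return groups
```

The returned list uses the per-parameter-group format that PyTorch optimizers accept, so it could be passed as `torch.optim.AdamW(build_param_groups(model.named_parameters()))`. Whether a higher or lower vision-encoder rate helps on a small domain dataset is something to determine experimentally.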