Open Deltadahl opened 2 months ago
You can try to modify the code and ckpt, but I would advise against it. We spent a lot of effort to align a vision encoder that supports dynamic resolution with the LLM, and switching it to another VE may cause a huge performance decline.
The problem I'm having is that I need to fine-tune the full model (including the VE) since I'm using it for a specific medical dataset paired with reports, but in my experiments so far the model ignores the image completely during inference (I trained the full model, including the VE, for 3 epochs). Do you have any thoughts on how to fine-tune the full model to this new domain effectively? (I'm currently using a dataset of 10k paired image-reports.)
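One common way to keep the VE/LLM alignment from collapsing during full fine-tuning is to train everything but give the vision encoder a much smaller learning rate than the language model. A minimal PyTorch sketch, assuming a model with separate vision-encoder and LLM submodules (the attribute names and learning rates here are illustrative, not this repo's actual API):

```python
import torch

# Toy stand-in for a vision-language model; in practice you would pass the
# real submodules' parameters into the optimizer param groups the same way.
class ToyVLM(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.visual = torch.nn.Linear(16, 16)  # stands in for the vision encoder
        self.llm = torch.nn.Linear(16, 16)     # stands in for the language model

model = ToyVLM()

# A much smaller LR on the VE lets the whole model adapt to the new domain
# while protecting the pretrained image-text alignment from 3 epochs on a
# relatively small 10k-pair dataset.
optimizer = torch.optim.AdamW(
    [
        {"params": model.visual.parameters(), "lr": 1e-6},  # vision encoder
        {"params": model.llm.parameters(), "lr": 1e-5},     # language model
    ],
    weight_decay=0.01,
)

for group in optimizer.param_groups:
    print(group["lr"])
```

A variant of the same idea is to freeze the VE entirely for the first epoch and unfreeze it afterwards, so the LLM first learns to read the reports before the image features start shifting.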
@simonJJJ Any suggestions?
I am struggling here too. I have 30,000 transcribed medieval manuscript pages that I am trying to fine-tune on for improved medieval HTR. Is there a Python script available that shows how to fine-tune the whole model?
Bump. Or is there another way to do this, like weighting the gradients on the vision encoder more heavily?
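For what it's worth, one way to implement that kind of weighting without touching the loss itself is to register a tensor hook that scales the gradients flowing into the vision encoder's parameters. Applied to every parameter of the VE submodule, this acts like a per-module learning-rate multiplier. A toy sketch on a single parameter (the scale factor is illustrative):

```python
import torch

VE_GRAD_SCALE = 2.0  # >1 pushes the VE harder, <1 protects its pretraining

w = torch.nn.Parameter(torch.ones(3))         # stand-in for one VE weight
w.register_hook(lambda g: g * VE_GRAD_SCALE)  # scales its grad during backward

# d(loss)/dw is [1., 2., 3.]; the hook doubles it before the optimizer sees it.
loss = (w * torch.tensor([1.0, 2.0, 3.0])).sum()
loss.backward()
print(w.grad)  # tensor([2., 4., 6.])
```

In a real run you would loop over `vision_encoder.parameters()` and register the same hook on each one.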
Is it possible to swap out the vision encoder for a custom vision encoder (a custom DINOv2 model) in these models and still run the full training script?