Regarding the 3rd stage, where your paper states 'Finally, we further perform instruction tuning of the pre-trained model on visual language instruction datasets': is SigLIP also unfrozen and trained during this stage, or are only the projector and the LLM updated?