NVlabs / VILA

VILA - a multi-image visual language model with training, inference and evaluation recipe, deployable from cloud to edge (Jetson Orin and laptops)

Does the visual encoder participate in training? #95

Closed LoverLost closed 3 weeks ago

LoverLost commented 1 month ago

In the 3rd stage, which your paper describes as 'Finally, we further perform instruction tuning of the pre-trained model on visual language instruction datasets,' is SigLIP also unfrozen and trained, or are only the projector and the LLM updated?

LoverLost commented 1 month ago

If possible, I would also like to know which version of LLaMA is used as the LLM in the 3B model.

Lyken17 commented 3 weeks ago

The ViT was unfrozen during VILA training.

For the 3B model, we used Sheared-LLaMA from Princeton.
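
For anyone wiring this up themselves, here is a minimal, hypothetical PyTorch sketch of what "the ViT is not frozen" means in stage 3: the vision tower, the projector, and the LLM all keep `requires_grad=True`. The `ToyVLM` class and its attribute names are placeholders for illustration, not VILA's actual module layout.

```python
import torch.nn as nn


def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Enable or disable gradient updates for every parameter of a module."""
    for param in module.parameters():
        param.requires_grad = trainable


class ToyVLM(nn.Module):
    """Stand-in for a VILA-style model: vision tower + projector + LLM (names are illustrative)."""

    def __init__(self):
        super().__init__()
        self.vision_tower = nn.Linear(16, 16)  # placeholder for the SigLIP/ViT encoder
        self.mm_projector = nn.Linear(16, 16)  # placeholder for the multimodal projector
        self.llm = nn.Linear(16, 16)           # placeholder for the language model


model = ToyVLM()

# Stage 3 (visual instruction tuning): per the answer above, the ViT is not frozen,
# so all three components receive gradient updates.
for sub in (model.vision_tower, model.mm_projector, model.llm):
    set_trainable(sub, True)

print(sum(p.numel() for p in model.parameters() if p.requires_grad), "trainable parameters")
```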

Lyken17 commented 3 weeks ago

https://huggingface.co/princeton-nlp/Sheared-LLaMA-2.7B
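
For context, Sheared-LLaMA-2.7B is a standard Hugging Face causal LM checkpoint, so it can be loaded with `transformers` as a plain text model before any VILA-specific multimodal wiring. The snippet below is just that generic load, not VILA's own loading code:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "princeton-nlp/Sheared-LLaMA-2.7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Quick sanity check: generate a short continuation with the base LLM.
prompt = "VILA pairs a vision encoder with"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```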