LLaVA-VL / LLaVA-NeXT

Apache License 2.0

Was CLIP ViT-L/14 Frozen in LLaVA-NeXT-Video-DPO (7B)? #66

Closed jongwoopark7978 closed 3 months ago

jongwoopark7978 commented 3 months ago

Hi team,

I am currently using LLaVA-NeXT-Video-DPO (7B) and want to confirm whether it uses the pre-trained CLIP ViT-L/14. During training, do you freeze the visual encoder in the same way as in LLaVA-1.5? I ask because I hope to use the CLIP text encoder to measure the similarity between visual and text tokens.
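If the vision encoder is indeed the frozen CLIP ViT-L/14, the similarity measurement described above reduces to a cosine similarity between the pooled visual and text embeddings in CLIP's shared space. A minimal sketch of that comparison, assuming the embeddings have already been extracted and projected (the vectors below are hypothetical placeholders, not real model outputs):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors of equal length.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical pooled embeddings; in practice these would come from the CLIP
# vision and text encoders after projection into the shared embedding space.
visual_embedding = [0.2, 0.5, -0.1, 0.8]
text_embedding = [0.1, 0.6, -0.2, 0.7]

score = cosine_similarity(visual_embedding, text_embedding)
```

With real CLIP outputs, the embeddings are typically L2-normalized first, so the dot product alone gives the similarity score.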

ZhangYuanhan-AI commented 3 months ago

Yes, it was frozen.

jongwoopark7978 commented 3 months ago

Thank you for your quick answer.

Wang-Xiaodong1899 commented 1 month ago

Hi @ZhangYuanhan-AI, you said that the visual encoder is frozen, but model.safetensors.index.json in LLaVA-NeXT-Video-DPO (7B) contains vision_tower keys like the following: [screenshot of vision_tower keys]

So I'm a little confused: which parts are trainable when training LLaVA-NeXT-Video-DPO (7B)?

Would you mind sharing the hyper-parameters used to train LLaVA-NeXT-Video-DPO (7B)?

Thanks a lot!
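The inspection described above can be reproduced by filtering the weight_map of the sharded-checkpoint index, which maps each parameter name to the shard file storing it. A sketch using a hypothetical minimal index (real index files contain many more entries):

```python
import json

# Hypothetical minimal model.safetensors.index.json content; a real index
# maps every parameter name in the model to its shard file.
index_text = """
{
  "weight_map": {
    "model.vision_tower.vision_tower.vision_model.embeddings.patch_embedding.weight": "model-00001-of-00002.safetensors",
    "model.layers.0.self_attn.q_proj.weight": "model-00002-of-00002.safetensors"
  }
}
"""

index = json.loads(index_text)

# Collect the parameter names that belong to the vision tower.
vision_keys = [k for k in index["weight_map"] if "vision_tower" in k]
```

Note that the presence of vision_tower keys in the checkpoint only shows the weights were saved alongside the model, not that they were updated during training.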

ZhangYuanhan-AI commented 1 month ago

The checkpoint's vision tower weights are the same as the original CLIP weights; they were saved with the model but not updated during training.
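The answer above can be verified empirically by comparing the checkpoint's vision_tower tensors against the released CLIP ViT-L/14 weights tensor by tensor. A minimal sketch of the comparison logic, using plain lists of floats in place of real tensors (loading the actual checkpoints is omitted):

```python
def weights_match(state_dict_a, state_dict_b, tol=1e-6):
    # Compare two flat state dicts mapping parameter name -> list of floats.
    # Returns True only if both dicts have identical keys and every
    # corresponding value agrees within the tolerance.
    if state_dict_a.keys() != state_dict_b.keys():
        return False
    return all(
        len(state_dict_a[key]) == len(state_dict_b[key])
        and all(
            abs(x - y) <= tol
            for x, y in zip(state_dict_a[key], state_dict_b[key])
        )
        for key in state_dict_a
    )
```

If the vision encoder was frozen, this check against the original CLIP release should pass for every vision_tower parameter.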