PKU-YuanGroup / Video-LLaVA

【EMNLP 2024🔥】Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
https://arxiv.org/pdf/2311.10122.pdf
Apache License 2.0
2.88k stars, 207 forks

Questions about LanguageBind Usage #180

Open lingjunzhao opened 2 months ago

lingjunzhao commented 2 months ago

Hi,

Thanks for releasing the code! I have been reading your paper, but I still have some questions about how LanguageBind is used in Video-LLaVA:

1) During Video-LLaVA training, were the weights of the image/video encoder (initialized from LanguageBind) trainable or frozen?

2) Which LanguageBind version from the model zoo did you use to initialize the weights, e.g. LanguageBind_Video_V1.5_FT or LanguageBind_Video_FT?