PKU-YuanGroup / Video-LLaVA

【EMNLP 2024🔥】Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
https://arxiv.org/pdf/2311.10122.pdf
Apache License 2.0
2.88k stars, 207 forks

About class embedding #165

Open feiyu12138 opened 3 months ago

feiyu12138 commented 3 months ago

Congrats on the fantastic work! After reading the code, I'm a bit confused about how videos are encoded. Specifically, could I ask why you chose to keep the class embedding together with the patch embeddings, whereas LLaVA uses only the patch embeddings? I'd really appreciate it if you could clear up my confusion!
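To make the question concrete, here is a minimal sketch of the two token-selection choices being compared. It is not taken from either codebase: the function `select_tokens`, the mode names `"patch"` and `"cls_patch"`, and the sizes (224x224 input, 14-pixel patches, a toy hidden size of 4) are all assumptions for illustration, and the token at index 0 is assumed to be the class embedding.

```python
from typing import List

Token = List[float]


def select_tokens(encoder_output: List[Token], mode: str = "patch") -> List[Token]:
    """Pick which ViT output tokens are passed on to the projector.

    encoder_output is assumed to be ordered [CLS, patch_1, ..., patch_N].
    Both mode names are hypothetical labels used only in this sketch.
    """
    if mode == "patch":
        # Drop the class embedding, keep only the patch embeddings.
        return encoder_output[1:]
    if mode == "cls_patch":
        # Keep the class embedding together with the patch embeddings.
        return encoder_output
    raise ValueError(f"unknown mode: {mode!r}")


# Assumed sizes: 224x224 input with 14x14 patches gives 16 * 16 = 256 patches.
num_patches = (224 // 14) ** 2
hidden_size = 4  # toy value, far smaller than a real encoder
tokens = [[0.0] * hidden_size for _ in range(1 + num_patches)]

patch_only = select_tokens(tokens, "patch")
cls_and_patch = select_tokens(tokens, "cls_patch")
print(len(patch_only), len(cls_and_patch))  # 256 257
```

Under these assumptions the only difference is one extra token per image or frame, so the question is about what that extra global token contributes rather than about sequence length.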