Congrats on the fantastic work! After reading the code, I'm a little confused about the approach to encoding videos. Specifically, could I ask why you chose to keep the class embedding together with the patch embeddings, whereas LLaVA uses only the patch embeddings? I'd really appreciate it if you could clear up my confusion!