thecooltechguy opened this issue 5 months ago
Hi @thecooltechguy
The main benefit comes from training-data improvements during pre-training.
We are working on technical papers and plan to reveal more details once they are ready :)
@Lyken17, Great work! Looking forward to the technical paper!
@Lyken17 Hi, I noticed that the paper was updated a few days ago, but it still does not mention the video-understanding capability. Comparing VILA's initial submission with version 1.5, I found that the pre-training dataset only added ShareGPT4V, while the SFT stage added video-related datasets such as Shot2Story and ShareGPT4Video. Moreover, the model was switched from LLaMA-2 + CLIP to LLaMA-3 + SigLIP/InternViT. Could you elaborate on these changes in more detail?
We will release the arXiv paper sometime in July. Stay tuned :)
Congrats on adding support for video understanding to VILA, looks super cool!
Just curious, is there an updated or new paper with more technical details on how the improved video understanding was added to the VILA model?
Thanks!