NVlabs / VILA

VILA - a multi-image visual language model with training, inference and evaluation recipe, deployable from cloud to edge (Jetson Orin and laptops)

Updated paper on the latest model (video understanding, etc.) #38


thecooltechguy commented 5 months ago

Congrats on adding support for video understanding to VILA, looks super cool!

Just curious, is there an updated or new paper with more technical details on how the improved video understanding was added to the VILA model?

Thanks!

Lyken17 commented 5 months ago

Hi @thecooltechguy

The main benefit comes from improvements to the training data during pre-training.

We are working on a technical paper and plan to reveal more details once it is ready :)

hkunzhe commented 4 months ago

@Lyken17, great work! Looking forward to the technical paper!

hkunzhe commented 4 months ago

@Lyken17 Hi, I noticed that the paper was updated a few days ago, but it still does not mention the video understanding capability. After comparing VILA's initial submission with version 1.5, I found that the pre-training data only added ShareGPT4V, while the SFT stage added video-related datasets such as Shot2Story and ShareGPT4Video. Moreover, the backbone was switched from Llama-2 + CLIP to Llama-3 + SigLIP/InternViT. Could you elaborate on these changes in more detail?
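
For concreteness, here is how I read the backbone swap: a rough sketch using what I assume are the closest public HuggingFace checkpoints. The checkpoint names are my guess, not taken from the VILA code, and the actual training code may wire these components differently.

```python
# Illustrative sketch of the backbone change as I understand it.
# Checkpoint names are assumptions (closest public equivalents), not from the VILA repo.
# Note: the meta-llama checkpoints are gated and require accepting the license on HF.
from transformers import AutoModelForCausalLM, CLIPVisionModel, SiglipVisionModel

# Initial VILA submission: Llama-2 LLM + CLIP vision tower
clip_tower = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")
llama2 = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# VILA 1.5: Llama-3 LLM + SigLIP (or InternViT) vision tower
siglip_tower = SiglipVisionModel.from_pretrained("google/siglip-so400m-patch14-384")
llama3 = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
```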

Lyken17 commented 3 months ago

We will release the arXiv paper sometime in July. Stay tuned :)