Efficient-Large-Model / VILA

VILA - a multi-image visual language model with training, inference and evaluation recipe, deployable from cloud to edge (Jetson Orin and laptops)
Apache License 2.0

Updated paper on the latest model (video understanding, etc.) #38

Open thecooltechguy opened 1 month ago

thecooltechguy commented 1 month ago

Congrats on adding support for video understanding to VILA, looks super cool!

Just curious, is there an updated or new paper with more technical details on how video understanding was improved in the VILA model?

Thanks!

Lyken17 commented 1 month ago

Hi @thecooltechguy

The main benefit comes from improvements to the training data during pre-training.

We are working on technical papers and plan to reveal more details once they are ready :)

hkunzhe commented 1 month ago

@Lyken17, Great work! Looking forward to the technical paper!

hkunzhe commented 1 month ago

@Lyken17 Hi, I noticed that the paper was updated a few days ago, but it still does not mention the video understanding capability. After comparing VILA's initial submission and version 1.5, I found that the pre-training dataset only added ShareGPT4v, while in SFT, video-related datasets such as shot2story/ShareGPT4Video were added. Moreover, the model was switched from llama2 + clip to llama3 + siglip/internvit. Could you elaborate on these changes in more detail?
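
For readers skimming the thread, the backbone change described above can be pictured roughly as below. This is an illustrative sketch using Hugging Face `transformers`, not VILA's actual model-loading code; the checkpoint IDs are example public models, and the `load_backbones` helper is hypothetical.

```python
# Illustrative sketch only -- not VILA's actual loading code.
# It contrasts the backbone pairings mentioned in the comment above:
# the original VILA recipe (CLIP vision tower + Llama-2 LLM) versus the
# VILA-1.5 recipe (SigLIP or InternViT vision tower + Llama-3 LLM).
# Checkpoint names are example Hugging Face model IDs, not VILA weights.
from transformers import AutoModelForCausalLM, CLIPVisionModel, SiglipVisionModel


def load_backbones(version: str = "1.5"):
    """Hypothetical helper returning (vision_tower, llm) for a given recipe."""
    if version == "1.0":
        # Original VILA pairing: CLIP ViT-L + Llama-2.
        vision_tower = CLIPVisionModel.from_pretrained(
            "openai/clip-vit-large-patch14-336"
        )
        llm = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
    else:
        # VILA-1.5 pairing: SigLIP (or InternViT) + Llama-3.
        vision_tower = SiglipVisionModel.from_pretrained(
            "google/siglip-so400m-patch14-384"
        )
        llm = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
    return vision_tower, llm
```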

Lyken17 commented 1 week ago

We will release the arXiv paper sometime in July. Stay tuned :)