Efficient-Large-Model / VILA

VILA - a multi-image visual language model with training, inference and evaluation recipe, deployable from cloud to edge (Jetson Orin and laptops)
Apache License 2.0

How does VILA preprocess video? #66

Open MonolithFoundation opened 1 month ago

MonolithFoundation commented 1 month ago

Hi, it looks like VILA was trained on a lot of videos. How are the frames sampled? And how does it deal with S2?

yaolug commented 1 month ago

For this release, we use uniform sampling. Each frame can go through S2 w/o problem.
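
For reference, here is a minimal sketch of uniform frame sampling, not the repo's actual preprocessing code. It assumes an OpenCV-based decoder; the function name and frame count are illustrative. Each sampled frame is then treated like a standalone image, which is why it can go through S2 the same way an image would.

```python
# Illustrative sketch of uniform frame sampling (not VILA's actual implementation).
import cv2
import numpy as np


def sample_frames_uniform(video_path: str, num_frames: int = 8):
    """Pick `num_frames` frames spread evenly across the video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices from the start to the end of the video.
    indices = np.linspace(0, total - 1, num_frames, dtype=int)

    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            # Each frame is handled as an independent image, so the usual
            # image pipeline (including S2 multi-scale processing) applies.
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames
```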