LLaVA-VL / LLaVA-NeXT


How to access video data in LLaVA-OneVision? #190

Open xsgldhy opened 3 months ago

xsgldhy commented 3 months ago

Thank you for your contribution. Under the Hugging Face lmms-lab/LLaVA-OneVision-Data repo I can only find single-image data. In scripts/train/README.md you state that the video data incorporates Youcook2 (32267), Charades (19851), NextQA (7653), activitynet (5153), and ego4d (671), but under the Hugging Face lmms-lab repo I cannot find the ego4d dataset, and Youcook2 only has val and test splits, which is fewer samples than the number reported in the paper (41.9k). Does anyone know where to find this video data annotated in the LLaVA format?
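For reference, this is roughly how I have been checking what is actually hosted under the repo. It is a minimal sketch assuming the standard `huggingface_hub` / `datasets` APIs and that the repo is public; the repo ID and dataset keywords are just the ones mentioned above.

```python
from huggingface_hub import list_repo_files
from datasets import get_dataset_config_names

# List every file uploaded to the dataset repo to see which sources are present.
files = list_repo_files("lmms-lab/LLaVA-OneVision-Data", repo_type="dataset")
print(f"{len(files)} files in LLaVA-OneVision-Data")

# List the named subsets (configs) and check whether any look like the video sources.
configs = get_dataset_config_names("lmms-lab/LLaVA-OneVision-Data")
video_keywords = ("youcook", "charades", "nextqa", "activitynet", "ego4d")
video_like = [c for c in configs if any(k in c.lower() for k in video_keywords)]
print("total subsets:", len(configs), "| video-like subsets:", video_like)
```

On my end the listing only shows single-image subsets, which is why I am asking where the video annotations live.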

xsgldhy commented 3 months ago

I am also a little confused about the relationship between the OneVision data and M4-Instruct-Data. Does OneVision contain all of M4-Instruct? (image attached) I assume the 560K multi-image samples are a subset of M4-Instruct, but M4-Instruct contains 615K multi-image samples, so how can I find the 560K subset? Also, the lmms-lab/LLaVA-OneVision-Data repo actually contains the "Single-Image 3.2M" data, not the "OneVision 1.6M", so how can I find the "800K higher-quality data re-sampled from previous stage"?
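To check whether the multi-image portion of OneVision overlaps with M4-Instruct, the best I have come up with is comparing the raw file listings of the two repos. A rough sketch (only a file-level comparison, assuming both repos are public and readable with `huggingface_hub`):

```python
from huggingface_hub import list_repo_files

# Compare what is actually uploaded in the two dataset repos; file names are
# whatever the maintainers chose, so this only reveals overlap at the file level,
# not which 560K samples were re-sampled from the 615K.
for repo in ("lmms-lab/M4-Instruct-Data", "lmms-lab/LLaVA-OneVision-Data"):
    files = list_repo_files(repo, repo_type="dataset")
    print(repo, "-", len(files), "files")
    print("\n".join(sorted(files)[:10]))  # peek at the first few entries
```

This does not tell me which samples were re-sampled, so any pointer to the actual 560K / 800K lists would be appreciated.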

yongliang-wu commented 2 months ago

Hi xsgldhy, have you solved this problem?