DAMO-NLP-SG / VideoLLaMA2

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
Apache License 2.0

Doubt regarding training data #54

Closed by pritamqu 4 months ago

pritamqu commented 4 months ago

In Table 2, it's mentioned that a total of 12.2M vision-text pairs are used. However, the GitHub repo shares only 703K video-text and 558K image-text pairs. Can you share the missing pretraining data?
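For reference, a quick way to confirm the sizes of the released annotation files is to count the entries locally. The sketch below assumes Video-LLaVA-style JSON annotation lists; the file paths are placeholders, not the actual names shipped with the repo.

```python
import json

# Hypothetical paths to the released annotation files (adjust to your download).
ANNOTATION_FILES = {
    "video-text (pretrain)": "path/to/video_text_pretrain.json",
    "image-text (pretrain)": "path/to/image_text_pretrain.json",
}

for name, path in ANNOTATION_FILES.items():
    with open(path, "r") as f:
        samples = json.load(f)  # each file is a list of sample dicts
    # Expect roughly 703K video-text and 558K image-text samples.
    print(f"{name}: {len(samples):,} samples")
```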

pritamqu commented 4 months ago

A similar mismatch is noticeable for the fine-tuning data as well. Would it be possible to share the entire training data?

lixin4ever commented 4 months ago

Sorry for the confusion. We currently have no plan to release our full training data. In the README file, we only provided the pre-training and fine-tuning datasets from Video-LLaVA to demonstrate how to train a Video-LLM with our codebase.

pritamqu commented 4 months ago

Thanks for letting me know.