A similar mismatch is noticeable for the fine-tuning data as well. Would it be possible to share the entire training data?
Sorry for the confusion. We currently have no plan to release our training data. In the README file, we only provided the pre-training and fine-tuning datasets from Video-LLaVA to demonstrate how to train a VideoLLM with our codebase.
Thanks for letting me know.
In Table 2, it is mentioned that a total of 12.2M vision-text pairs are used. However, the GitHub repo shares 703K video-text and 558K image-text pairs. Can you share the missing pre-training data?