DAMO-NLP-SG / VideoLLaMA2

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
Apache License 2.0

Can videollama2 continue finetuning on my own dataset using 32 frames? #93

Closed zhengrongz closed 2 months ago

zhengrongz commented 2 months ago

Hi! Thanks for your excellent work! I wonder whether I can use 32 frames per video to finetune the model on my own dataset. If so, do I just need to change the number of sampled frames in constants? Looking forward to your reply!
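For context, "sampled frames" here refers to picking a fixed number of frames uniformly from each video before encoding. A minimal sketch of that uniform-sampling strategy is below; the function name and rounding details are illustrative, not VideoLLaMA 2's actual implementation:

```python
import numpy as np

def sample_frame_indices(total_frames: int, num_frames: int = 32) -> list[int]:
    """Uniformly sample `num_frames` indices from a video of `total_frames` frames.

    Illustrative sketch only: raising `num_frames` (e.g. from 8 to 32) simply
    spreads more sample points evenly across the same video.
    """
    if total_frames <= num_frames:
        # Short video: keep every frame rather than oversampling.
        return list(range(total_frames))
    # Evenly spaced positions from the first to the last frame, rounded to ints.
    return np.linspace(0, total_frames - 1, num_frames).round().astype(int).tolist()
```

For example, a 300-frame clip with `num_frames=32` yields 32 indices starting at frame 0 and ending at frame 299.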

lixin4ever commented 2 months ago

Yes, I believe it is fine to do so. In our internal evaluations, we found that our video models generalize well to longer inputs (i.e., more input frames), and they usually perform better given longer inputs.

You can specify this argument explicitly in your own training script to support training with more video frames.
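One hedged way to wire this up is to expose the frame count as a command-line argument that overrides the module-level default, so the training script doesn't need code edits per run. The flag name `--num_frames` and the default value here are assumptions for illustration; check your local finetuning script for the exact names:

```python
import argparse

# Hypothetical default mirroring a NUM_FRAMES-style constant; the real
# value lives in the repo's constants module.
DEFAULT_NUM_FRAMES = 8

def parse_args(argv=None):
    """Parse training options, letting --num_frames override the default."""
    parser = argparse.ArgumentParser(description="Finetuning launcher sketch")
    parser.add_argument(
        "--num_frames",
        type=int,
        default=DEFAULT_NUM_FRAMES,
        help="number of frames sampled per video (e.g. 32)",
    )
    return parser.parse_args(argv)

args = parse_args(["--num_frames", "32"])
```

Passing `--num_frames 32` on the command line would then flow through to wherever the sampler reads `args.num_frames`.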

zhengrongz commented 2 months ago

@lixin4ever OK thank you!