DAMO-NLP-SG / VideoLLaMA2

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
Apache License 2.0
752 stars 50 forks source link

Can we do the only text, image and text and video and text finetuning with lora in a one run #84

Closed thisurawz1 closed 2 weeks ago

thisurawz1 commented 3 weeks ago

Can we do only text, image-text, and video-text finetuning with Lora in one run? I mean, put only text, text-image, and video-image samples in the same custom.json file and do the fine-tuning?

clownrat6 commented 3 weeks ago

Yes, you can. The LazysupervisedDataset in train.py unifies the processing of pure text, image-text, video-text data sample.