DAMO-NLP-SG / VideoLLaMA2

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
Apache License 2.0

An error occurred while loading video.json and audio.json #122

Closed — sjghh closed this issue 1 week ago

sjghh commented 1 week ago

Hello, I encountered an issue while fine-tuning with /scripts/custom/va_joint.sh. I passed two files, video.json and audio.json, like this: --data_path ${DATA_DIR}/stage3_video_audio.json,${DATA_DIR}/stage2_audio_subset_new.json. The error below occurred, but when I used only video.json, it did not appear. I suspect that three .json files are required, for example: --data_path ${DATA_DIR}/stage3_video_audio.json,${DATA_DIR}/stage2_audio_subset_new.json,${DATA_DIR}/stage2_video_subset.json.

Traceback (most recent call last):
  File "/data/hongbo.xu/Datasets/MC-ERU/Video-llama2/VideoLLaMA2-audio_visual/videollama2/train.py", line 683, in <module>
    train()
  File "/data/hongbo.xu/Datasets/MC-ERU/Video-llama2/VideoLLaMA2-audio_visual/videollama2/train.py", line 660, in train
    data_module = make_supervised_data_module(tokenizer=tokenizer, data_args=data_args)
  File "/data/hongbo.xu/Datasets/MC-ERU/Video-llama2/VideoLLaMA2-audio_visual/videollama2/train.py", line 433, in make_supervised_data_module
    train_dataset = LazySupervisedDataset(
  File "/data/hongbo.xu/Datasets/MC-ERU/Video-llama2/VideoLLaMA2-audio_visual/videollama2/train.py", line 274, in __init__
    raise NotImplementedError
NotImplementedError
Thank you for your help amidst your busy schedule!

LiangMeng89 commented 1 week ago

Hello, I'm a PhD student from ZJU. I also use VideoLLaMA2 in my own research, and we have created a WeChat group to discuss VideoLLaMA2 issues. Would you like to join us? Please contact me: WeChat number == LiangMeng19357260600, phone number == +86 19357260600, e-mail == liangmeng89@zju.edu.cn.

xinyifei99 commented 1 week ago

Hi, you can follow the settings in the code below to name the selected data files. [screenshot of the relevant code, not preserved in this transcript]
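Since the screenshot itself is not preserved here, the following is only a hypothetical sketch (not the actual VideoLLaMA2 code) of the kind of filename-keyword dispatch the traceback suggests: `LazySupervisedDataset` appears to classify each file in `--data_path` by its name and raises `NotImplementedError` for names it does not recognize. The function name and keywords below are assumptions for illustration.

```python
def resolve_dataset_type(data_path: str) -> str:
    """Guess the modality of an annotation file from its filename.

    Hypothetical reconstruction: the real loader's keywords may differ,
    but the traceback shows an unrecognized name raising NotImplementedError.
    """
    name = data_path.lower()
    if "video_audio" in name:
        return "video_audio"   # joint audio-visual annotations
    if "audio" in name:
        return "audio"         # audio-only annotations
    if "video" in name:
        return "video"         # video-only annotations
    # A file matching no known pattern would trigger the reported error.
    raise NotImplementedError(f"Unrecognized data file: {data_path}")

# The filenames from the issue all resolve cleanly under this scheme:
for path in ["stage3_video_audio.json",
             "stage2_audio_subset_new.json",
             "stage2_video_subset.json"]:
    print(path, "->", resolve_dataset_type(path))
```

If this is indeed the mechanism, the fix is to rename the annotation files so each one contains the keyword for its modality, rather than adding a third file.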

sjghh commented 1 week ago

Thank you again for your response, but I still have three questions:

  1. If I only have a video_path whose video contains audio, must this data be labeled as stage3 to proceed with joint training?
  2. For videos with audio, does the model automatically extract the audio to optimize the audio/video projector and audio encoder?
  3. If the audio can be extracted automatically, do I still need to separately extract the video and audio tracks from an audio-bearing video for training, in order to optimize the audio encoder and audio projector as well as the language model and spatio-temporal connector?

xinyifei99 commented 1 week ago

If there is only one video_path with audio, this data does not have to be marked as stage3 for joint training. For videos with audio, the model will automatically extract the audio to optimize the audio/video projector and audio encoder. The extracted audio will go through the audio branch, and the video will go through the video branch.
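As a concrete illustration of the maintainer's point, a single annotation entry can reference one audio-bearing video and nothing else; no stage3 marking or separate audio file is needed. The field names below (`video`, `conversations`, `from`, `value`) follow the common LLaVA-style format and are an assumption, not confirmed against the VideoLLaMA2 loader.

```python
# Hypothetical annotation entry in a LLaVA-style conversation format.
# Only the video path is given; the loader is said to extract the audio
# track itself, routing audio through the audio branch and frames through
# the video branch.
sample = {
    "id": "clip_0001",
    "video": "videos/clip_0001.mp4",  # file contains an audio track
    "conversations": [
        {"from": "human",
         "value": "<video>\nWhat is happening, and what can you hear?"},
        {"from": "gpt",
         "value": "A person is playing the piano; soft piano music is audible."},
    ],
}

# Note what is absent: no "audio" key and no stage label are required.
assert "audio" not in sample and "stage" not in sample
```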

sjghh commented 6 days ago

Thank you again for your response. I have another question: if I have only one .json file, can it include both image and video entries, or do I need to modify the image entries to follow the stage2 format?

sjghh commented 6 days ago

@xinyifei99 If a single .json file can only contain video entries, how can I train on videos and images simultaneously? Thank you for taking the time to answer!
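(Not an official answer, but one workaround consistent with the earlier naming advice: split a mixed annotation file into an image-only file and a video-only file, then pass both to `--data_path` as a comma-separated list. The modality keys `image`/`video` are assumed from LLaVA-style formats; the helper name is made up for this sketch.)

```python
import json

def split_mixed_annotations(mixed, image_out="stage2_image.json",
                            video_out="stage2_video.json"):
    """Split a mixed list of LLaVA-style entries into two files by modality.

    An entry is treated as an image sample if it has an "image" key and as
    a video sample if it has a "video" key (assumed schema).
    """
    images = [e for e in mixed if "image" in e]
    videos = [e for e in mixed if "video" in e]
    with open(image_out, "w") as f:
        json.dump(images, f, indent=2)
    with open(video_out, "w") as f:
        json.dump(videos, f, indent=2)
    return len(images), len(videos)
```

The two output files could then be supplied together, e.g. `--data_path ${DATA_DIR}/stage2_image.json,${DATA_DIR}/stage2_video.json`, so each file is single-modality and named for its contents.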

sjghh commented 6 days ago

I noticed that the AV version of the inference script does not include an example for image inference. Does that mean it cannot perform image inference?