Open xmc-andy opened 1 year ago
Yes, your understanding is largely correct! I suggest you use DC mode with the Video pretrained weights. As you can see in our web demo, the backend model there is Video-LLaMA7B-DC.
Remember to put the multiple images as frames in the F dimension of the [B, T, F, C, H, W] input (set a breakpoint at vision_x to see the actual dimensions during your training).
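A minimal sketch of that layout, in case it helps. It uses NumPy only to illustrate the shapes; in training this would be a framework tensor. The helper name `stack_images_as_frames`, the 224x224 resolution, and B = T = 1 are assumptions for illustration, not taken from the repo:

```python
import numpy as np

def stack_images_as_frames(images):
    """Stack N images of shape (C, H, W) into an array of shape (1, 1, N, C, H, W).

    Hypothetical helper: assumes a batch of one sample (B=1) and a single
    media slot (T=1), so all images land in the F (frames) dimension.
    """
    frames = np.stack(images, axis=0)            # (F, C, H, W)
    return frames[np.newaxis, np.newaxis, ...]   # (B=1, T=1, F, C, H, W)

# Four dummy RGB images standing in for the real, preprocessed inputs.
images = [np.zeros((3, 224, 224), dtype=np.float32) for _ in range(4)]
vision_x = stack_images_as_frames(images)
print(vision_x.shape)  # (1, 1, 4, 3, 224, 224)
```

Checking `vision_x.shape` like this right before the forward pass is a quick way to confirm the images ended up in F rather than T.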
I would also suggest trying both templates:
1. <image> + prompt
2. <image><image>...<image> + prompt
For training DC, we use the first.
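The two templates above could be built like this. This is only a sketch: `build_prompt` is a hypothetical helper, and joining the `<image>` tokens and the prompt with no separator is an assumption, so check it against the format your training data actually uses:

```python
def build_prompt(prompt, num_images, repeat_image_token=False):
    """Return the text input for one sample.

    repeat_image_token=False -> template 1: "<image>" + prompt
    repeat_image_token=True  -> template 2: one "<image>" per image + prompt
    """
    n_tokens = num_images if repeat_image_token else 1
    return "<image>" * n_tokens + prompt

# Template 1: a single <image> token covers all frames.
single = build_prompt("Classify these images.", num_images=3)

# Template 2: one <image> token per input image.
repeated = build_prompt("Classify these images.", num_images=3,
                        repeat_image_token=True)
```

Keeping the choice behind one flag makes it easy to train with each template and compare results.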
Thank you so much!
Hello, authors! I have a question about choosing a dataset format and the corresponding weights. I am doing a classification task whose input is multiple images plus a prompt. If the multiple images are treated as a video, there are two options: SD format (single `<image>` + single prompt, where `<image>` represents all images) and DC mode (single `<image>` + multiple prompts). As I understand it, the difference lies in how the prompt is used: DC mode suits the case where each picture has its own detailed prompt, while SD mode suits the case where all pictures share one prompt. Is my understanding correct?
In addition, I previously used the Image-MPT7B weights in SD mode, but it seems that the Video-LLaMA7B-DenseCaption weights in DC/SD mode are better suited to treating the images as video frames. Is my understanding correct?