DAMO-NLP-SG / VideoLLaMA2

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
Apache License 2.0

About the training efficiency #22

Open lucasjinreal opened 3 months ago

lucasjinreal commented 3 months ago

Hi, VideoLLaMA 2 is an improvement over v1. However, I noticed that the training cost is somewhat high.

There is a step that expands each image into a pseudo-video: videos = [x.unsqueeze(0).expand(num_frames, -1, -1, -1) if modal == 'image' else x for x, modal in zip(images_or_videos, modalities)]

so that it has the same dimensions as a video, i.e. (b, t, c, h, w).

So if batch_size=10 and num_frames=8, the vision encoder effectively processes a batch of 80 frames, which is extremely large.
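A minimal sketch of the behavior being described (my own illustration, not the repository's exact code), assuming num_frames=8 and 336x336 inputs; it shows how repeating each image num_frames times inflates the effective batch the vision encoder sees:

```python
import torch

num_frames = 8
images_or_videos = [torch.randn(3, 336, 336),                # an image: (c, h, w)
                    torch.randn(num_frames, 3, 336, 336)]    # a video: (t, c, h, w)
modalities = ['image', 'video']

# The expansion discussed above: every image is repeated num_frames times
# so it can be stacked with real videos into one (b, t, c, h, w) batch.
videos = [x.unsqueeze(0).expand(num_frames, -1, -1, -1) if modal == 'image' else x
          for x, modal in zip(images_or_videos, modalities)]

batch = torch.stack(videos)   # (b, t, c, h, w) = (2, 8, 3, 336, 336)
flat = batch.flatten(0, 1)    # (b * t, c, h, w): 16 frames go through the ViT
print(flat.shape)             # torch.Size([16, 3, 336, 336])
# With batch_size=10 and num_frames=8, flat would hold 80 frames.
```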

Why not consider another approach to avoid this? Maybe do not expand the images?

lixin4ever commented 3 months ago

Well, that's a long story. In short, we adopt this implementation because it is the simplest way to support batch encoding for a mix of videos and images (although it wastes some compute). You can, of course, choose not to expand images, but then you have to encode the in-batch videos and images one by one via a for loop, which wastes no compute but is much slower than batch encoding. Besides, since video data makes up the majority of the training set, the wasted compute is actually not that much.
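For concreteness, a minimal sketch of the loop-based alternative mentioned here, with a stand-in `vision_encoder` (a placeholder, not the project's actual ViT): no compute is wasted on repeated image frames, but each sample triggers its own encoder call.

```python
import torch
import torch.nn as nn

vision_encoder = nn.Conv2d(3, 16, kernel_size=14, stride=14)  # stand-in for the real ViT

def encode_one_by_one(images_or_videos, modalities):
    features = []
    for x, modal in zip(images_or_videos, modalities):
        frames = x.unsqueeze(0) if modal == 'image' else x     # (t, c, h, w), t=1 for images
        features.append(vision_encoder(frames))                # one encoder call per sample
    return features
```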

lucasjinreal commented 3 months ago

I mean, you don't have to give up batching in the ViT; you can still batch all frames, but once the features come out, simply split them back by each sample's original number of frames and compute the temporal resampling etc. accordingly.
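A minimal sketch of this suggestion (my own illustration, not code from the repo), again with a placeholder `vision_encoder`: concatenate only the real frames, run one batched encoder pass, then regroup the features by each sample's original frame count before any temporal resampling.

```python
import torch
import torch.nn as nn

vision_encoder = nn.Conv2d(3, 16, kernel_size=14, stride=14)  # stand-in for the real ViT

def encode_without_expansion(images_or_videos, modalities):
    frames = [x.unsqueeze(0) if m == 'image' else x            # images keep a single frame
              for x, m in zip(images_or_videos, modalities)]
    counts = [f.shape[0] for f in frames]                      # original frame count per sample
    flat = torch.cat(frames, dim=0)                            # (sum(t_i), c, h, w), no duplicates
    feats = vision_encoder(flat)                               # one batched encoder call
    per_sample = torch.split(feats, counts, dim=0)             # regroup by original frame counts
    # temporal resampling / pooling would then run per element of per_sample
    return per_sample
```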