Open lucasjinreal opened 3 months ago
Well, that's a long story. In short, we adopt this implementation because it is the simplest way to support batch encoding for a mix of videos and images (although there is some wasted compute). Of course, you can choose not to expand images, but then you have to encode the in-batch videos and images one by one via a for loop, which wastes no compute but is much slower than batch encoding. Besides, since video data is still the majority of the training set, the wasted compute is actually not that much.
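To illustrate the trade-off described above, here is a minimal sketch of the two options. The encoder, function names, and shapes are assumptions for illustration, not the repo's actual code: `encoder` stands in for the ViT and maps `(N, C, H, W)` frames to `(N, D)` features.

```python
import torch

def encode_mixed_batch_expanded(items, modalities, encoder, num_frames=8):
    """Batch-encode a mix of images and videos by expanding each image
    to num_frames identical frames, so every sample shares one shape."""
    videos = [x.unsqueeze(0).expand(num_frames, -1, -1, -1) if m == 'image' else x
              for x, m in zip(items, modalities)]
    batch = torch.stack(videos)                     # (B, T, C, H, W)
    b, t, c, h, w = batch.shape
    feats = encoder(batch.reshape(b * t, c, h, w))  # one big ViT forward
    return feats.reshape(b, t, -1)                  # (B, T, D)

def encode_mixed_batch_loop(items, modalities, encoder):
    """Zero-waste alternative: encode each sample separately (slower)."""
    out = []
    for x, m in zip(items, modalities):
        frames = x.unsqueeze(0) if m == 'image' else x  # (T, C, H, W)
        out.append(encoder(frames))                     # (T, D)
    return out
```

The expanded version pays `num_frames` forward passes per image (all on identical frames) in exchange for a single large, GPU-friendly batch; the loop version does the minimum compute but serializes the encoder calls.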
I mean, you don't need to give up batching in the ViT. You can still batch, but when the features come out, simply split them back by each sample's original frame count and compute the temporal resampling etc. accordingly.
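A minimal sketch of this suggestion (the function name and shapes are assumptions): concatenate all frames without expanding images, run one batched ViT forward, then split the features back per sample by the original frame counts.

```python
import torch

def encode_packed(items, modalities, encoder):
    """Batch the ViT over the concatenated frames of all samples,
    with no image expansion, then split features back per sample."""
    frames = [x.unsqueeze(0) if m == 'image' else x  # images become (1, C, H, W)
              for x, m in zip(items, modalities)]
    counts = [f.shape[0] for f in frames]            # original frame counts
    feats = encoder(torch.cat(frames, dim=0))        # one forward, zero waste
    return list(torch.split(feats, counts, dim=0))   # per-sample feature chunks
```

Each returned chunk keeps its own temporal length (1 for images, T for videos), so any downstream temporal resampling can be applied per sample.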
Hi, VideoLLaMA2 is a nice improvement over v1. However, I noticed that the compute cost is a little high.
Each image is expanded to the same dimensions as a video, i.e. (b, t, c, h, w):

```python
videos = [x.unsqueeze(0).expand(num_frames, -1, -1, -1) if modal == 'image' else x
          for x, modal in zip(images_or_videos, modalities)]
```
So if batchSize=10 and numFrames=8, the vision encoder effectively sees a batch of 80 frames, which is extremely large.
Why not consider another way to avoid this? Maybe do not expand images?