Closed liziming5353 closed 6 months ago
Hi, 224 is for raw image input. If the long video has been preprocessed with the image encoder, the last dimension is 1408.
I am still puzzled. 1408 is the output dim of eva_vit. The 'images' in line 430 have not been fed to eva_vit, so the last dimension must be 224, right?
Hi, for long videos, we first extract features from the raw images using eva_vit, which is loaded from a local file at this line.
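To make the distinction concrete, here is a minimal sketch of the kind of check being discussed. It is not the code from llamavid_arch.py: the function name, the constant names, and the example shapes (including the token count of 257) are assumptions made up for illustration. Only the two numbers 224 and 1408 come from this thread.

```python
# Hypothetical helper, not part of LLaMA-VID: decide from a tensor shape
# whether the input is raw frames or features already produced by eva_vit.

RAW_IMAGE_SIZE = 224        # last dim of a raw image input (per this thread)
EVA_VIT_FEATURE_DIM = 1408  # output dim of eva_vit (per this thread)


def looks_like_vit_features(shape: tuple) -> bool:
    """Return True if the last dimension equals the eva_vit feature width."""
    return shape[-1] == EVA_VIT_FEATURE_DIM


# Assumed layouts, for illustration only:
#   raw frames        -> (num_frames, channels, height, width)
#   extracted features -> (num_frames, num_tokens, feature_dim)
raw_shape = (8, 3, RAW_IMAGE_SIZE, RAW_IMAGE_SIZE)
feature_shape = (8, 257, EVA_VIT_FEATURE_DIM)

print(looks_like_vit_features(raw_shape))      # False
print(looks_like_vit_features(feature_shape))  # True
```

Under these assumptions, a check on the last dimension only separates the two cases if the long-video input has already gone through the image encoder, which is the situation described above.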
I see. Thank you!
In line 430 of llamavid_arch.py, the condition should be images[0].shape[0] > 1000, since images[0].shape[-1] is always 224.