dvlab-research / LLaMA-VID

Official Implementation for LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
Apache License 2.0

A code error #32

Closed by liziming5353 6 months ago

liziming5353 commented 6 months ago

At line 430 of llamavid_arch.py, I think the condition should be `if images[0].shape[0] > 1000`, because images[0].shape[-1] is always 224.

yanwei-li commented 6 months ago

Hi, 224 is the last dimension for raw image input. If the long video has already been preprocessed with the image encoder, the last dimension is 1408.

liziming5353 commented 6 months ago

I am still puzzled. 1408 is the output dim of eva_vit. The 'images' at line 430 have not been fed to eva_vit yet, so the last dimension must be 224, right?

yanwei-li commented 6 months ago

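Hi, for long videos, we first extract features from the raw images using eva_vit, which is loaded from a local file at this line. So by the time the input reaches line 430, it already holds 1408-dim features rather than 224-pixel images.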
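
For readers following along, here is a minimal sketch of the distinction being discussed. It is not the repository's code: the helper name `is_preextracted`, the example shapes, and the token count of 257 are assumptions for illustration. Only the values 224, 1408, and the threshold 1000 come from this thread. It works on plain shape tuples so it runs without PyTorch.

```python
# Illustrative sketch only; names and example shapes are hypothetical.
RAW_IMAGE_SIZE = 224         # last dim of a raw image, per the thread
EVA_VIT_FEATURE_DIM = 1408   # last dim of eva_vit features, per the thread
THRESHOLD = 1000             # cutoff value mentioned in the thread


def is_preextracted(shape):
    """Guess whether a tensor shape holds pre-extracted features.

    Raw frames end in the image width (224), while features already
    produced by the image encoder end in the feature dim (1408), so a
    check on the last dimension separates the two cases.
    """
    return shape[-1] > THRESHOLD


# Hypothetical shapes: one raw frame as (channels, height, width) and
# a long video stored as (frames, tokens, feature_dim).
raw_frame_shape = (3, RAW_IMAGE_SIZE, RAW_IMAGE_SIZE)
long_video_shape = (5000, 257, EVA_VIT_FEATURE_DIM)

for name, shape in [("raw", raw_frame_shape), ("long video", long_video_shape)]:
    kind = "pre-extracted features" if is_preextracted(shape) else "raw image"
    print(f"{name}: {shape} -> {kind}")
```

Under these assumptions, a check on the last dimension does not depend on how many frames the video has, whereas a check on `shape[0]` would.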

liziming5353 commented 6 months ago

I see. Thank you!