dvlab-research / LLaMA-VID

Official Implementation for LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
Apache License 2.0

Confusion in pre-process images for long video #77

Closed zhuqiangLu closed 2 months ago

zhuqiangLu commented 3 months ago

Hi,

I am a bit confused by these lines: https://github.com/dvlab-research/LLaMA-VID/blob/d1074f3662a772d1b3c723416af59314ba593f67/llamavid/model/llamavid_arch.py#L430-L431

As far as I can tell, the last dimension of the shape should be the width (or height). A later line, https://github.com/dvlab-research/LLaMA-VID/blob/d1074f3662a772d1b3c723416af59314ba593f67/llamavid/model/llamavid_arch.py#L440, concatenates the image sequence along the first dimension, so I believe the shape of this images tensor should be $L \times C \times H \times W$.

Could you please verify whether line 430 should use the first dimension to set the `long_video` flag?
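
To make the shape reasoning concrete, here is a minimal sketch in plain Python that works on shape tuples only. It is not the repository's code: `concat_shape` is a hypothetical helper, the frame counts and the 224x224 resolution are made-up values, and the $L \times C \times H \times W$ layout is my assumption about the images tensor.

```python
from typing import Sequence, Tuple

Shape = Tuple[int, ...]


def concat_shape(shapes: Sequence[Shape], dim: int = 0) -> Shape:
    """Return the shape produced by concatenating tensors along `dim`.

    Hypothetical helper for illustration; `dim` must be non-negative.
    """
    first = shapes[0]
    for shape in shapes[1:]:
        if len(shape) != len(first):
            raise ValueError("all shapes must have the same number of dimensions")
        # Every dimension except `dim` has to match for a concatenation.
        for axis, (a, b) in enumerate(zip(shape, first)):
            if axis != dim and a != b:
                raise ValueError(f"size mismatch on axis {axis}: {a} vs {b}")
    out = list(first)
    out[dim] = sum(shape[dim] for shape in shapes)
    return tuple(out)


# Assumed layout: (L, C, H, W), i.e. frames, channels, height, width.
chunk_a = (300, 3, 224, 224)  # made-up sizes
chunk_b = (500, 3, 224, 224)

video = concat_shape([chunk_a, chunk_b], dim=0)

print(video)      # (800, 3, 224, 224)
print(video[0])   # 800: the number of frames grows with video length
print(video[-1])  # 224: the width stays fixed regardless of video length
```

Under this assumed layout, only the first dimension changes with the length of the video, which is why I would expect a length-based flag to look at index 0 rather than index -1.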