Closed zhuqiangLu closed 2 months ago
Hi,
I am a bit confused by https://github.com/dvlab-research/LLaMA-VID/blob/d1074f3662a772d1b3c723416af59314ba593f67/llamavid/model/llamavid_arch.py#L430-L431
since the last dimension of the shape should be the width (or height). The later line https://github.com/dvlab-research/LLaMA-VID/blob/d1074f3662a772d1b3c723416af59314ba593f67/llamavid/model/llamavid_arch.py#L440 concatenates the image sequence along the first dimension, so I believe the images tensor shape should be $L \times C \times H \times W$.
Could you please verify whether line 430 should use the first dimension to set the `long_video` flag?
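To illustrate the concern, here is a minimal sketch. It is not the repository's code: shapes are modeled as plain tuples, and `FRAME_THRESHOLD` and `is_long_video` are hypothetical names with an assumed cutoff value, chosen only for this example.

```python
# Illustrative only: shapes are plain tuples, and FRAME_THRESHOLD and
# is_long_video are hypothetical names, not taken from LLaMA-VID.
FRAME_THRESHOLD = 1000  # assumed cutoff, chosen only for this example


def is_long_video(shape, axis):
    """Return True if the size along `axis` exceeds the assumed threshold."""
    return shape[axis] > FRAME_THRESHOLD


# A clip laid out as L x C x H x W: 1500 frames of 3 x 224 x 224.
clip_shape = (1500, 3, 224, 224)

flag_from_last_dim = is_long_video(clip_shape, axis=-1)  # compares W = 224
flag_from_first_dim = is_long_video(clip_shape, axis=0)  # compares L = 1500

print(flag_from_last_dim, flag_from_first_dim)
```

If the tensor really is $L \times C \times H \times W$, a check on the last dimension compares the image width against the cutoff, so the flag would depend on resolution and not on the number of frames.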