Closed TobiasLee closed 4 months ago
Hi, thanks for your advice.
Since the core contribution of VoCo-LLaMA is visual compression, we did not report results on additional video benchmarks in the article; we will explore VoCo-LLaMA's performance on longer video benchmarks in follow-up work. As for VITATECS, I think this is a very interesting and worthwhile topic. We found that as the input video sequence grows, higher compression rates (e.g., compressing hundreds of vision tokens into one) can severely degrade temporal understanding, even though we trained the model with temporal modelling. We'll try VITATECS later.
Thanks again.
Hi, thanks for your awesome work.
I am an author of Video-MME and would like to invite you to evaluate your VoCo framework on Video-MME, which may better demonstrate your advantages on long videos compared with short-video benchmarks such as MSRVTT and ActivityNetQA.
By the way, we are also very interested in whether compression affects temporal understanding. Our benchmarks VITATECS (ECCV 2024, https://github.com/lscpku/VITATECS) and TempCompass (https://huggingface.co/spaces/lyx97/TempCompass) would also welcome your participation.