Yxxxb / VoCo-LLaMA

This repo is the official implementation of "VoCo-LLaMA: Towards Vision Compression with Large Language Models".
https://yxxxb.github.io/VoCo-LLaMA-page/
Apache License 2.0

Evaluation on Video-MME & Temporal Understanding benchmarks #10

Closed: TobiasLee closed this issue 4 months ago

TobiasLee commented 4 months ago

Hi, thanks for your awesome work.

I am the author of Video-MME and would like to invite you to evaluate your VoCo framework on Video-MME, which might better demonstrate your advantages on long videos, as opposed to short-video benchmarks such as MSRVTT and ActivityNet-QA.

By the way, we are also really interested in whether the compression would influence temporal understanding. Our benchmarks VITATECS (ECCV 2024, https://github.com/lscpku/VITATECS) and TempCompass (https://huggingface.co/spaces/lyx97/TempCompass) would also welcome your participation.

Yxxxb commented 4 months ago

Hi, thanks for your advice.

Since the core contribution of VoCo-LLaMA is visual compression, we did not report results on more video benchmarks in the paper; we will explore VoCo-LLaMA's performance on longer video benchmarks in follow-up work. As for VITATECS, I think this is a very interesting and worthwhile topic. We found that as the input video sequence grows, higher compression rates (e.g., compressing hundreds of vision tokens into one) can severely degrade temporal understanding, even though we trained the model on temporal modeling. We will try VITATECS later.
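
For a sense of scale, here is a minimal back-of-the-envelope sketch of why compression rates climb so quickly on video inputs. It is not from the VoCo-LLaMA codebase; the 576-tokens-per-frame figure is an assumption based on a LLaVA-style CLIP ViT-L/14-336 encoder (a 24x24 patch grid), and the single-VoCo-token setting is just one illustrative configuration.

```python
# Illustrative arithmetic only (not from the VoCo-LLaMA codebase):
# how the effective compression rate grows with video length when
# every frame's vision tokens are distilled into a fixed, small
# number of VoCo tokens.

TOKENS_PER_FRAME = 576  # assumed: LLaVA-style 24x24 ViT patch grid
VOCO_TOKENS = 1         # assumed: all vision tokens -> one VoCo token

for num_frames in (1, 8, 32, 128):
    vision_tokens = num_frames * TOKENS_PER_FRAME
    ratio = vision_tokens / VOCO_TOKENS
    print(f"{num_frames:>4} frames -> {vision_tokens:>6} vision tokens, "
          f"compression rate {ratio:,.0f}x")
```

Even at 8 frames this setting already compresses thousands of vision tokens into one, which is consistent with the observation above that very high compression rates on long sequences can hurt temporal understanding.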

Thanks again.