THUDM / LVBench

LVBench: An Extreme Long Video Understanding Benchmark
https://lvbench.github.io

Evaluation details about Qwen2-VL-72B #7

Open lan-lw opened 1 week ago

lan-lw commented 1 week ago

Thank you for your interesting work!

For the Qwen2-VL-72B model, are you using the whole video or just sampling 48 frames per video? If you used the whole video, what was the fps?

huangshiyu13 commented 1 week ago

We use 48 frames for each video.

lan-lw commented 1 week ago

Thank you for your response.

Are you uniformly sampling the 48 frames? For tasks like temporal grounding, I am wondering whether 48 frames are enough for precise localization.

huangshiyu13 commented 6 days ago

> Are you uniformly sampling the 48 frames? For tasks like temporal grounding, I am wondering whether 48 frames are enough for precise localization.

Yes, we uniformly sample 48 frames. We use 48 because we found that 48 is the maximum number of frames we can feed into the model; exceeding that easily causes out-of-memory errors.
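For reference, "uniformly sample" here just means picking 48 evenly spaced frame indices across the full video. Below is a minimal sketch of that sampling (an illustration, not the repo's actual evaluation code), assuming OpenCV and NumPy:

```python
import cv2
import numpy as np

def sample_frames_uniform(video_path: str, num_frames: int = 48):
    """Uniformly sample `num_frames` frames across the whole video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # 48 evenly spaced indices over [0, total - 1]; for a 1-hour video
    # this works out to roughly one frame every 75 seconds.
    indices = np.linspace(0, total - 1, num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```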

IssacCyj commented 6 days ago

> Yes, we uniformly sample 48 frames. We use 48 because we found that 48 is the maximum number of frames we can feed into the model; exceeding that easily causes out-of-memory errors.

Just to double-check: are you sampling the 48 frames from the hour-long video, or did I misunderstand the answer? That would mean only about one frame every 1-2 minutes of video (3600 s / 48 ≈ 75 s per frame), which sounds confusing to me, given that the temporal grounding task has second-level annotations.

huangshiyu13 commented 6 days ago

> Just to double-check: are you sampling the 48 frames from the hour-long video, or did I misunderstand the answer? That would mean only about one frame every 1-2 minutes of video (3600 s / 48 ≈ 75 s per frame), which sounds confusing to me, given that the temporal grounding task has second-level annotations.

Yes, the Qwen2-VL-72B model samples 48 frames per video. The main reason is that we cannot feed more frames into the deployed model. If you can evaluate Qwen2-VL-72B with more frames, you are welcome to submit the results to us; after review, we can update the leaderboard with the new numbers.

IssacCyj commented 6 days ago

Thanks for the quick response!