Open lan-lw opened 1 week ago
We use 48 frames for each video.
Thank you for your response.
Are you uniformly sampling 48 frames? I am wondering whether 48 frames are enough for precise localization in tasks like temporal grounding.
Yes, we uniformly sample 48 frames. We used 48 frames because we found that this is the maximum number of frames we can input; exceeding 48 frames can easily cause out-of-memory errors.
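For readers unfamiliar with the term, here is a minimal sketch of one common way to sample 48 frames uniformly. It is not the authors' evaluation code: the function name `uniform_frame_indices` and the choice of taking the middle frame of each segment are assumptions made for illustration.

```python
def uniform_frame_indices(total_frames: int, num_samples: int = 48) -> list[int]:
    """Return frame indices spread evenly over a video.

    Hypothetical helper: the video is split into `num_samples` equal-length
    segments and the middle frame of each segment is picked.
    """
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    # A video with fewer frames than requested is used in full.
    if total_frames <= num_samples:
        return list(range(total_frames))
    segment = total_frames / num_samples
    return [int(segment * (i + 0.5)) for i in range(num_samples)]


# Example: a one-hour video decoded at an assumed 25 fps has 90,000 frames.
indices = uniform_frame_indices(90_000)
```

The returned indices can then be passed to whatever video decoder is in use. The memory cost of the model input grows with the number of sampled frames, which is why the cap matters.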
Just to double-check: are you sampling 48 frames from the hour-long videos, or did I misunderstand the answer? That would mean you use only 1 frame for every 1-2 minutes of video, which sounds confusing to me, given that the temporal grounding task has second-level annotations.
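The arithmetic behind the "1 frame every 1-2 minutes" estimate can be checked with a short sketch. The helper name `seconds_between_samples` and the example durations are assumptions chosen for illustration, not figures taken from the benchmark.

```python
def seconds_between_samples(duration_s: float, num_frames: int = 48) -> float:
    """Average gap in seconds between consecutive uniformly sampled frames."""
    if duration_s <= 0 or num_frames <= 0:
        raise ValueError("duration_s and num_frames must be positive")
    return duration_s / num_frames


# Illustrative video lengths in minutes (assumed, not from the dataset).
gaps = {m: seconds_between_samples(m * 60) for m in (48, 60, 96)}
for minutes, gap in gaps.items():
    print(f"{minutes} min video: one frame every {gap:.0f} s")
```

For a 60-minute video this gives one frame every 75 seconds, so an annotated moment can lie more than half a minute away from the nearest sampled frame.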
Yes, for the Qwen2-VL-72B model we sample an average of 48 frames per video. The main reason is that we cannot feed more frames into the deployed model. If you can evaluate Qwen2-VL-72B using more frames, you are welcome to submit the evaluation results to us. After reviewing them, we can update the leaderboard with the new results.
Thanks for the quick response!
Thank you for your interesting work!
For the Qwen2-VL-72B model, are you using the whole video or just sampling 48 frames per video? If you used the whole video, what is the fps?