Closed dragen1860 closed 1 month ago
@dragen1860 Hi, you can refer to our processing code. The default setting is
fps=2, min_frames=4, max_frames=768, min_pixels_perframe=128x28x28, max_pixels_perframe=768x28x28.
This means that even if a video is longer than 20 min, we still limit its max_frames to 768. You can refer to video-mme for a quantitative evaluation of the model on long videos (this evaluation includes videos longer than 20 minutes and up to 1 hour; Qwen2-VL ranks 2nd among all models).
Hi @logicwong,
768 frames at 2 frames per second means a video duration of 6.4 min, not 20 min. Or is fps supposed to be 0.5 (i.e., 1 frame every 2 seconds), not 2?
Best, Orr
@orrzohar During Qwen2-VL training, videos with varying fps are allowed. For videos under 512s, fps is mainly set to 2. For longer videos, fps is reduced to avoid OOM due to excessive sequence length. In our practice, max_frames is set to 768, which performs well for 20-minute videos.
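The arithmetic in this exchange can be sketched as follows. This is a minimal illustration rather than the repository's actual code: the helper name planned_frames is hypothetical, and it assumes the frame count is simply duration times fps, clamped to [min_frames, max_frames] with the default values quoted earlier in the thread.

```python
def planned_frames(duration_s, fps=2.0, min_frames=4, max_frames=768):
    """Hypothetical helper: frame count and effective fps after clamping.

    Assumes frames = duration * fps, clamped to [min_frames, max_frames].
    """
    wanted = int(duration_s * fps)
    n_frames = max(min_frames, min(wanted, max_frames))
    effective_fps = n_frames / duration_s
    return n_frames, effective_fps


# Durations in seconds: 5 min, 6.4 min, 20 min, 1 hour.
for seconds in (300, 384, 1200, 3600):
    n_frames, eff = planned_frames(seconds)
    print(f"{seconds:>5} s -> {n_frames} frames, effective fps {eff:.2f}")
```

Under these assumptions, the cap is reached at 384 s (6.4 min); a 20-minute video still gets 768 frames, which corresponds to an effective rate of 0.64 fps rather than 2.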
Hi @logicwong,
I am running into the following when trying to SFT qwen2VL on a custom video dataset:
How did you manage the fact that your vision encoder has a variable batch size during training? Wouldn't the number of frames changing every step interfere with FSDP, which would expect a constant batch size per node? For example, one node may need to encode 768 frames while another needs to encode only 20. How would 768 frames even fit in the memory of a single GPU?
Best, Orr
OK, I get your idea. But the claim that it "performs well on 20min video" is not precise. I suggest claiming support for up to 768 frames instead of 20 min directly.
@orrzohar
For more details, please refer to our paper.
I see. Claiming support for 20 min is more intuitive for regular users. In fact, Qwen2-VL still performs well with 1280 frames, as Figure 5 in the paper shows. The 80K token limit allows for up to 1280 frames; 768 is a robust number for video-mme long videos.
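The frame and token figures in this reply can be related with a rough calculation. This sketch makes assumptions that the thread itself does not state: that each 28x28 pixel area yields one visual token, that consecutive frames are merged in pairs before tokenization, and that every frame uses the minimum per-frame budget of 128x28x28 pixels. The function name video_tokens is hypothetical, and special tokens and text tokens are ignored.

```python
def video_tokens(n_frames,
                 pixels_per_frame=128 * 28 * 28,  # min_pixels_perframe from the thread
                 pixels_per_token=28 * 28,        # assumption: one token per 28x28 area
                 frames_per_group=2):             # assumption: frames merged in pairs
    """Hypothetical estimate of visual tokens for a sampled video."""
    groups = n_frames // frames_per_group
    tokens_per_group = pixels_per_frame // pixels_per_token
    return groups * tokens_per_group


print(video_tokens(1280))  # 640 groups * 128 tokens = 81920
print(video_tokens(768))   # 384 groups * 128 tokens = 49152
```

Under these assumptions, 1280 frames at the minimum resolution come to 81,920 tokens, which is in line with the 80K limit mentioned above, while 768 frames leave headroom for higher per-frame resolution.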
So for a 20-minute video, qwen2-vl can only see the first 6.4 minutes of it? Is this correct? Thank you.
My guess is that when a video exceeds 768 frames, a downsampling strategy is used to generate the final 768 frames.
No, we sample 768 frames uniformly.
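Uniform sampling means the selected frames span the whole video instead of only its first 6.4 minutes. A minimal sketch of one way to pick evenly spaced frame indices is shown below; the function name uniform_frame_indices is hypothetical, and the actual processing code may round or space the indices differently.

```python
def uniform_frame_indices(total_frames, n_samples=768):
    """Hypothetical helper: evenly spaced frame indices across a video."""
    if total_frames <= n_samples:
        # Short video: keep every frame.
        return list(range(total_frames))
    if n_samples < 2:
        raise ValueError("n_samples must be at least 2")
    # Spread n_samples indices from the first frame to the last frame.
    step = (total_frames - 1) / (n_samples - 1)
    return [round(i * step) for i in range(n_samples)]


# Example: a 20-minute video, assumed here to be recorded at 30 fps.
indices = uniform_frame_indices(20 * 60 * 30)
print(len(indices), indices[0], indices[-1])
```

With this scheme, the first and last sampled indices coincide with the first and last frames of the video, so the model sees content from the entire duration at a reduced temporal rate.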
What is the difference between the number of frames and limit_mm_per_prompt={"image": 20, "video": 10}?
Hi, it is claimed that Qwen2-VL supports videos up to 20 min long, but I can't find details on how to run inference on a 20-min video. What sampled fps and max tokens did you use when making that claim?