QwenLM / Qwen2-VL

Qwen2-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud.
Apache License 2.0

How does the model support 20-minute-long video, as claimed? #93

Closed dragen1860 closed 1 month ago

dragen1860 commented 2 months ago

Hi, it is claimed that Qwen2-VL supports up to 20-minute-long videos, but I cannot find the details of how to run inference on a 20-minute video. What sampled fps and max tokens did you use when making that claim?

logicwong commented 2 months ago

@dragen1860 Hi, you can refer to our processing code. The default setting is fps=2, min_frames=4, max_frames=768, min_pixels_perframe=128x28x28, max_pixels_perframe=768x28x28. This means that even if a video is longer than 20 minutes, we still limit its max_frames to 768. You can refer to Video-MME for a quantitative evaluation of the model on long videos (this evaluation includes videos longer than 20 minutes and up to 1 hour; Qwen2-VL ranks 2nd among all models).
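For readers trying to reproduce this, a minimal sketch of how those defaults interact (the function name and exact rounding are assumptions, not the repository's preprocessing code):

```python
def plan_video_sampling(duration_s: float,
                        fps: float = 2.0,
                        min_frames: int = 4,
                        max_frames: int = 768) -> int:
    """Number of frames implied by the default sampling settings above."""
    # Sample at the requested fps, then clamp to [min_frames, max_frames].
    n_frames = int(round(duration_s * fps))
    return max(min_frames, min(n_frames, max_frames))

# A 20-minute (1200 s) video at fps=2 would yield 2400 frames,
# but is capped at max_frames=768 (roughly one frame every 1.56 s).
print(plan_video_sampling(20 * 60))  # -> 768
```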

orrzohar commented 1 month ago

Hi @logicwong,

768 frames at 2 frames per second means a video duration of 6.4 min, not 20 min. Or is fps supposed to be 0.5 (i.e., 1 frame every 2 seconds), not 2?

Best, Orr

logicwong commented 1 month ago

@orrzohar During Qwen2-VL training, videos with varying fps are allowed. For videos under 512 s, fps is mainly set to 2; for longer videos, fps is reduced to avoid OOM due to excessive sequence length. In our practice, max_frames is set to 768, which performs well for 20-minute videos.
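As a back-of-the-envelope check (not from the repository), fixing max_frames at 768 means the effective sampling rate drops as the video gets longer:

```python
# Effective fps when 768 frames are spread over the whole video.
for minutes in (6.4, 20, 60):
    duration_s = minutes * 60
    print(f"{minutes:>4} min -> {768 / duration_s:.2f} fps")
# ~2.00 fps at 6.4 min, ~0.64 fps at 20 min, ~0.21 fps at 60 min
```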

orrzohar commented 1 month ago

Hi @logicwong,

I am running into the following issue when trying to SFT Qwen2-VL on a custom video dataset:

How did you manage the fact that your vision encoder has a variable batch size during training? Wouldn't the number of frames changing every step mess with FSDP, which would expect a constant batch size per node? I.e., on one node you may need to encode 768 frames, while on another only 20. How would 768 frames even fit in the memory of a single GPU?

Best, Orr

dragen1860 commented 1 month ago

OK, I got your idea, but the claim that it performs well on 20-minute videos is not precise. I suggest you claim support for up to 768 frames instead of 20 minutes directly.

logicwong commented 1 month ago

@orrzohar

  1. In our code, we combine multiple samples into one sequence. The model input has just two dimensions (seq_len, hidden_size). We use attention masks to keep different samples separate.
  2. We didn't use FSDP for training.
  3. To balance the computational demands of long video processing with overall training efficiency, we dynamically adjust the resolution of each video frame, limiting the total number of tokens per video to 16384.

More details can be found in our paper.
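A minimal sketch of the two ideas in points 1 and 3 above (sample packing with a block-diagonal attention mask, and a per-video token budget); the function names and the exact budget arithmetic are illustrative assumptions, not the actual training code:

```python
import torch

MAX_VIDEO_TOKENS = 16384  # per-video cap mentioned in point 3

def tokens_per_frame(n_frames: int, budget: int = MAX_VIDEO_TOKENS) -> int:
    """Per-frame token allowance once the video-level budget is spread over all frames."""
    # Assumed: frame resolution is lowered until frame_tokens * n_frames <= budget.
    return max(1, budget // n_frames)

def pack_samples(samples):
    """Concatenate variable-length samples into one (seq_len, hidden_size) sequence.

    A block-diagonal attention mask keeps the packed samples independent,
    so the model never sees a ragged batch dimension (cf. point 1).
    """
    packed = torch.cat(samples, dim=0)                 # (total_len, hidden_size)
    total = packed.shape[0]
    mask = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for s in samples:
        n = s.shape[0]
        mask[start:start + n, start:start + n] = True  # each sample attends only to itself
        start += n
    return packed, mask

packed, mask = pack_samples([torch.randn(5, 8), torch.randn(3, 8)])
print(packed.shape, mask.shape)   # torch.Size([8, 8]) torch.Size([8, 8])
# 768 frames under the 16384-token budget leave ~21 tokens per frame
# (ignoring the 2-frame temporal merging, which would double this).
print(tokens_per_frame(768))      # -> 21
```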

logicwong commented 1 month ago

> OK, I got your idea, but the claim that it performs well on 20-minute videos is not precise. I suggest you claim support for up to 768 frames instead of 20 minutes directly.

I see. Claiming support for 20 minutes is more intuitive for regular users. In fact, Qwen2-VL still performs well with 1280 frames, as Figure 5 in the paper shows. The 80K token limit allows for up to 1280 frames; 768 is a robust number for Video-MME long videos.
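For anyone checking that arithmetic, here is one way the 1280-frame figure follows from the 80K token limit, assuming the minimum per-frame budget (128 x 28 x 28 pixels, i.e. 128 spatial tokens) and the 2-frame temporal patching described in the paper:

```python
# Rough arithmetic behind "80K tokens ~ 1280 frames".
spatial_tokens_per_frame = 128                 # at min_pixels_perframe = 128x28x28
temporal_merge = 2                             # two consecutive frames share tokens
tokens_per_frame = spatial_tokens_per_frame / temporal_merge   # 64
print(1280 * tokens_per_frame)                 # 81920.0, i.e. ~80K tokens
```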

DaozeZhang commented 1 month ago

> OK, I got your idea, but the claim that it performs well on 20-minute videos is not precise. I suggest you claim support for up to 768 frames instead of 20 minutes directly.

> I see. Claiming support for 20 minutes is more intuitive for regular users. In fact, Qwen2-VL still performs well with 1280 frames, as Figure 5 in the paper shows. The 80K token limit allows for up to 1280 frames; 768 is a robust number for Video-MME long videos.

So for a 20-minute video, Qwen2-VL can only see the first 6.4 minutes of it? Is this correct? Thank you.

PaulWongDlut commented 3 weeks ago

> OK, I got your idea, but the claim that it performs well on 20-minute videos is not precise. I suggest you claim support for up to 768 frames instead of 20 minutes directly.

> I see. Claiming support for 20 minutes is more intuitive for regular users. In fact, Qwen2-VL still performs well with 1280 frames, as Figure 5 in the paper shows. The 80K token limit allows for up to 1280 frames; 768 is a robust number for Video-MME long videos.

> So for a 20-minute video, Qwen2-VL can only see the first 6.4 minutes of it? Is this correct? Thank you.

My guess is that when the video exceeds 768 frames, a downsampling strategy is adopted to generate the final 768 frames.

logicwong commented 3 weeks ago

> OK, I got your idea, but the claim that it performs well on 20-minute videos is not precise. I suggest you claim support for up to 768 frames instead of 20 minutes directly.

> I see. Claiming support for 20 minutes is more intuitive for regular users. In fact, Qwen2-VL still performs well with 1280 frames, as Figure 5 in the paper shows. The 80K token limit allows for up to 1280 frames; 768 is a robust number for Video-MME long videos.

> So for a 20-minute video, Qwen2-VL can only see the first 6.4 minutes of it? Is this correct? Thank you.

No, we sample 768 frames uniformly.
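In other words, frames are drawn evenly across the whole clip rather than truncated after 6.4 minutes. A minimal sketch of the idea (the index computation is an assumption about the general approach, not the exact decoding code):

```python
import numpy as np

def uniform_frame_indices(total_frames: int, n_samples: int = 768) -> np.ndarray:
    """Pick n_samples frame indices spread evenly over the whole video."""
    if total_frames <= n_samples:
        return np.arange(total_frames)
    return np.linspace(0, total_frames - 1, n_samples).round().astype(int)

# A 20-minute video at 30 fps has 36,000 frames; the sampled indices cover
# the entire duration, so the model is not limited to the first 6.4 minutes.
idx = uniform_frame_indices(20 * 60 * 30)
print(idx[:3], idx[-1])  # [ 0 47 94] 35999
```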

linchen111 commented 1 day ago

> OK, I got your idea, but the claim that it performs well on 20-minute videos is not precise. I suggest you claim support for up to 768 frames instead of 20 minutes directly.

> I see. Claiming support for 20 minutes is more intuitive for regular users. In fact, Qwen2-VL still performs well with 1280 frames, as Figure 5 in the paper shows. The 80K token limit allows for up to 1280 frames; 768 is a robust number for Video-MME long videos.

> So for a 20-minute video, Qwen2-VL can only see the first 6.4 minutes of it? Is this correct? Thank you.

> No, we sample 768 frames uniformly.

What is the difference between the number of frames and limit_mm_per_prompt={"image": 20, "video": 10}?