Open · Edwardmark opened 1 month ago
In modeling_qwen2_vl.py (https://github.com/huggingface/transformers/blob/main/src/transformers/models/qwen2_vl/modeling_qwen2_vl.py#L343), the attention_mask is built per frame over the concatenated sequence. If each frame were instead treated as its own batch element, the cost would be only frames x flops_per_frame, but the current implementation attends over the full concatenated sequence and can waste computation, roughly frames x frames x flops_per_frame.

Yes, this does lead to extra computation. The main purpose of implementing it this way is to stay compatible with image samples of different scales (i.e. variable-length visual sequences). A simple way to reduce the memory and compute overhead would be to rewrite it with a for loop over frames. However, we strongly recommend the FlashAttention-2 implementation (attn_implementation="flash_attention_2"), since it includes specific optimizations for attention over varying sequence lengths.
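A minimal sketch of the trade-off being discussed, assuming hypothetical helper names (`naive_block_diag_attention`, `per_segment_attention`) and a `cu_seqlens` list of frame boundaries; it is an illustration of the cost difference, not the actual Qwen2-VL code. The commented-out call at the end shows the variable-length interface from the flash-attn package that avoids materializing the dense mask entirely.

```python
import torch
import torch.nn.functional as F

# cu_seqlens marks frame boundaries in the concatenated visual sequence,
# e.g. 3 frames of 4 tokens each -> [0, 4, 8, 12]

def naive_block_diag_attention(q, k, v, cu_seqlens):
    """Eager-style path: one attention over the full concatenated sequence,
    with a block-diagonal mask so tokens only attend within their own frame.
    Cost scales with total_len ** 2, i.e. ~frames x frames x flops_per_frame."""
    total_len = q.shape[0]
    mask = torch.full((total_len, total_len), float("-inf"), device=q.device)
    for i in range(len(cu_seqlens) - 1):
        s, e = cu_seqlens[i], cu_seqlens[i + 1]
        mask[s:e, s:e] = 0.0  # allow attention inside each frame block
    # q, k, v: (total_len, num_heads, head_dim) -> (num_heads, total_len, head_dim)
    q_, k_, v_ = (t.transpose(0, 1) for t in (q, k, v))
    scores = q_ @ k_.transpose(-2, -1) / q.shape[-1] ** 0.5 + mask
    return (scores.softmax(dim=-1) @ v_).transpose(0, 1)

def per_segment_attention(q, k, v, cu_seqlens):
    """For-loop alternative: attend within each frame separately.
    Cost scales with frames x (frame_len ** 2), at the price of a Python loop."""
    outs = []
    for i in range(len(cu_seqlens) - 1):
        s, e = cu_seqlens[i], cu_seqlens[i + 1]
        q_, k_, v_ = (t[s:e].transpose(0, 1) for t in (q, k, v))
        outs.append(F.scaled_dot_product_attention(q_, k_, v_).transpose(0, 1))
    return torch.cat(outs, dim=0)

# FlashAttention-2 varlen path: no dense mask is built; frame boundaries are
# passed as cumulative sequence lengths (requires CUDA tensors, int32 cu_seqlens).
# from flash_attn import flash_attn_varlen_func
# out = flash_attn_varlen_func(q, k, v, cu_seqlens, cu_seqlens,
#                              max_seqlen, max_seqlen)
```

Under these assumptions, the varlen FlashAttention-2 call gives the per-frame cost profile of the for-loop version without the dense frames x frames mask or the Python-level loop.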