Open · Edwardmark opened 1 month ago
In modeling_qwen2_vl.py (https://github.com/huggingface/transformers/blob/main/src/transformers/models/qwen2_vl/modeling_qwen2_vl.py#L343), the attention_mask is built per frame over the concatenated sequence. If each frame were instead treated as its own batch element, the cost would be only frames x flops_per_frame, but the current implementation attends over the full concatenated sequence and can waste computation, roughly frames x frames x flops_per_frame.

Yes, this does lead to extra computation. The main purpose of implementing it this way is to stay compatible with image samples of different scales (i.e. variable-length visual sequences). A simple way to reduce the memory and compute overhead would be to rewrite it with a for loop over frames. However, we strongly recommend the FlashAttention-2 implementation (attn_implementation="flash_attention_2"), since it includes specific optimizations for attention over varying sequence lengths.
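A minimal sketch of the trade-off being discussed, assuming hypothetical helper names (`naive_block_diag_attention`, `per_segment_attention`) and a `cu_seqlens` list of frame boundaries; it is an illustration of the cost difference, not the actual Qwen2-VL code. The commented-out call at the end shows the variable-length interface from the flash-attn package that avoids materializing the dense mask entirely.

```python
import torch
import torch.nn.functional as F

# cu_seqlens marks frame boundaries in the concatenated visual sequence,
# e.g. 3 frames of 4 tokens each -> [0, 4, 8, 12]

def naive_block_diag_attention(q, k, v, cu_seqlens):
    """Eager-style path: one attention over the full concatenated sequence,
    with a block-diagonal mask so tokens only attend within their own frame.
    Cost scales with total_len ** 2, i.e. ~frames x frames x flops_per_frame."""
    total_len = q.shape[0]
    mask = torch.full((total_len, total_len), float("-inf"), device=q.device)
    for i in range(len(cu_seqlens) - 1):
        s, e = cu_seqlens[i], cu_seqlens[i + 1]
        mask[s:e, s:e] = 0.0  # allow attention inside each frame block
    # q, k, v: (total_len, num_heads, head_dim) -> (num_heads, total_len, head_dim)
    q_, k_, v_ = (t.transpose(0, 1) for t in (q, k, v))
    scores = q_ @ k_.transpose(-2, -1) / q.shape[-1] ** 0.5 + mask
    return (scores.softmax(dim=-1) @ v_).transpose(0, 1)

def per_segment_attention(q, k, v, cu_seqlens):
    """For-loop alternative: attend within each frame separately.
    Cost scales with frames x (frame_len ** 2), at the price of a Python loop."""
    outs = []
    for i in range(len(cu_seqlens) - 1):
        s, e = cu_seqlens[i], cu_seqlens[i + 1]
        q_, k_, v_ = (t[s:e].transpose(0, 1) for t in (q, k, v))
        outs.append(F.scaled_dot_product_attention(q_, k_, v_).transpose(0, 1))
    return torch.cat(outs, dim=0)

# FlashAttention-2 varlen path: no dense mask is built; frame boundaries are
# passed as cumulative sequence lengths (requires CUDA tensors, int32 cu_seqlens).
# from flash_attn import flash_attn_varlen_func
# out = flash_attn_varlen_func(q, k, v, cu_seqlens, cu_seqlens,
#                              max_seqlen, max_seqlen)
```

Under these assumptions, the varlen FlashAttention-2 call gives the per-frame cost profile of the for-loop version without the dense frames x frames mask or the Python-level loop.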