Open · Marks101 opened this issue 5 days ago
Hello team,

We have been noticing some pretty large deviations between the attention output of flash/unfused attention versus the fused attention kernels when sliding window attention is active. The following sample illustrates this:
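A minimal sketch of the comparison (the exact script is not reproduced here; the shapes, the window size, and the use of the NVTE_FLASH_ATTN / NVTE_FUSED_ATTN environment variables for backend selection are illustrative assumptions, not the original code):

```python
# Sketch: compare TE's flash, fused (cuDNN), and unfused DPA backends with
# sliding window attention. Shapes and window_size are placeholders.
import os
import torch
from transformer_engine.pytorch import DotProductAttention

torch.manual_seed(0)
seqlen, batch, heads, head_dim = 2048, 2, 16, 64
window_size = (256, 0)  # (left, right): causal sliding window

q, k, v = [
    torch.randn(seqlen, batch, heads, head_dim, dtype=torch.bfloat16, device="cuda")
    for _ in range(3)
]

def run(flash: str, fused: str) -> torch.Tensor:
    # Assumption: the backend is selected via NVTE_FLASH_ATTN / NVTE_FUSED_ATTN;
    # with both set to "0", TE falls back to the unfused DPA implementation.
    os.environ["NVTE_FLASH_ATTN"] = flash
    os.environ["NVTE_FUSED_ATTN"] = fused
    dpa = DotProductAttention(
        num_attention_heads=heads,
        kv_channels=head_dim,
        attention_dropout=0.0,
        attn_mask_type="causal",
        window_size=window_size,
    )
    return dpa(q, k, v)

out_flash = run("1", "0")
out_fused = run("0", "1")
out_unfused = run("0", "0")

print("diff flash vs unfused:", (out_flash - out_unfused).abs().max().item())
print("diff fused vs unfused:", (out_fused - out_unfused).abs().max().item())
print("diff flash vs fused:", (out_flash - out_fused).abs().max().item())
```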
The output we see on H100 with CUDA 12.5 and cuDNN 9.2.1 is:
Results:
diff flash vs unfused: 0.0330810546875
diff fused vs unfused: 0.033203125
diff flash vs fused: 0.001953125
The latter one seems rather large. Can you reproduce these results?

@cyanguwa Do you know what could be causing this?

Hi @Marks101,

Thanks for raising this issue. I seem to have overlooked the different window_size definition in cuDNN. cuDNN supports a sliding window of (i - window_size_left, i], which excludes the i - window_size_left element, whereas the original paper, flash-attn, and TE's unfused DPA use the definition [i - window_size_left, i + window_size_right], which includes the boundary elements. Please give #1212 a try and let me know if there are still any issues. Thanks!
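To make the off-by-one concrete, here is a small sketch of the two mask conventions (the helper names and the window size are illustrative, not code from TE or #1212):

```python
import torch

def swa_mask_inclusive(seqlen: int, left: int, right: int) -> torch.Tensor:
    """Flash-attn / TE unfused DPA convention: token i attends to [i - left, i + right]."""
    idx = torch.arange(seqlen)
    rel = idx[None, :] - idx[:, None]  # j - i for query i, key j
    return (rel >= -left) & (rel <= right)

def swa_mask_cudnn(seqlen: int, left: int) -> torch.Tensor:
    """cuDNN convention (causal): token i attends to (i - left, i], left bound excluded."""
    idx = torch.arange(seqlen)
    rel = idx[None, :] - idx[:, None]
    return (rel > -left) & (rel <= 0)

# With left=2 and right=0, the two masks differ exactly on the j = i - 2 diagonal:
print(swa_mask_inclusive(5, left=2, right=0).int())
print(swa_mask_cudnn(5, left=2).int())
```

Under these definitions, an inclusive left window of width left corresponds to an exclusive cuDNN bound of left + 1, so the two masks only agree after that adjustment.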