Open zidanehuang001 opened 4 days ago
Oh, seems like I didn't paste the full error log; adding it here:
### mode = 'bwd', batch_size = 1, headdim = 256, seqlen = 8192, causal = False ###
Traceback (most recent call last):
File "/workspace/code/flash-attention/hopper/benchmark_attn.py", line 294, in <module>
torch.testing.assert_close(ref_dv, dv, atol=0.05, rtol=0.05)
File "/usr/local/lib/python3.10/dist-packages/torch/testing/_comparison.py", line 1530, in assert_close
raise error_metas[0].to_error(msg)
AssertionError: Tensor-likes are not close!
Mismatched elements: 5463253 / 75497472 (7.2%)
Greatest absolute difference: 4.13671875 at index (0, 4096, 1, 136) (up to 0.05 allowed)
Greatest relative difference: inf at index (0, 1857, 28, 186) (up to 0.05 allowed)
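For context (not part of the log): torch.testing.assert_close flags an element as mismatched when |actual - expected| > atol + rtol * |expected|, and an inf relative difference just means the reference value at that index is zero while the actual value is not. A minimal pure-Python sketch of that criterion, using numbers from the log above:

```python
def is_close(actual, expected, atol=0.05, rtol=0.05):
    """Per-element closeness test in the style of torch.testing.assert_close:
    a value passes if |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)

# A large absolute error against a near-zero reference fails badly:
print(is_close(4.13671875, 0.0))  # False: 4.137 > 0.05 + 0.05 * 0
# A small deviation on a large reference value passes:
print(is_close(10.4, 10.0))       # True: 0.4 <= 0.05 + 0.5
```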
Please try the tdd branch, which supports bwd for hdim up to 256. We'll merge it soon.
Thanks! I will try
Thank you for the solution, now I can run hdim256 bwd with matching output!
One more thing: hdim256 bwd (9.215 ms) has >4x the latency of fwd (1.926 ms). Is there any room for further improvement?
### mode = 'fwd', batch_size = 1, headdim = 256, seqlen = 8192, causal = False ###
Fav2: 7.572ms, 326.7 TFLOPS
Fav3: 3.476ms, 711.7 TFLOPS
Fav3 varlen: 3.870ms, 639.3 TFLOPS
### mode = 'fwd', batch_size = 1, headdim = 256, seqlen = 8192, causal = True ###
Fav2: 4.137ms, 299.0 TFLOPS
Fav3: 1.926ms, 642.1 TFLOPS
Fav3 varlen: 2.014ms, 614.3 TFLOPS
### mode = 'bwd', batch_size = 1, headdim = 256, seqlen = 8192, causal = False ###
Fav2: 23.420ms, 264.1 TFLOPS
Fav3: 17.491ms, 353.6 TFLOPS
Fav3 varlen: 18.090ms, 341.9 TFLOPS
### mode = 'bwd', batch_size = 1, headdim = 256, seqlen = 8192, causal = True ###
Fav2: 11.967ms, 258.4 TFLOPS
Fav3: 9.215ms, 335.6 TFLOPS
Fav3 varlen: 9.771ms, 316.5 TFLOPS
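As a sanity check (my addition, not from the thread): the TFLOPS figures above are consistent with the usual attention FLOP accounting, roughly fwd = 4 * batch * seqlen^2 * nheads * headdim (halved for causal) and bwd = 2.5x fwd. The 2.5x factor and the formula are assumptions based on the standard FlashAttention benchmarking convention:

```python
def attn_flops(batch, seqlen, nheads, headdim, causal=False, mode="fwd"):
    # QK^T and PV each cost 2*b*s^2*h*d FLOPs -> 4*b*s^2*h*d for the forward pass.
    f = 4 * batch * seqlen**2 * nheads * headdim
    if causal:
        f //= 2           # only the lower triangle of the score matrix is computed
    if mode == "bwd":
        f = int(2.5 * f)  # assumed convention: bwd ~ 2.5x fwd (dQ, dK, dV + recompute)
    return f

# Reproduce the Fav3 non-causal bwd number from the log above:
ms = 17.491
tflops = attn_flops(1, 8192, 36, 256, mode="bwd") / (ms * 1e-3) / 1e12
print(f"{tflops:.1f} TFLOPS")  # ~353.6, matching the printed figure
```

Under this accounting, bwd does only ~2.5x the work of fwd, so the observed ~4.8x causal latency gap (9.215 ms vs 1.926 ms) reflects lower hardware utilization in the bwd kernel rather than just more FLOPs.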
You're welcome to work on it! We've seen the best perf with CUDA 12.3.
Hello,
I'm trying to test head_dim=256 backward performance on H100. With the modifications below, I managed to make it run; however, the result comparison reports a mismatch. Modifications:
1. Added run_mha_bwd_hdim256 in hopper/flash_bwd_launch_template.h (the 64,64 tile shape is referring to https://github.com/Dao-AILab/flash-attention/blob/main/csrc/flash_attn/src/flash_bwd_launch_template.h#L301)
2. Raised the head_size limit from 128 to 256 in hopper/flash_api.cpp
3. Added "flash_bwd_hdim256_fp16_sm90.cu" in hopper/setup.py
When running hopper/benchmark_attn.py with batch_size = 1, seqlen = 8192, nheads = 36, I came across an error indicating a result mismatch (full log in the follow-up comment):
PS: I noticed there is a TODO for headdim 256 bwd in hopper/flash_api.cpp. Could this lead to the mismatch? Does anything in it need tuning? It seems my modifications above shouldn't introduce this error.