Open ShijunK opened 6 months ago
Hi, if you want deterministic (reproducible) results, you need to enable them in PyTorch: https://pytorch.org/docs/stable/generated/torch.use_deterministic_algorithms.html
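A minimal sketch of what enabling this looks like (assuming a recent PyTorch; note that on CUDA some ops additionally require the `CUBLAS_WORKSPACE_CONFIG` environment variable to be set, per the linked docs):

```python
import torch

# Ask PyTorch to use deterministic kernels where available; ops without a
# deterministic implementation will raise an error instead of silently
# producing run-to-run differences.
torch.use_deterministic_algorithms(True)

# Seeding the RNG is still needed for reproducible random inputs.
torch.manual_seed(0)
a = torch.randn(4, 8)

torch.manual_seed(0)
b = torch.randn(4, 8)

print(torch.equal(a, b))  # same seed, same op -> identical tensors
```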
@danthe3rd, forgot to update on this. With further debugging, we found that the cutlass memory-efficient attention kernel produces wrong results (not a small difference due to floating-point rounding or randomness) on older GPUs (compute capability < 8.0) with xformers 0.0.20 and 0.0.24.
We temporarily worked around it by restricting train and inference jobs to GPUs with compute capability >= 8.0 only.
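A minimal sketch of such a guard (assumption: `supports_sm80` is a hypothetical helper name for this workaround, not part of xformers; it only relies on the standard `torch.cuda.get_device_capability` API):

```python
import torch

def supports_sm80(device_index: int = 0) -> bool:
    """True if the given CUDA device has compute capability >= 8.0 (Ampere or newer)."""
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability(device_index)
    return (major, minor) >= (8, 0)

# Jobs can then skip (or fall back from) memory_efficient_attention on older GPUs:
if not supports_sm80():
    print("compute capability < 8.0: avoiding memory_efficient_attention")
```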
❓ Questions and Help
memory_efficient_attention forward produces inconsistent results
Not sure what is going on. An incorrect build? Some specific version combination?
It failed for some combinations:

| xformers | torch | CUDA | GPU | Compute Capability | Status |
|---|---|---|---|---|---|
| v0.0.20+1dc3d7a (built from source) | 1.13 | 11.7 | Quadro RTX 6000 | 7.5 | Failed |
| v0.0.20+1dc3d7a (built from source) | 1.13 | 11.7 | A100 | 8.0 | Failed |
| v0.0.21+320b5ad (built from source) | 1.13 | 11.7 | Quadro RTX 6000 | 7.5 | Failed |
| v0.0.22+1e065bc (built from source) | 1.13 | 11.7 | Quadro RTX 6000 | 7.5 | Failed |
| v0.0.23+1254a16 (built from source) | 1.13 | 11.7 | Quadro RTX 6000 | 7.5 | Failed |

but passed for others:

| xformers | torch | CUDA | GPU | Compute Capability | Status |
|---|---|---|---|---|---|
| v0.0.20+1dc3d7a (built from source) | 1.13 | 11.7 | RTX A6000 | 8.6 | Passed |
| v0.0.24+f7e46d5 (built from source) | 2.2 | 11.8 | A100 | 8.0 | Passed |
| v0.0.24+f7e46d5 (built from source) | 2.2 | 11.8 | RTX A6000 | 8.6 | Passed |
| v0.0.24+f7e46d5 (built from source) | 2.2 | 12.1 | RTX A6000 | 8.6 | Passed |
| v0.0.24+f7e46d5 (built from source) | 2.2 | 12.1 | H100 | 9.0 | Passed |
| 0.0.22.post7 (pip install) | 2.1 | 11.8 | A100 | 8.0 | Passed |
| 0.0.23 (pip install) | 2.1.1 | 11.8 | A100 | 8.0 | Passed |
Command
pytest test_simple.py -v
To Reproduce
Steps to reproduce the behavior (for the combination v0.0.20+1dc3d7a built from source, torch 1.13, CUDA 11.7, A100):
test code:
output:
Expected behavior
I expect xformers memory-efficient attention to produce close-enough forward results when executed twice, across torch versions (1.13 and 2.2), CUDA versions (11.7, 11.8, 12.1), GPUs with different compute capabilities (7.5, 8.0, 8.6, 9.0), and different q/k sequence lengths, batch sizes, and data types.
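To illustrate the kind of consistency check meant here, a CPU-only sketch (assumption: PyTorch's `scaled_dot_product_attention` is used as a stand-in for `xformers.ops.memory_efficient_attention`, which requires a CUDA device; the shapes and tolerances are illustrative):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# [batch, heads, seq_len, head_dim]
q = torch.randn(2, 4, 16, 32)
k = torch.randn(2, 4, 16, 32)
v = torch.randn(2, 4, 16, 32)

# Run the fused attention twice on identical inputs.
out1 = F.scaled_dot_product_attention(q, k, v)
out2 = F.scaled_dot_product_attention(q, k, v)

# Naive reference: softmax(QK^T / sqrt(d)) V in float32.
scores = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
ref = scores.softmax(dim=-1) @ v

print(torch.allclose(out1, out2))               # run-to-run consistency
print(torch.allclose(out1, ref, atol=1e-5))     # close to the reference
```

The same structure (two runs plus a naive reference, compared with `torch.allclose`) applies to the xformers op on GPU.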
Environment
Please copy and paste the output from the environment collection script from PyTorch (or fill out the checklist below manually).
You can run the script with:
`python -m torch.utils.collect_env`

How you installed PyTorch (conda, pip, source):

Additional context
one failed environment:
one success environment:
All versions of xformers (v0.0.20+1dc3d7a, v0.0.21+320b5ad, v0.0.22+1e065bc, v0.0.23+1254a16, v0.0.24+f7e46d5) were built from source, except 0.0.22.post7 and 0.0.23, which were installed via pip.