Closed t-vi closed 4 months ago
This is upstream at https://github.com/pytorch/pytorch/pull/128350
With the merged PyTorch PR, we traded the old errors for a number of vjp correctness errors about different results in thunder/tests/test_grad.py::test_vjp_correctness_sdpa_manual_grad_forward_scaled_dot_product_attention_torch_cuda_bfloat16 and one FAILED thunder/tests/test_ops.py::test_core_vs_torch_consistency_grad_forward_scaled_dot_product_attention_torch_cuda_float16 - RuntimeError: cuDNN Frontend error: [cudnn_frontend] Error: No execution plans built successfully.
While investigating the latter, on my local machine this seems flaky and prone to hang in the second of the three (float16, bfloat16, float32) tests.
@t-vi cuDNN SDPA is now disabled on PyTorch main altogether. Moreover, PyTorch main has moved to 2.5, and the thunder CI is running the cuDNN SDPA tests successfully as of today. (Link to thunder CI logs) (Link to revert commit)
I think we can re-enable the sdpa tests for 2.4 too, in case those still affect CI. Or at least remove those pytest.skip_if macros.
a number of vjp correctness errors about different results in thunder/tests/test_grad.py::test_vjp_correctness_sdpa_manual_grad_forward_scaled_dot_product_attention_torch_cuda_bfloat16 and one FAILED thunder/tests/test_ops.py::test_core_vs_torch_consistency_grad_forward_scaled_dot_product_attention_torch_cuda_float16 - RuntimeError: cuDNN Frontend error: [cudnn_frontend] Error: No execution plans built successfully.
Can you please provide details about your environment? Mainly the output of:
import cudnn
print(cudnn.__version__)
print(cudnn.backend_version_string())
The "RuntimeError: cuDNN Frontend error:" message makes me think that this is an old version of thunder/cudnn-frontend. Currently, cudnnex is supposed to reject cudnn frontend versions before 1.3.
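For anyone reproducing this, the environment check above can be combined with the 1.3 minimum into one small script. This is only a sketch: the helper names (parse_version, meets_minimum, report_cudnn_frontend) are made up for illustration and are not part of thunder or cudnn-frontend, and the script assumes the cudnn module exposes __version__ and backend_version_string() as in the snippet above.

```python
import importlib


def parse_version(version: str) -> tuple[int, ...]:
    """Extract the leading numeric components, e.g. '2.4.0a0+git' -> (2, 4, 0)."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
        if len(digits) < len(piece):
            # A suffix such as 'a0' or '+git' ends the numeric part.
            break
    return tuple(parts)


def meets_minimum(version: str, minimum: str = "1.3") -> bool:
    """Return True if `version` is at least `minimum` (numeric parts only)."""
    return parse_version(version) >= parse_version(minimum)


def report_cudnn_frontend(minimum: str = "1.3") -> str:
    """Describe the installed cudnn frontend, or say that it is missing."""
    try:
        cudnn = importlib.import_module("cudnn")
    except ImportError:
        return "cudnn frontend is not installed"
    version = getattr(cudnn, "__version__", "unknown")
    # backend_version_string is assumed to exist, as in the snippet above.
    backend_fn = getattr(cudnn, "backend_version_string", None)
    backend = backend_fn() if callable(backend_fn) else "unknown"
    status = "ok" if meets_minimum(version, minimum) else f"older than {minimum}"
    return f"cudnn frontend {version} ({status}), backend {backend}"


if __name__ == "__main__":
    print(report_cudnn_frontend())
```

Pasting the printed line into the issue would give the frontend and backend versions at a glance.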
I have disabled a number of tests failing with 2.4.0a0+. I imagine it is a PyTorch thing, but I'm a bit concerned nonetheless.
The CI job for the CUDA PyTorch main branch has been failing with cuDNN errors since today-ish (European time). It seems to fail 100% of the time, not randomly.
e.g. https://dev.azure.com/Lightning-AI/lightning/_build/results?buildId=205024&view=logs&j=5b0799f7-725e-5b16-9b83-c0a5a25d03f0&t=97651ec4-0b0f-5455-bbb5-3c30427a0a7e&l=11885
I don't yet have an idea of what happened.
cc @borda