Lightning-AI / lightning-thunder

Make PyTorch models up to 40% faster! Thunder is a source-to-source compiler for PyTorch. It enables using different hardware executors at once, across one or thousands of GPUs.
Apache License 2.0

test_vjp_correctness_sdpa_manual_grad_forward_scaled_dot_product_attention_nvfuser_cuda_thunder.dtypes.float16 is failing #1153

Open mruberry opened 3 days ago

mruberry commented 3 days ago

See this excerpt from https://dev.azure.com/Lightning-AI/lightning/_build/results?buildId=215225&view=logs&j=5b0799f7-725e-5b16-9b83-c0a5a25d03f0&t=97651ec4-0b0f-5455-bbb5-3c30427a0a7e

FAILED thunder/tests/test_grad.py::test_vjp_correctness_sdpa_manual_grad_forward_scaled_dot_product_attention_nvfuser_cuda_thunder.dtypes.float16 - AssertionError: Tensor-likes are not close!

Mismatched elements: 16 / 192 (8.3%)
Greatest absolute difference: nan at index (0, 0, 0, 0) (up to 1e-05 allowed)
Greatest relative difference: nan at index (0, 0, 0, 0) (up to 0.001 allowed)
FAILED thunder/tests/test_grad.py::test_vjp_correctness_sdpa_manual_grad_forward_scaled_dot_product_attention_nvfuser_cuda_thunder.dtypes.bfloat16 - AssertionError: Tensor-likes are not close!

Mismatched elements: 6 / 192 (3.1%)
Greatest absolute difference: 0.078125 at index (0, 1, 5, 6) (up to 1e-05 allowed)
Greatest relative difference: 0.043701171875 at index (0, 0, 7, 6) (up to 0.016 allowed)
FAILED thunder/tests/test_grad.py::test_vjp_correctness_sdpa_manual_grad_forward_scaled_dot_product_attention_torch_cuda_thunder.dtypes.bfloat16 - AssertionError: Tensor-likes are not close!

Mismatched elements: 3 / 192 (1.6%)
Greatest absolute difference: 0.140625 at index (0, 1, 10, 2) (up to 1e-05 allowed)
Greatest relative difference: 1.953125 at index (0, 0, 8, 3) (up to 0.016 allowed)
FAILED thunder/tests/test_grad.py::test_vjp_correctness_sdpa_manual_grad_forward_scaled_dot_product_attention_torch_cuda_thunder.dtypes.float16 - AssertionError: Tensor-likes are not close!

Mismatched elements: 22 / 192 (11.5%)
Greatest absolute difference: nan at index (0, 1, 0, 0) (up to 1e-05 allowed)
Greatest relative difference: nan at index (0, 1, 0, 0) (up to 0.001 allowed)
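
For readers unfamiliar with the "Tensor-likes are not close!" message: the "allowed" values in the log match the default tolerances of `torch.testing.assert_close` (rtol 1e-3 for float16, 1.6e-2 for bfloat16, atol 1e-5 for both). Below is a minimal pure-Python sketch of the per-element check as documented for that function. The `is_close` helper and the `TOLERANCES` table are illustrative names, not part of Thunder or PyTorch, and the sketch assumes the test uses the default tolerances.

```python
import math

# (rtol, atol) pairs matching the "allowed" values printed in the log above.
# Assumption: the test relies on torch.testing.assert_close defaults.
TOLERANCES = {
    "float16": (1e-3, 1e-5),
    "bfloat16": (1.6e-2, 1e-5),
}


def is_close(actual: float, expected: float, rtol: float, atol: float,
             equal_nan: bool = False) -> bool:
    """Scalar version of the documented check: |a - e| <= atol + rtol * |e|."""
    if math.isnan(actual) or math.isnan(expected):
        # NaN only matches NaN, and only when equal_nan is requested.
        return equal_nan and math.isnan(actual) and math.isnan(expected)
    if actual == expected:
        # Covers exact matches, including infinities of the same sign.
        return True
    return abs(actual - expected) <= atol + rtol * abs(expected)


if __name__ == "__main__":
    rtol, atol = TOLERANCES["float16"]
    # A NaN on either side counts as a mismatch by default, which is why
    # the float16 failures report a "nan" difference.
    print(is_close(float("nan"), 0.5, rtol, atol))
```

Under this check, a single NaN in the computed gradient is enough to fail the comparison regardless of tolerance, so the float16 failures would not be fixed by loosening `rtol` or `atol`.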

cc @apaz-cli @borda @carmocca

t-vi commented 3 days ago

I had hoped that this would go away over the last few days, but apparently not. As mentioned on Slack, please ping if this blocks merging for you.

kshitij12345 commented 2 days ago

Looks like this is a duplicate of https://github.com/Lightning-AI/lightning-thunder/issues/703

The PyTorch nightly had a version bump, so the skip condition for the test no longer holds: https://github.com/Lightning-AI/lightning-thunder/blob/5fbeaa7102a514f807fc0a7041bf527d4ceb0eeb/thunder/tests/test_grad.py#L531-L534

I have created https://github.com/Lightning-AI/lightning-thunder/pull/1160 to skip it again.
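
To illustrate how a version bump can silently re-enable a skipped test, here is a rough sketch of a version-gated skip condition. It is not the actual condition in `test_grad.py`. The helper names, the pinned minor version, and the nightly version strings are all hypothetical and chosen only for illustration.

```python
import re


def release_tuple(version: str) -> tuple:
    """Extract the leading numeric release, e.g. '2.5.0.dev20240901+cu124' -> (2, 5, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def should_skip(version: str, affected_minor: tuple = (2, 5)) -> bool:
    """Hypothetical skip rule: skip only on builds of one specific minor version."""
    return release_tuple(version)[:2] == affected_minor


if __name__ == "__main__":
    # Hypothetical nightly version strings before and after a bump.
    for nightly in ("2.5.0.dev20240901+cu124", "2.6.0.dev20240915+cu124"):
        print(nightly, "skip" if should_skip(nightly) else "run")
```

A condition pinned to one version range stops applying as soon as the nightly moves past it, so the test starts running (and failing) again even though the underlying problem is unchanged.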

t-vi commented 1 day ago

When 2.5 is released, we might wonder what's going on... :)