Lightning-AI / lightning-thunder

Make PyTorch models up to 40% faster! Thunder is a source to source compiler for PyTorch. It enables using different hardware executors at once; across one or thousands of GPUs.
Apache License 2.0

When comparing Thunder Torch Executor to Torch Eager, the ResNet18 gradients are not close for FP32. #655

Open kiya00 opened 3 days ago

kiya00 commented 3 days ago


🐛 Bug

To Reproduce

Steps to reproduce the behavior:

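  1. Modify the test case so that the gradient comparison is no longer restricted to float64, i.e. change the condition to: if train and executor == TorchExecutor:
  2. Run pytest thunder/tests/ -k test_parse_resnet18_torch_cuda_float32[True] and observe the error: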
            if train and executor == TorchExecutor: # and dtype == thunder.float64:
                torch_grads = torch.autograd.grad(out1, ref_model.parameters(), torch.ones_like(out1))
                thunder_grads = torch.autograd.grad(out2, jitted.parameters(), torch.ones_like(out2))
    >               torch.testing.assert_close(torch_grads, thunder_grads)
    E               AssertionError: Tensor-likes are not close!
    E               Mismatched elements: 9405 / 9408 (100.0%)
    E               Greatest absolute difference: 0.09205560386180878 at index (4, 1, 5, 0) (up to 1e-05 allowed)
    E               Greatest relative difference: 10.715060234069824 at index (39, 1, 3, 0) (up to 1.3e-06 allowed)
    E               The failure occurred for item [0]

thunder/tests/ AssertionError
=================================================== short test summary info ===================================================
FAILED thunder/tests/[True] - AssertionError: Tensor-likes are not close!
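To tell ordinary float32 rounding apart from a real executor bug, one option is to compare each float32 gradient against a float64 reference and look at the error magnitudes directly, rather than relying on the float32 defaults of torch.testing.assert_close (rtol 1.3e-06, atol 1e-05 in the log above). The following is only a minimal sketch under stated assumptions: the small model is a hypothetical stand-in for ResNet18 so that it runs on CPU, and `grad_error_report` is an illustrative helper, not part of Thunder or PyTorch.

```python
import copy

import torch


def grad_error_report(grads, ref_grads):
    """Return a list of (max abs diff, max rel diff) for each gradient pair.

    Hypothetical helper for illustration; both inputs are upcast to float64.
    """
    report = []
    for g, ref in zip(grads, ref_grads):
        g64, ref64 = g.double(), ref.double()
        abs_diff = (g64 - ref64).abs()
        # Clamp the denominator so near-zero reference entries do not divide by zero.
        rel_diff = abs_diff / ref64.abs().clamp_min(1e-12)
        report.append((abs_diff.max().item(), rel_diff.max().item()))
    return report


torch.manual_seed(0)

# Assumption: a tiny CPU model standing in for ResNet18.
model32 = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, kernel_size=3),
    torch.nn.Tanh(),
    torch.nn.Flatten(),
    torch.nn.Linear(8 * 6 * 6, 4),
)
model64 = copy.deepcopy(model32).double()  # same weights, higher precision

x32 = torch.randn(2, 3, 8, 8)
out32 = model32(x32)
out64 = model64(x32.double())

grads32 = torch.autograd.grad(out32, model32.parameters(), torch.ones_like(out32))
grads64 = torch.autograd.grad(out64, model64.parameters(), torch.ones_like(out64))

report = grad_error_report(grads32, grads64)
for i, (max_abs, max_rel) in enumerate(report):
    print(f"param {i}: max abs diff {max_abs:.3e}, max rel diff {max_rel:.3e}")
```

In the failing test, the same report could be computed once for the eager gradients and once for the Thunder gradients, each against a float64 run. If both sit at a similar distance from the reference, the mismatch is more likely a precision effect than an operator bug.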

kiya00 commented 22 hours ago

Running this script:

import torch
import torchvision

import os
import random

model = torchvision.models.resnet18(weights=None).to(device="cuda", dtype=torch.float32)
x = torch.randn((1, 3, 224, 224), dtype=torch.float32, device="cuda", requires_grad=True)
print(torch.autograd.gradcheck(model, (x,)))

raises a GradcheckError:

/usr/local/lib/python3.10/dist-packages/torch/autograd/ UserWarning: Attempting to run cuBLAS, but there was no current CUDA context! Attempting to set the primary context... (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/cuda/CublasHandlePool.cpp:135.)
  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
root@9340b8cf8485:/wayan/lightning-thunder# python thunder/tests/
/usr/local/lib/python3.10/dist-packages/torch/autograd/ UserWarning: Input #0 requires gradient and is not a double precision floating point or complex. This check will likely fail if all the inputs are not of double precision floating point or complex.
/usr/local/lib/python3.10/dist-packages/torch/autograd/ UserWarning: Attempting to run cuBLAS, but there was no current CUDA context! Attempting to set the primary context... (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/cuda/CublasHandlePool.cpp:135.)
  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
Traceback (most recent call last):
  File "/wayan/lightning-thunder/thunder/tests/", line 15, in <module>
    print(torch.autograd.gradcheck(model, (x,)))
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/", line 2053, in gradcheck
    return _gradcheck_helper(**args)
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/", line 2082, in _gradcheck_helper
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/", line 1492, in _gradcheck_real_imag
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/", line 1633, in _slow_gradcheck
    raise GradcheckError(
torch.autograd.gradcheck.GradcheckError: Jacobian mismatch for output 0 with respect to input 0,
numerical:tensor([[ 0.1043,  0.0522,  0.0298,  ..., -0.0149, -0.0447, -0.0596],
        [ 0.1043, -0.0447,  0.0298,  ...,  0.1341,  0.0298, -0.0894],
        [ 0.1192, -0.1043,  0.0000,  ..., -0.0596, -0.0149,  0.0596],
        [ 0.1788, -0.0820, -0.0149,  ...,  0.0224,  0.1341, -0.0596],
        [ 0.0000, -0.2459,  0.1639,  ...,  0.0894, -0.1267, -0.0596],
        [ 0.1043, -0.0075, -0.1043,  ...,  0.0894, -0.0969,  0.0000]],
analytical:tensor([[ 2.5345e-04, -2.1945e-04,  7.5599e-05,  ..., -1.5271e-04,
         -1.6242e-04,  4.9330e-04],
        [-2.0753e-04,  5.1979e-04,  8.1766e-05,  ..., -2.5569e-04,
         -2.4477e-04,  1.9414e-04],
        [-1.2130e-04,  1.2330e-04, -2.3220e-04,  ...,  2.7823e-04,
          2.9276e-04, -1.9633e-04],
        [-7.2192e-05, -9.3861e-05, -4.2660e-05,  ..., -7.6299e-05,
         -6.6284e-05,  1.2527e-05],
        [ 4.0978e-05,  2.3847e-05,  2.6876e-05,  ..., -2.3141e-05,
         -1.0444e-06,  1.5903e-05],
        [-4.5947e-05, -1.3556e-05, -8.9267e-05,  ...,  6.1379e-05,
          2.5143e-05,  2.6964e-05]], device='cuda:0')

With float64, the check passes.
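For reference, a double-precision gradcheck can be sketched as follows. This is an assumption-laden illustration, not the script that was run for this issue: the model is a hypothetical small stand-in for ResNet18 so that the slow Jacobian comparison finishes quickly on CPU, and gradcheck is called with its default eps and tolerances.

```python
import torch

torch.manual_seed(0)

# Assumption: a tiny CPU model in float64 standing in for ResNet18.
model = torch.nn.Sequential(
    torch.nn.Conv2d(1, 2, kernel_size=3),
    torch.nn.Tanh(),
    torch.nn.Flatten(),
    torch.nn.Linear(2 * 2 * 2, 3),
).double()

# gradcheck only checks inputs that require grad, so the input is marked explicitly.
x = torch.randn(1, 1, 4, 4, dtype=torch.float64, requires_grad=True)

# Returns True on success and raises GradcheckError on a Jacobian mismatch.
passed = torch.autograd.gradcheck(model, (x,))
print(passed)
```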

t-vi commented 19 hours ago

The torch.autograd.gradcheck documentation says:

Note: The default values are designed for input of double precision. This check will likely fail if input is of less precision, e.g., FloatTensor.

However, the values above seem very far off, so I'm wondering whether the operators we call have a bug, or whether some of their input assumptions are not satisfied.
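One way to gauge how far off a float32 finite difference can be on its own is to evaluate a central difference with gradcheck's default eps of 1e-6 on a function whose derivative is known exactly. The sketch below uses sin as a hypothetical stand-in for the model, and `central_difference` is an illustrative helper rather than gradcheck's internal routine. As an observation rather than a conclusion: the entries printed under numerical: above are all close to integer multiples of about 0.0075, which is roughly what one float32 rounding step of about 1.5e-8 divided by 2 * eps = 2e-6 would produce. That would be consistent with the numerical Jacobian being dominated by rounding, but it does not by itself rule out an operator bug.

```python
import torch


def central_difference(f, x, eps=1e-6):
    """Central finite difference of f at x (illustrative helper, not gradcheck's code)."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)


x64 = torch.tensor(0.3, dtype=torch.float64)
x32 = x64.float()

# The exact derivative of sin is cos; evaluate it in float64 as the reference.
exact = torch.cos(x64).item()

err64 = abs(central_difference(torch.sin, x64).item() - exact)
err32 = abs(central_difference(torch.sin, x32).item() - exact)

print(f"float64 finite-difference error: {err64:.3e}")
print(f"float32 finite-difference error: {err32:.3e}")
```

In float32 the difference f(x + eps) - f(x - eps) can only take values that are whole numbers of rounding steps, so dividing by 2 * eps amplifies that granularity. A larger eps together with looser atol and rtol, or a float64 run, avoids this effect.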