kiya00 opened 3 days ago
Run this script:
import torch
import torchvision
import os
os.environ["NVIDIA_TF32_OVERRIDE"]="0"
os.environ["CUBLAS_WORKSPACE_CONFIG"]=":4096:8"
torch.manual_seed(42)
import random
random.seed(42)
torch.use_deterministic_algorithms(True)
model = torchvision.models.resnet18(weights=None).to(device="cuda", dtype=torch.float32)
x = torch.randn((1, 3, 224, 224), dtype=torch.float32, device="cuda", requires_grad=True)
print(torch.autograd.gradcheck(model, (x,)))
It fails with a GradcheckError:
/usr/local/lib/python3.10/dist-packages/torch/autograd/graph.py:768: UserWarning: Attempting to run cuBLAS, but there was no current CUDA context! Attempting to set the primary context... (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/cuda/CublasHandlePool.cpp:135.)
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
root@9340b8cf8485:/wayan/lightning-thunder# python thunder/tests/testtrace.py
/usr/local/lib/python3.10/dist-packages/torch/autograd/gradcheck.py:920: UserWarning: Input #0 requires gradient and is not a double precision floating point or complex. This check will likely fail if all the inputs are not of double precision floating point or complex.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/autograd/graph.py:768: UserWarning: Attempting to run cuBLAS, but there was no current CUDA context! Attempting to set the primary context... (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/cuda/CublasHandlePool.cpp:135.)
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
Traceback (most recent call last):
File "/wayan/lightning-thunder/thunder/tests/testtrace.py", line 15, in <module>
print(torch.autograd.gradcheck(model, (x,)))
File "/usr/local/lib/python3.10/dist-packages/torch/autograd/gradcheck.py", line 2053, in gradcheck
return _gradcheck_helper(**args)
File "/usr/local/lib/python3.10/dist-packages/torch/autograd/gradcheck.py", line 2082, in _gradcheck_helper
_gradcheck_real_imag(
File "/usr/local/lib/python3.10/dist-packages/torch/autograd/gradcheck.py", line 1492, in _gradcheck_real_imag
gradcheck_fn(
File "/usr/local/lib/python3.10/dist-packages/torch/autograd/gradcheck.py", line 1633, in _slow_gradcheck
raise GradcheckError(
torch.autograd.gradcheck.GradcheckError: Jacobian mismatch for output 0 with respect to input 0,
numerical:tensor([[ 0.1043, 0.0522, 0.0298, ..., -0.0149, -0.0447, -0.0596],
[ 0.1043, -0.0447, 0.0298, ..., 0.1341, 0.0298, -0.0894],
[ 0.1192, -0.1043, 0.0000, ..., -0.0596, -0.0149, 0.0596],
...,
[ 0.1788, -0.0820, -0.0149, ..., 0.0224, 0.1341, -0.0596],
[ 0.0000, -0.2459, 0.1639, ..., 0.0894, -0.1267, -0.0596],
[ 0.1043, -0.0075, -0.1043, ..., 0.0894, -0.0969, 0.0000]],
device='cuda:0')
analytical:tensor([[ 2.5345e-04, -2.1945e-04, 7.5599e-05, ..., -1.5271e-04,
-1.6242e-04, 4.9330e-04],
[-2.0753e-04, 5.1979e-04, 8.1766e-05, ..., -2.5569e-04,
-2.4477e-04, 1.9414e-04],
[-1.2130e-04, 1.2330e-04, -2.3220e-04, ..., 2.7823e-04,
2.9276e-04, -1.9633e-04],
...,
[-7.2192e-05, -9.3861e-05, -4.2660e-05, ..., -7.6299e-05,
-6.6284e-05, 1.2527e-05],
[ 4.0978e-05, 2.3847e-05, 2.6876e-05, ..., -2.3141e-05,
-1.0444e-06, 1.5903e-05],
[-4.5947e-05, -1.3556e-05, -8.9267e-05, ..., 6.1379e-05,
2.5143e-05, 2.6964e-05]], device='cuda:0')
With float64, the check passes.
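For comparison, here is a minimal CPU sketch of the same check on a small stand-in model. The `make_model` and `run_gradcheck` helpers and the loosened `eps`/`atol`/`rtol` values are illustrative assumptions, not part of the original script. It runs `torch.autograd.gradcheck` once in float64 with the default tolerances and once in float32 with a larger finite-difference step.

```python
import torch

def make_model(dtype):
    # Hypothetical small stand-in for ResNet-18; Tanh keeps the function smooth
    # so finite differences are well behaved.
    torch.manual_seed(0)
    return torch.nn.Sequential(
        torch.nn.Linear(4, 8),
        torch.nn.Tanh(),
        torch.nn.Linear(8, 3),
    ).to(dtype=dtype)

def run_gradcheck(dtype, **kwargs):
    model = make_model(dtype)
    x = torch.randn(2, 4, dtype=dtype, requires_grad=True)
    # raise_exception=False makes gradcheck return False instead of raising.
    return torch.autograd.gradcheck(model, (x,), raise_exception=False, **kwargs)

# Default eps/atol/rtol are designed for double precision inputs.
ok_double = run_gradcheck(torch.float64)

# Assumed, illustrative tolerances for float32: a larger step reduces the
# relative weight of rounding error in the finite difference.
ok_float_loose = run_gradcheck(torch.float32, eps=1e-2, atol=1e-2, rtol=1e-2)

print("float64, default tolerances:", ok_double)
print("float32, loosened tolerances:", ok_float_loose)
```

The float32 result depends on the model and the chosen tolerances, so it is best treated as a sanity check, not a precise test.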
The gradcheck documentation (https://pytorch.org/docs/stable/generated/torch.autograd.gradcheck.gradcheck.html) says:
Note: The default values are designed for input of double precision. This check will likely fail if input is of less precision, e.g., FloatTensor.
However, the values above seem very far off, so I'm wondering whether the operators we call have a bug, or whether some input assumptions are not satisfied.
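One possible reading, offered as an assumption and not a confirmed diagnosis: the numerical entries printed above (0.0075, 0.0149, 0.0298, 0.0447, ...) look like multiples of one float32 rounding step divided by `2 * eps`. For outputs between 0.25 and 0.5, that step is 2^-25, and 2^-25 / (2 * 1e-6) ≈ 0.0149. The pure-Python sketch below emulates float32 rounding with `struct` and applies a central difference with gradcheck's default `eps=1e-6` to an illustrative function whose true gradient is about 5e-5. The function `f` and the helper names are hypothetical.

```python
import math
import struct

def f32(v: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", v))[0]

def f64(v: float) -> float:
    """Identity: Python floats are already float64."""
    return v

def central_diff(f, x, eps, rnd):
    """Central difference with inputs and outputs rounded by `rnd`."""
    hi = rnd(f(rnd(x + eps)))
    lo = rnd(f(rnd(x - eps)))
    return (hi - lo) / (2 * eps)

def f(v):
    # Illustrative function: output near 0.3, gradient on the order of 1e-4.
    return 0.3 + 1e-4 * math.sin(v)

x = 1.0
eps = 1e-6                      # gradcheck's default step
true_grad = 1e-4 * math.cos(x)  # analytical derivative

est64 = central_diff(f, x, eps, f64)
est32 = central_diff(f, x, eps, f32)

# One float32 rounding step for values in [0.25, 0.5), scaled by 1 / (2 * eps).
quantum = 2.0 ** -25 / (2 * eps)

print("true gradient   :", true_grad)
print("float64 estimate:", est64)
print("float32 estimate:", est32)
print("float32 estimate / quantum:", est32 / quantum)
```

In float32 the estimate can only be a whole multiple of `quantum`, so it cannot resolve a gradient of this size, while the float64 estimate tracks the analytical value closely. If the same effect applies to the ResNet-18 run, the numerical Jacobian would be mostly rounding noise and the mismatch would not by itself indicate an operator bug.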
🐛 Bug
To Reproduce
Steps to reproduce the behavior:
if train and executor == TorchExecutor:
pytest thunder/tests/test_inplace_functionalization.py -k test_parse_resnet18_torch_cuda_float32[True]
See the error:
thunder/tests/test_inplace_functionalization.py:187: AssertionError
=================================================== short test summary info ===================================================
FAILED thunder/tests/test_inplace_functionalization.py::test_parse_resnet18_torch_cuda_float32[True] - AssertionError: Tensor-likes are not close!