ROCm / apex

A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch
BSD 3-Clause "New" or "Revised" License

The failing unit tests in test_transducer_joint.py #89

Open hubertlu-tw opened 2 years ago

hubertlu-tw commented 2 years ago

The four unit tests with "dropout" in test_transducer_joint.py failed with the following error message:

Traceback (most recent call last):
  File "/apex/apex/contrib/test/transducer/test_transducer_joint.py", line 149, in test_transducer_joint_pack_relu_dropout
    self.run_transducer_joint(for_vector_kernel=False, pack_output=True, relu=True, dropout=True)
  File "/apex/apex/contrib/test/transducer/test_transducer_joint.py", line 109, in run_transducer_joint
    mask=mask if dropout else None)
  File "/apex/apex/contrib/test/transducer/transducer_ref.py", line 94, in transducer_joint_reference
    h.backward(h_grad)
  File "/opt/conda/lib/python3.7/site-packages/torch/_tensor.py", line 402, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/__init__.py", line 193, in backward
    allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.HalfTensor [4, 101, 25, 509]], which is output 0 of ReluBackward0, is at version 2; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
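
For context, autograd raises this RuntimeError when a tensor saved for the backward pass is changed in place before backward runs. The sketch below is a CPU-only, float32 illustration, not the code in transducer_ref.py. The helper names scaled_relu and backward_result are hypothetical, and the in-place mul_ only stands in for whatever in-place operations touch the ReLU output in the dropout path. The "version 2" in the message suggests two such modifications, but the traceback does not show which ones.

```python
import torch


def scaled_relu(x, inplace):
    """Apply ReLU, then scale the result either in place or out of place."""
    h = torch.relu(x)      # ReLU keeps its output for use in the backward pass
    if inplace:
        h.mul_(2.0)        # overwrites the saved output and bumps its version counter
    else:
        h = h * 2.0        # allocates a new tensor; the saved output stays untouched
    return h


def backward_result(inplace):
    """Return the gradient of the input, or the error text if backward fails."""
    x = torch.ones(3, requires_grad=True)
    h = scaled_relu(x, inplace)
    try:
        h.backward(torch.ones_like(h))
    except RuntimeError as err:
        return str(err)
    return x.grad


if __name__ == "__main__":
    print(backward_result(inplace=True))   # error text naming ReluBackward0
    print(backward_result(inplace=False))  # gradient tensor
```

If the reference path does modify the ReLU output in place when dropout is enabled, the out-of-place form in the second branch would avoid the error. That is an assumption to check against transducer_ref.py.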

The unit test test_transducer_joint_pack_relu failed with the following error message:

Traceback (most recent call last):
  File "/apex/apex/contrib/test/transducer/test_transducer_joint.py", line 137, in test_transducer_joint_pack_relu
    self.run_transducer_joint(for_vector_kernel=False, pack_output=True, relu=True, dropout=False)
  File "/apex/apex/contrib/test/transducer/test_transducer_joint.py", line 115, in run_transducer_joint
    self.assertTrue(torch.allclose(f_grad_ref, f_grad_tst, atol=1e-5, rtol=1e-5))
AssertionError: False is not true

These failures are not reproducible locally with the Docker image rocm/pytorch:latest (== rocm5.2_ubuntu20.04_py3.7_pytorch_staging). We may need to mark them as flaky tests in the future or adjust the tolerance for ROCm.
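
To give a rough sense of how tight atol=1e-5, rtol=1e-5 is, the sketch below re-implements in plain Python the elementwise criterion documented for torch.allclose, |input - other| <= atol + rtol * |other|. It assumes the compared gradients are float16, as the HalfTensor in the first traceback suggests; the issue does not confirm this for the second test. The helper name is_close and the relaxed values of 1e-3 are illustrative only, not a recommended setting.

```python
def is_close(inp, other, atol=1e-5, rtol=1e-5):
    """Scalar form of the torch.allclose criterion: |inp - other| <= atol + rtol * |other|."""
    return abs(inp - other) <= atol + rtol * abs(other)


# Spacing between 1.0 and the next representable float16 value.
FP16_STEP = 2.0 ** -10

ref = 1.0
tst = 1.0 + FP16_STEP  # differs from ref by a single float16 step

# Tolerances used by the failing assertion.
strict = is_close(ref, tst)
# Hypothetical relaxed tolerances, for illustration only.
relaxed = is_close(ref, tst, atol=1e-3, rtol=1e-3)

if __name__ == "__main__":
    print("strict:", strict)
    print("relaxed:", relaxed)
```

Under that assumption, two values near 1.0 that differ by one float16 step already exceed the current tolerance, so a small difference in rounding or accumulation order on ROCm could trip the assertion. A real kernel bug would look the same, so the mismatch needs checking before the tolerance is loosened.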