Open jithunnair-amd opened 2 years ago
A PyTorch commit made sometime between rocm/pytorch:rocm4.3.1_ubuntu18.04_py3.6_pytorch_1.9.0 and 2021-12-01 caused the above unit tests to start failing.
To investigate this issue further, please follow the steps below:
git clone https://github.com/ROCmSoftwarePlatform/apex.git -b dev/hubertlu/fused_adam_debug
pytest run_optimizers/test_fused_optimizer.py::TestFusedAdam -s 2>&1 | tee fused_adam_debug.log
The failure in test_half (test_fused_optimizer.TestFusedAdam) is observed only on ROCm. After apex.optimizers.FusedAdam is called to update its parameters, NaNs show up sporadically in the outputs; about 99% of the values still match the outputs of torch.optim.Adam.
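
To help narrow down where the NaNs land, here is a minimal sketch of a comparison helper. It is not part of apex or the debug branch: `summarize_mismatch` and its tolerance defaults are hypothetical names and values chosen for illustration. It assumes the FusedAdam and torch.optim.Adam parameters have already been flattened to plain Python floats, for example with `p.detach().float().flatten().tolist()`.

```python
import math


def summarize_mismatch(fused, reference, rel_tol=1e-3, abs_tol=1e-3):
    """Report NaN positions and out-of-tolerance values in `fused`.

    `fused` and `reference` are equal-length flat sequences of floats.
    The tolerances are illustrative, not the values used by the apex tests.
    """
    if len(fused) != len(reference):
        raise ValueError("fused and reference must have the same length")

    # Indices where the fused optimizer produced NaN.
    nan_indices = [i for i, v in enumerate(fused) if math.isnan(v)]

    # Indices where a non-NaN fused value differs from the reference.
    mismatch_indices = [
        i
        for i, (a, b) in enumerate(zip(fused, reference))
        if not math.isnan(a)
        and not math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
    ]

    total = len(fused)
    return {
        "total": total,
        "nan_indices": nan_indices,
        "mismatch_indices": mismatch_indices,
        "nan_fraction": len(nan_indices) / total if total else 0.0,
    }
```

Calling this once per parameter tensor after each optimizer step should show whether the NaN indices are stable across runs or move around, which would help tell a deterministic kernel bug from a race or an uninitialized-memory read.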