ROCm / apex

A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch
BSD 3-Clause "New" or "Revised" License

NaNs in test_half (test_fused_optimizer.TestFusedAdam) #63

Open jithunnair-amd opened 2 years ago

jithunnair-amd commented 2 years ago

The test_half (test_fused_optimizer.TestFusedAdam) failure is observed only on ROCm. After apex.optimizers.FusedAdam is called to update its parameters, NaNs show up sporadically in the outputs; about 99% of the values still match the outputs produced with torch.optim.Adam.
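To make "sporadic" concrete, it can help to count the NaNs and record where they occur, separately from ordinary numerical mismatches. The following is a minimal sketch, assuming the reference parameters (from torch.optim.Adam) and the FusedAdam parameters are both available as tensors of the same shape. The helper name `nan_report` and the tolerances are hypothetical and are not part of apex or its test suite.

```python
import torch


def nan_report(reference, candidate, rtol=1e-3, atol=1e-3):
    """Summarize NaNs and finite mismatches in `candidate` relative to `reference`.

    Hypothetical helper for debugging; tolerances are illustrative assumptions.
    """
    # Compare in float32 on the CPU so half-precision inputs are handled uniformly.
    ref = reference.detach().float().cpu()
    out = candidate.detach().float().cpu()

    nan_mask = torch.isnan(out)
    # isclose returns False wherever `out` is NaN, so exclude those positions
    # to count only finite values that differ beyond the tolerance.
    close = torch.isclose(out, ref, rtol=rtol, atol=atol)
    mismatch_mask = ~close & ~nan_mask

    return {
        "total": out.numel(),
        "nan_count": int(nan_mask.sum()),
        "mismatch_count": int(mismatch_mask.sum()),
        "nan_indices": torch.nonzero(nan_mask.flatten()).flatten().tolist(),
    }


if __name__ == "__main__":
    # Synthetic stand-in for real optimizer outputs: one injected NaN.
    ref = torch.linspace(-1.0, 1.0, steps=100)
    out = ref.clone().half()
    out[7] = float("nan")
    print(nan_report(ref, out))
```

In an actual run, `reference` would be a parameter updated by torch.optim.Adam and `candidate` the corresponding parameter updated by apex.optimizers.FusedAdam. Logging the NaN indices per step may show whether they cluster at particular offsets.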

hubertlu-tw commented 2 years ago

The failing unit test above was introduced by a PyTorch commit made sometime between the rocm/pytorch:rocm4.3.1_ubuntu18.04_py3.6_pytorch_1.9.0 image and 2021-12-01.

hubertlu-tw commented 2 years ago

To investigate this issue further, run the following:

git clone https://github.com/ROCmSoftwarePlatform/apex.git -b dev/hubertlu/fused_adam_debug
pytest run_optimizers/test_fused_optimizer.py::TestFusedAdam -s 2>&1 | tee fused_adam_debug.log