ROCm / apex

A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch
BSD 3-Clause "New" or "Revised" License

Skip failing unit tests #61

Closed hubertlu-tw closed 2 years ago

hubertlu-tw commented 2 years ago

Recent PyTorch commits introduced unit test failures of the form "if cached_x.grad_fn.next_functions[1][0].variable is not x: IndexError: tuple index out of range". The same failures are also observed on CUDA (upstream). We will skip the following unit tests for now and re-enable them later.
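For illustration, a temporary skip of this kind could be written with the standard `unittest.skipIf` decorator. This is only a minimal sketch and not the actual change in this PR: the class name, test names, and the `UPSTREAM_REGRESSION_PRESENT` flag are all hypothetical, and the skipped test body merely simulates the reported error.

```python
import io
import unittest

# Hypothetical flag: a real suite would derive this from a PyTorch
# version check or similar, which is not shown here.
UPSTREAM_REGRESSION_PRESENT = True


class ExampleCastTest(unittest.TestCase):  # hypothetical test class
    @unittest.skipIf(
        UPSTREAM_REGRESSION_PRESENT,
        "IndexError in grad_fn.next_functions after recent PyTorch commits; "
        "re-enable once fixed",
    )
    def test_affected(self):
        # Stand-in for the failing check; never runs while the flag is set.
        raise IndexError("tuple index out of range")

    def test_unaffected(self):
        # Tests without the decorator keep running as usual.
        self.assertEqual(1 + 1, 2)


def run_suite():
    """Run the example tests quietly and return the TestResult."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ExampleCastTest)
    runner = unittest.TextTestRunner(stream=io.StringIO(), verbosity=0)
    return runner.run(suite)
```

Skipped tests are still reported in the run summary together with the reason string, which makes it easier to find and re-enable them later.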

Additionally, the test_half failure (test_fused_optimizer.TestFusedAdam) is observed only on ROCm. After apex.optimizers.FusedAdam updates its parameters, NaNs appear sporadically in the outputs, while 99% of the values are correct compared to the outputs of torch.optim.Adam. We are currently looking into the issue.
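To quantify a sporadic mismatch like the one described above, a small helper along these lines could be used. It is a sketch under assumptions, not the code used for this investigation: `summarize_mismatch` and its tolerances are hypothetical, and it works on plain Python lists of floats (for example, tensors flattened and converted with `.tolist()`).

```python
import math


def summarize_mismatch(reference, candidate, rel_tol=1e-3, abs_tol=1e-5):
    """Count NaNs in `candidate` and the fraction of values close to `reference`.

    Hypothetical helper; the tolerances are placeholders, not the values
    used by the apex test suite.
    """
    if len(reference) != len(candidate):
        raise ValueError("inputs must have the same length")
    if not reference:
        raise ValueError("inputs must not be empty")
    nan_count = sum(1 for value in candidate if math.isnan(value))
    # math.isclose returns False when either argument is NaN,
    # so NaN entries are counted as mismatches.
    matching = sum(
        1
        for ref, cand in zip(reference, candidate)
        if math.isclose(ref, cand, rel_tol=rel_tol, abs_tol=abs_tol)
    )
    return {"nan_count": nan_count, "match_fraction": matching / len(reference)}


# Synthetic data only: one NaN injected into an otherwise identical copy.
reference = [0.1 * i for i in range(100)]
candidate = list(reference)
candidate[42] = float("nan")
summary = summarize_mismatch(reference, candidate)
```

Logging the NaN count and match fraction per run may help distinguish an intermittent issue from a systematic one.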