ROCm / apex

A PyTorch Extension: Tools for easy mixed precision and distributed training in PyTorch
BSD 3-Clause "New" or "Revised" License

Failing tests in --peer_memory #92

Open hubertlu-tw opened 1 year ago

hubertlu-tw commented 1 year ago

Please see the comment in the PR where we enabled the --peer_memory and --nccl_p2p extensions: https://github.com/ROCmSoftwarePlatform/apex/pull/87#issuecomment-1239989909

Some tests fail sporadically on ROCm when running the following test script:

cd apex/contrib/peer_memory
torchrun --nproc_per_node 2 peer_halo_exchange_module_tests.py
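
Because the failures are sporadic, a single run may pass. One way to estimate how often they occur is to repeat the command and count the failing runs. The helper below is only a sketch: `count_failures` and the `LOG_DIR` variable are hypothetical names, not part of apex, and it assumes a failing run exits with a non-zero status. If the test script only prints mismatches without failing, search the saved logs instead.

```shell
#!/bin/sh
# Hypothetical helper (not part of apex): run a command N times,
# save each run's output to a log file, and print the number of failed runs.
# Assumption: a failing run exits with a non-zero status.
count_failures() {
    runs="$1"
    shift
    # Use LOG_DIR if the caller set it, otherwise create a temporary directory.
    log_dir="${LOG_DIR:-$(mktemp -d)}"
    failures=0
    i=1
    while [ "$i" -le "$runs" ]; do
        if ! "$@" > "${log_dir}/run_${i}.log" 2>&1; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    # Report the log location on stderr so stdout carries only the count.
    echo "logs saved in ${log_dir}" >&2
    echo "${failures}"
}

# Example usage (commented out so that sourcing this file runs nothing):
# cd apex/contrib/peer_memory
# count_failures 20 torchrun --nproc_per_node 2 peer_halo_exchange_module_tests.py
```

The per-run logs make it possible to compare a passing run with a failing one, which is usually the first step in narrowing down an intermittent failure.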