ROCm / apex

A PyTorch Extension: Tools for easy mixed precision and distributed training in PyTorch
BSD 3-Clause "New" or "Revised" License

Enable --peer_memory and --nccl_p2p extensions for ROCm #87

Closed (hubertlu-tw closed 1 year ago)

hubertlu-tw commented 2 years ago

These extensions are needed to enable the fast_bottleneck Apex extension. While there are no unit tests for these extensions, a test file, peer_halo_exchange_module_tests.py, has been added for the --peer_memory extension. The tests in that file pass on CUDA but fail intermittently on ROCm.

hubertlu-tw commented 1 year ago

jenkins: retest this please

hubertlu-tw commented 1 year ago

    cd apex/contrib/peer_memory
    torchrun --nproc_per_node 2 peer_halo_exchange_module_tests.py

Some tests failed sporadically on ROCm as follows:

    FAILURE : N,C,H,W = 1,64,168,200, half_halo=1, torch.float16, explicit_nhwc, H-split
    SUCCESS : N,C,H,W = 1,64,168,200, half_halo=1, torch.float16, native nhwc, H-split
    FAILURE : N,C,H,W = 1,64,168,200, half_halo=1, torch.float16, nchw, H-split
    SUCCESS : N,C,H,W = 1,128,84,100, half_halo=1, torch.float16, explicit_nhwc, H-split
    FAILURE : N,C,H,W = 1,128,84,100, half_halo=1, torch.float16, native nhwc, H-split
    SUCCESS : N,C,H,W = 1,128,84,100, half_halo=1, torch.float16, nchw, H-split
    SUCCESS : N,C,H,W = 1,256,42,50, half_halo=1, torch.float16, explicit_nhwc, H-split
    FAILURE : N,C,H,W = 1,256,42,50, half_halo=1, torch.float16, native nhwc, H-split
    SUCCESS : N,C,H,W = 1,256,42,50, half_halo=1, torch.float16, nchw, H-split
    SUCCESS : N,C,H,W = 1,512,21,25, half_halo=1, torch.float16, explicit_nhwc, H-split
    FAILURE : N,C,H,W = 1,512,21,25, half_halo=1, torch.float16, native nhwc, H-split
    SUCCESS : N,C,H,W = 1,512,21,25, half_halo=1, torch.float16, nchw, H-split
    SUCCESS : N,C,H,W = 1,64,200,168, half_halo=1, torch.float16, explicit_nhwc, W-split
    SUCCESS : N,C,H,W = 1,64,200,168, half_halo=1, torch.float16, native nhwc, W-split
    SUCCESS : N,C,H,W = 1,64,200,168, half_halo=1, torch.float16, nchw, W-split
    SUCCESS : N,C,H,W = 1,128,100,84, half_halo=1, torch.float16, explicit_nhwc, W-split
    SUCCESS : N,C,H,W = 1,128,100,84, half_halo=1, torch.float16, native nhwc, W-split
    SUCCESS : N,C,H,W = 1,128,100,84, half_halo=1, torch.float16, nchw, W-split
    SUCCESS : N,C,H,W = 1,256,50,42, half_halo=1, torch.float16, explicit_nhwc, W-split
    SUCCESS : N,C,H,W = 1,256,50,42, half_halo=1, torch.float16, native nhwc, W-split
    SUCCESS : N,C,H,W = 1,256,50,42, half_halo=1, torch.float16, nchw, W-split
    SUCCESS : N,C,H,W = 1,512,25,21, half_halo=1, torch.float16, explicit_nhwc, W-split
    FAILURE : N,C,H,W = 1,512,25,21, half_halo=1, torch.float16, native nhwc, W-split
    SUCCESS : N,C,H,W = 1,512,25,21, half_halo=1, torch.float16, nchw, W-split
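
Because the failures are sporadic, it can help to tally them per memory layout and split direction across several runs. The following is a minimal sketch, not part of the test file: the helper name `summarize_failures` is hypothetical, and it assumes each result line has the `STATUS : ..., <layout>, <split>` shape shown in the output above.

```python
from collections import Counter


def summarize_failures(log_text):
    """Count FAILURE lines per (layout, split) pair.

    Assumes result lines look like the output above, e.g.
    'FAILURE : N,C,H,W = 1,64,168,200, half_halo=1, torch.float16, nchw, H-split'.
    """
    failures = Counter()
    for line in log_text.splitlines():
        status, sep, rest = line.strip().partition(" : ")
        if not sep or status not in ("SUCCESS", "FAILURE"):
            continue  # not a result line
        fields = rest.split(", ")
        if len(fields) < 2:
            continue  # malformed result line
        # The last two comma-space separated fields are layout and split.
        layout, split = fields[-2], fields[-1]
        if status == "FAILURE":
            failures[(layout, split)] += 1
    return failures
```

Applied to the run shown above, this counts 6 failures out of 24 cases: 5 in H-split and 1 in W-split, with 4 of the 6 using native nhwc. A single run is not enough to tell whether that distribution is meaningful, so the counts are best accumulated over repeated runs.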

The issue has been raised here: https://github.com/ROCmSoftwarePlatform/apex/issues/92.