NVIDIA / apex

A PyTorch Extension: Tools for easy mixed precision and distributed training in PyTorch
BSD 3-Clause "New" or "Revised" License

[Transformer] Do not use batch_isend_irecv for UCC #1675

Closed Aidyn-A closed 1 year ago

Aidyn-A commented 1 year ago

Transformer tests are failing with the UCC backend because UCC does not implement the coalescing API (`startCoalescing`) that the latest version of `batch_isend_irecv` relies on:

  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1575, in _coalescing_manager
    group._start_coalescing(device)
RuntimeError: Backend uccdoes not implement startCoalescing

This PR is a workaround: it avoids calling `batch_isend_irecv` when the backend is UCC.
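A minimal sketch of what such a fallback could look like, not the actual change in this PR. It assumes the backend name is reported as "ucc" by `torch.distributed.get_backend` and that issuing individual `isend`/`irecv` calls is an acceptable substitute on that backend. The helper name `launch_p2p` and the `(kind, tensor, peer)` tuple format are hypothetical, introduced only for illustration. The `dist` module is passed in as an argument so the dispatch logic can be exercised without an initialized process group.

```python
from typing import Any, List, Optional, Sequence, Tuple

# Hypothetical op description: ("send" | "recv", tensor, peer_rank).
P2PSpec = Tuple[str, Any, int]


def launch_p2p(dist: Any, ops: Sequence[P2PSpec], group: Optional[Any] = None) -> List[Any]:
    """Start point-to-point ops and return their request handles.

    `dist` is expected to be `torch.distributed` (or an object with the same
    get_backend / isend / irecv / P2POp / batch_isend_irecv attributes).
    """
    if not ops:
        return []

    # Assumption: the backend string compares equal to "ucc" when UCC is in use.
    if str(dist.get_backend(group)).lower() == "ucc":
        # UCC raises on _start_coalescing, so issue each op individually
        # instead of going through batch_isend_irecv.
        reqs = []
        for kind, tensor, peer in ops:
            fn = dist.isend if kind == "send" else dist.irecv
            reqs.append(fn(tensor, peer, group=group))
        return reqs

    # Other backends: keep the batched (coalesced) path.
    p2p_ops = [
        dist.P2POp(dist.isend if kind == "send" else dist.irecv, tensor, peer, group)
        for kind, tensor, peer in ops
    ]
    return dist.batch_isend_irecv(p2p_ops)
```

In real use this would be called as `launch_p2p(torch.distributed, ops, group)` after `init_process_group`, with `req.wait()` called on every returned handle. Without batching, the order in which sends and receives are posted on each rank may matter, so the op order should be checked against the peer's.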