NVIDIA / apex

A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch
BSD 3-Clause "New" or "Revised" License

NCCL userbuffer for DP RS in DistOpt #1797

Closed WanZzzzzz closed 2 months ago

WanZzzzzz commented 2 months ago

This PR adds a knob "nccl_ub" to the optimizer that decides whether to use NCCL memory allocation and registration (as in PR https://github.com/NVIDIA/apex/pull/1796) for DP ReduceScatter communication. When "nccl_ub" is True, the RS output shard buffer is allocated up front as a single contiguous chunk (just like the input buffer) instead of being allocated lazily during training. This works around a limitation on the NCCL side: NCCL aligns registered buffers to 512MB even when only a small buffer is allocated.
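For illustration, here is a minimal sketch of the allocation pattern described above. It is not apex's actual DistributedFusedAdam code; the class and parameter names (`ShardBufferSketch`, `bucket_sizes`, `get_shard`) are made up. It only shows the idea that, with a flag like `nccl_ub=True`, all RS output shards are carved as views out of one contiguous buffer allocated at initialization, so a single large region can be registered with NCCL instead of many small lazily created buffers, each of which would pay the 512MB alignment cost.

```python
# Illustrative sketch only -- not the apex implementation.
import torch

class ShardBufferSketch:
    """Manages ReduceScatter output shards, one per gradient bucket."""

    def __init__(self, bucket_sizes, world_size, nccl_ub=False,
                 dtype=torch.float32, device="cuda"):
        self.nccl_ub = nccl_ub
        self.dtype = dtype
        self.device = device
        # Each rank receives a 1/world_size shard of every bucket.
        self.shard_sizes = [s // world_size for s in bucket_sizes]

        if nccl_ub:
            # Preallocate all output shards as one contiguous chunk
            # (mirroring how the input buffer is handled), so the
            # NCCL alignment/registration overhead is paid once.
            total = sum(self.shard_sizes)
            self._chunk = torch.empty(total, dtype=dtype, device=device)
            self.shards, offset = [], 0
            for size in self.shard_sizes:
                self.shards.append(self._chunk[offset:offset + size])
                offset += size
        else:
            # Lazy path: each shard is allocated on first use during training.
            self.shards = [None] * len(self.shard_sizes)

    def get_shard(self, i):
        """Return the RS output shard for bucket i, allocating if needed."""
        if self.shards[i] is None:
            self.shards[i] = torch.empty(self.shard_sizes[i],
                                         dtype=self.dtype, device=self.device)
        return self.shards[i]
```

In the preallocated case, each `shards[i]` is just a view into `self._chunk`, so the memory that NCCL sees (and registers) is one buffer regardless of how many buckets the optimizer uses.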