This PR adds a knob, `nccl_ub`, to the optimizer that controls whether NCCL memory allocation and registration (as in https://github.com/NVIDIA/apex/pull/1796) is used for DP ReduceScatter communication. When `nccl_ub` is True, the ReduceScatter output shard buffer is allocated up front as a single contiguous chunk (just like the input buffer) instead of being allocated lazily during training. This works around a limitation on the NCCL side: each registered buffer is aligned up to 512 MB even when the buffer itself is small, so registering many small buffers separately would waste memory.
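A rough sketch of the memory argument, assuming the 512 MB round-up described above (the allocator model below is a stand-in for illustration, not the real NCCL API; `registered_bytes` and the shard sizes are hypothetical):

```python
# Hypothetical model of NCCL registration cost: each registered buffer is
# rounded up to a 512 MB granularity, per the PR description.
GRANULARITY = 512 * 1024 * 1024

def registered_bytes(shard_sizes, nccl_ub):
    """Total bytes reserved for the ReduceScatter output shards."""
    if nccl_ub:
        # Allocate all output shards up front as one contiguous chunk,
        # so only a single registration (one round-up) is needed.
        total = sum(shard_sizes)
        return -(-total // GRANULARITY) * GRANULARITY  # ceil to granularity
    # Lazy allocation during training: each shard registered separately,
    # and each registration is rounded up independently.
    return sum(-(-s // GRANULARITY) * GRANULARITY for s in shard_sizes)

shards = [64 * 1024 * 1024] * 8  # eight 64 MB output shards
print(registered_bytes(shards, nccl_ub=True) // 2**20)   # 512  (one chunk)
print(registered_bytes(shards, nccl_ub=False) // 2**20)  # 4096 (8 x 512 MB)
```

Under this model, pre-allocating the whole chunk costs one round-up instead of one per shard, which is the motivation for initializing the output shard buffer at the beginning.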