Closed: amathews-amd closed this PR 3 years ago
Fixes https://ontrack.amd.com/browse/MSRCHA-137
Adds 4-byte alignment for NCCL/RCCL workloads to speed them up. The start location of every data partition (one per rank, across the world size) is aligned to a 4-byte boundary.
With upstream changes: https://github.com/microsoft/DeepSpeed/pull/1328
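
To illustrate the alignment described above, here is a minimal sketch, not the code from this PR or from the upstream change. It assumes a flat buffer of fixed-size elements (fp16, 2 bytes each, by default) whose base address is already 4-byte aligned, and it pads the element count so that every rank's partition starts on a 4-byte boundary. The function name `aligned_partition_bounds` and its parameters are hypothetical.

```python
def aligned_partition_bounds(num_elements, world_size,
                             element_size=2, alignment_bytes=4):
    """Split a flat buffer into equal partitions with aligned start offsets.

    Hypothetical helper for illustration only. Returns (padded_total, bounds),
    where bounds is a list of (start, end) element indices, one per rank.
    Assumes the buffer's base address is itself aligned to alignment_bytes.
    """
    if world_size <= 0:
        raise ValueError("world_size must be positive")
    if alignment_bytes % element_size != 0:
        raise ValueError("alignment_bytes must be a multiple of element_size")

    # Number of elements that make up one alignment unit (2 for fp16 / 4 bytes).
    align_elems = alignment_bytes // element_size

    # Pad the total so it divides evenly into world_size partitions,
    # each of which is a whole number of alignment units.
    chunk = align_elems * world_size
    padded_total = -(-num_elements // chunk) * chunk  # ceiling to a multiple of chunk

    partition_size = padded_total // world_size
    bounds = [(rank * partition_size, (rank + 1) * partition_size)
              for rank in range(world_size)]
    return padded_total, bounds


if __name__ == "__main__":
    padded, bounds = aligned_partition_bounds(1001, world_size=4)
    for rank, (start, end) in enumerate(bounds):
        # Byte offset of each partition start; a multiple of 4 by construction.
        print(rank, start, end, start * 2)
```

Padding the total rather than shifting individual partitions keeps all partitions the same size, which is what an all-gather style collective typically expects; whether the actual change pads in this way is an assumption of the sketch.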
Closing this PR, as the changes were cherry-picked from upstream: https://github.com/ROCmSoftwarePlatform/DeepSpeed/pull/42