Description
This PR splits off the removal of the Userbuffers MPI dependence from PR #760.
With these changes, Userbuffers is now bootstrapped via callbacks to `torch.distributed` collectives. Without the MPI dependence, Userbuffers is always compiled as part of the PyTorch extension and no longer requires the `NVTE_WITH_USERBUFFERS=1` flag.
The old MPI-based bootstrapping can be re-activated at compile time via `UB_MPI_BOOTSTRAP=1`.
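For context, here is a minimal sketch of the new bootstrapping flow from the user's side, assuming a `torchrun` launch. Apart from `initialize_ub` now taking a tensor-parallel process group (confirmed below under Changes), the argument names and buffer shape are assumptions for illustration, not this PR's exact API:

```python
# Hedged sketch: bootstrapping Userbuffers through torch.distributed
# instead of MPI. Launch with torchrun (one process per GPU); no mpirun
# or MPI environment is required.
import torch
import torch.distributed as dist
import transformer_engine.pytorch as te

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

tp_group = dist.new_group()  # tensor-parallel group (all ranks here)

seq_len, batch, hidden = 2048, 2, 4096
te.initialize_ub(
    [seq_len * batch, hidden],  # communication buffer shape (assumed argument)
    tp_group,                   # process group now required instead of tp size
)
```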
Type of change
- [ ] Documentation change (change only to the documentation, either a fix or new content)
- [ ] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
Changes
- The TE build no longer supports the `NVTE_WITH_USERBUFFERS=1` option; Userbuffers is now always compiled into the PyTorch extensions module.
- MPI collectives in the Userbuffers bootstrapping are replaced with callbacks to `torch.distributed` collectives.
- A new `UB_MPI_BOOTSTRAP=1` option in the TE build activates the old MPI-based Userbuffers bootstrapping.
- `transformer_engine.pytorch.module.base.initialize_ub(...)` is now more conveniently accessible as `transformer_engine.pytorch.initialize_ub(...)`.
- Userbuffer communicators can now be cleaned up via `transformer_engine.pytorch.destroy_ub(...)`.
- `transformer_engine.pytorch.initialize_ub(...)` now requires the tensor-parallel process group instead of just the tensor-parallel size.
- Added a comm+GEMM overlap example with `te.LayerNormMLP` (see the sketch after this list).
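The sketch below illustrates the shape of the new comm+GEMM overlap path end to end: bootstrap the userbuffer communicators over `torch.distributed`, run a `te.LayerNormMLP` with overlap enabled, and tear the communicators down with `destroy_ub`. The `ub_overlap_*` flag names, the `sequence_parallel`/`set_parallel_mode` combination, and the `initialize_ub` buffer shape are assumptions (these names have varied across TE versions); this is a sketch, not the example script shipped by this PR.

```python
# Hedged sketch of comm+GEMM overlap with te.LayerNormMLP.
# Launch with torchrun, one GPU per process.
import torch
import torch.distributed as dist
import transformer_engine.pytorch as te

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
tp_group = dist.new_group()  # tensor-parallel group (all ranks here)
tp_size = dist.get_world_size(tp_group)

seq_len, batch, hidden, ffn_hidden = 1024, 2, 4096, 16384

# Bootstrap userbuffer communicators over torch.distributed (no MPI).
te.initialize_ub([seq_len * batch, hidden], tp_group)

mlp = te.LayerNormMLP(
    hidden,
    ffn_hidden,
    tp_group=tp_group,
    set_parallel_mode=True,
    sequence_parallel=True,  # overlap assumes sequence parallelism (assumed)
    ub_overlap_rs=True,      # overlap reduce-scatter with GEMM (flag name assumed)
    ub_overlap_ag=True,      # overlap all-gather with GEMM (flag name assumed)
).cuda()

# Sequence-parallel input: the sequence dimension is sharded across ranks.
x = torch.randn(seq_len // tp_size, batch, hidden,
                device="cuda", requires_grad=True)
y = mlp(x)
y.sum().backward()

te.destroy_ub()  # new in this PR: clean up userbuffer communicators
dist.destroy_process_group()
```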