NVIDIA / TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including support for 8-bit floating point (FP8) precision on Hopper and Ada GPUs, providing better performance with lower memory utilization in both training and inference.
https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html
Apache License 2.0

[C/PyTorch] Removed MPI dependence in Userbuffers #901

Closed by denera 3 weeks ago

denera commented 3 weeks ago

Description

This PR splits the removal of the MPI dependency in Userbuffers off from PR #760.

With these changes, userbuffers is now bootstrapped via callbacks to torch.distributed collectives. Without the MPI dependency, userbuffers is always compiled as part of the PyTorch extension and no longer requires the NVTE_WITH_USERBUFFERS=1 flag.
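To illustrate the callback-based bootstrap pattern (this is a minimal sketch, not the actual Transformer Engine API: `FakeProcessGroup`, `make_bootstrap_callbacks`, and `bootstrap_userbuffers` are hypothetical names, and a single in-process "communicator" stands in for torch.distributed), the idea is that the C++ extension holds function pointers that forward collective operations to Python instead of calling MPI directly:

```python
from typing import Callable, Dict, List

class FakeProcessGroup:
    """Stand-in for torch.distributed: all ranks live in one process."""
    def __init__(self, world_size: int):
        self.world_size = world_size

    def allgather(self, contributions: List[bytes]) -> List[bytes]:
        # In real code, torch.distributed.all_gather_object would
        # collect one contribution from every rank.
        return list(contributions)

def make_bootstrap_callbacks(pg: FakeProcessGroup) -> Dict[str, Callable]:
    # The extension would store these callbacks and invoke them
    # during userbuffers initialization instead of MPI calls.
    return {
        "allgather": pg.allgather,
        "barrier": lambda: None,  # no-op in a single process
    }

def bootstrap_userbuffers(rank_handles: List[bytes],
                          callbacks: Dict[str, Callable]) -> List[bytes]:
    # Exchange every rank's (fake) IPC handle via the callbacks,
    # mimicking an MPI-free bootstrap built on collectives.
    gathered = callbacks["allgather"](rank_handles)
    callbacks["barrier"]()
    return gathered

pg = FakeProcessGroup(world_size=2)
cbs = make_bootstrap_callbacks(pg)
handles = [b"ipc-handle-rank0", b"ipc-handle-rank1"]
print(bootstrap_userbuffers(handles, cbs))
```

The design point is that the only thing the bootstrap needs from the runtime is a small set of collectives, so any backend that provides them (torch.distributed here, MPI previously) can drive initialization.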

The old MPI-based bootstrapping can be re-enabled by setting UB_MPI_BOOTSTRAP=1 at compile time.
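Assuming a standard source build of the PyTorch extension (the exact pip invocation is illustrative, not taken from this PR), the two bootstrap modes above would be selected at compile time roughly like this:

```shell
# Default after this PR: userbuffers is compiled into the PyTorch
# extension and bootstrapped via torch.distributed callbacks; no MPI
# and no NVTE_WITH_USERBUFFERS=1 flag is needed.
pip install .

# Opt back into the legacy MPI-based bootstrap at compile time.
UB_MPI_BOOTSTRAP=1 pip install .
```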


timmoon10 commented 3 weeks ago

/te-ci pytorch
