NVIDIA / TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.
https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html
Apache License 2.0

[PyTorch] Fixing hang in `initialize_ub()` for multi-node runs after PR901 removal of MPI-dependence #986

Open · denera opened 5 days ago


Description

Multi-node use cases for comm+GEMM overlap in NeMo started hanging at `initialize_ub()` after PR #901 was merged.

The cause is the assumption in PR #901 that the user-provided tensor-parallel group is always equivalent to the intra-node communicator used by the old MPI-based bootstrapping. Under that assumption, the new torch.distributed-based bootstrapping can initialize Userbuffers with local rank/size information that does not span the entire physical node, causing the CUDA Multicast shareable-handle exchange (over Unix domain sockets) to silently fail and hang.
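For context, here is a hypothetical layout (illustrative numbers only, not code from this PR or from TransformerEngine) showing why a rank's position within its tensor-parallel group is not the same as its position on the physical node whenever the TP group does not span the whole node:

```python
# Hypothetical layout: 2 nodes x 8 GPUs, tensor-parallel size 4.
# World ranks 0-7 live on node 0, ranks 8-15 on node 1.
gpus_per_node = 8
tp_size = 4

for rank in (5, 13):
    tp_group_start = (rank // tp_size) * tp_size  # first world rank in this TP group
    tp_local_rank = rank - tp_group_start         # "local" rank within the TP group
    node_local_rank = rank % gpus_per_node        # rank within the physical node

    # Whenever tp_size < gpus_per_node the two disagree, so Userbuffers bootstrapped
    # from the TP group sees a "node" of 4 GPUs instead of 8, and the Unix-domain-socket
    # exchange of CUDA Multicast shareable handles never completes.
    print(f"rank {rank}: tp_local_rank={tp_local_rank}, node_local_rank={node_local_rank}")
```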

This PR re-implements the equivalent of the old MPI-based bootstrapping logic inside `initialize_ub()` via torch.distributed collectives. The `initialize_ub()` API reverts to the old interface, where the user provides only a `tp_size` instead of the TP group. The intra-node process group is constructed internally by matching hostnames across ranks, so it always spans the physical node.
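As a rough sketch of the hostname-matching idea described above (the function name and structure here are illustrative assumptions, not the actual implementation in this PR), an intra-node group can be derived with torch.distributed collectives roughly like this:

```python
import socket
import torch.distributed as dist

def intra_node_group():
    """Illustrative sketch: build a process group of the ranks on this physical node.

    Assumes the default process group is already initialized
    (e.g. via dist.init_process_group).
    """
    world_size = dist.get_world_size()
    my_rank = dist.get_rank()

    # Gather every rank's hostname so all ranks agree on the node layout.
    hostnames = [None] * world_size
    dist.all_gather_object(hostnames, socket.gethostname())

    # Bucket world ranks by hostname and create one group per node.
    # new_group() is collective: every rank must call it for every group,
    # even for groups it does not belong to.
    my_group = None
    for host in sorted(set(hostnames)):
        ranks = [r for r, h in enumerate(hostnames) if h == host]
        group = dist.new_group(ranks=ranks)
        if my_rank in ranks:
            my_group = group
    return my_group
```

The node-local rank/size that Userbuffers needs would then come from `dist.get_rank(group=...)` and `dist.get_world_size(group=...)` on that group.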

+@erhoo82 +@jbaczek for viz.
