NVIDIA / TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.
https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html
Apache License 2.0

Hang when training with MPI with --tp-comm-overlap turned on #989

Closed: lwmlyy closed this issue 3 days ago

lwmlyy commented 4 days ago

TE version: 1.7.0+4e7caa1
PyTorch version: 2.2.0a0+81ea7a4

Some notes: using the same four nodes and the same Docker image, the code runs for about an hour before the hang first appears, and every subsequent job then hangs as well.


Full log for the hanging job: test_mpi_mcore-launcher-0_2024-07-04 16_30_59.txt

Full log for the successful job: test_mpi_mcore-launcher-0_2024-07-04 16_01_48.txt

lwmlyy commented 3 days ago

In case anyone is interested, the issue is solved by adding "-mca btl_tcp_if_include eth0" to the MPI command. This restricts Open MPI's TCP transport (BTL) to the eth0 interface, so it does not attempt connections over unreachable or virtual interfaces, a common cause of multi-node hangs.
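For reference, a minimal sketch of where the flag goes in an mpirun launch. Only the -mca flag and the --tp-comm-overlap option are from this issue; the process count, hostfile, script name, and remaining arguments are placeholders:

```bash
# Hypothetical multi-node launch; only -mca btl_tcp_if_include eth0
# (the fix) and --tp-comm-overlap (from the issue title) are from the
# report. Everything else is an illustrative placeholder.
mpirun -np 32 --hostfile hostfile \
    -mca btl_tcp_if_include eth0 \
    python pretrain_gpt.py --tp-comm-overlap  # remaining args omitted
```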