lwmlyy closed this issue 3 days ago
TE version: 1.7.0+4e7caa1
PyTorch version: 2.2.0a0+81ea7a4
Some notes: Using the same four nodes and the same Docker environment, the code runs fine for about an hour, then the hang shows up, and every later job hangs as well.
Full log for the hung job: test_mpi_mcore-launcher-0_2024-07-04 16_30_59.txt
Full log for the successful job: test_mpi_mcore-launcher-0_2024-07-04 16_01_48.txt
In case anyone is interested, the issue was solved by adding "-mca btl_tcp_if_include eth0" to the MPI command. This restricts Open MPI's TCP transport to the eth0 interface.
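For anyone hitting the same hang, here is a rough sketch of where that flag might go in a launch command. Only the "-mca btl_tcp_if_include eth0" part comes from this thread; the process count, the hostfile name, and the train.py script are placeholders I made up for illustration, and eth0 may not be the right interface name on your nodes. The snippet only prints the command (a dry run) so you can check it before launching.

```shell
# Placeholder values: NP, HOSTFILE and train.py are hypothetical,
# not taken from the original job. Adjust them to your setup.
NIC="eth0"            # interface that worked for the reporter
NP=32                 # placeholder: total number of MPI processes
HOSTFILE="hostfile"   # placeholder: file listing the four nodes

# Restrict Open MPI's TCP BTL to a single interface so the ranks
# do not try to connect over an unroutable one (e.g. a docker bridge).
MPI_CMD="mpirun -np $NP --hostfile $HOSTFILE -mca btl_tcp_if_include $NIC python train.py"

# Dry run: print the command instead of executing it.
echo "$MPI_CMD"
```

Running `ip addr` (or `ifconfig`) on each node is a quick way to check which interface name actually carries the inter-node traffic before hard-coding eth0.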