Closed sdonoso closed 1 month ago
You pretty much did your own diagnostics: your MPI installation is not working correctly (the MPI hello world doesn't run as expected). This is not an NCCL issue. Ask the OpenMPI community for help if you can't figure out the fix on your own.
I compiled with MPI=1 and checked that both nodes have the same version of OpenMPI. I compiled OpenMPI with UCX. When I run the following:
The process hangs.
If I try the following:
the process hangs after the hello world.
The Environment
I have two nodes, each with 8 A100 GPUs. If I run with
np 1
it works well, but that is not multi-node. How can I fix it?
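A hang like this is often easier to pin down by checking the launcher before anything MPI- or NCCL-related runs. Below is a minimal sketch, not an official procedure: it asks `mpirun` to start the plain `hostname` program once per node and reports a timeout instead of hanging forever. The host names `node1` and `node2` are placeholders, the 30-second timeout is an arbitrary choice, and it assumes `mpirun` is on the PATH and accepts `-np` and `--host` as OpenMPI's launcher does.

```python
import shlex
import subprocess


def build_mpirun_cmd(hosts, program=("hostname",)):
    """Return an mpirun argv that starts one process per listed host."""
    if not hosts:
        raise ValueError("at least one host is required")
    return ["mpirun", "-np", str(len(hosts)), "--host", ",".join(hosts), *program]


def run_check(hosts, timeout_s=30):
    """Run the launch test and return (ok, message) instead of hanging."""
    cmd = build_mpirun_cmd(hosts)
    print("running:", shlex.join(cmd))
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # The launcher itself never returned: look at ssh and network setup.
        return False, f"timed out after {timeout_s}s"
    except FileNotFoundError:
        return False, "mpirun not found on PATH"
    return result.returncode == 0, result.stdout + result.stderr


if __name__ == "__main__":
    # node1 and node2 are hypothetical host names; replace with your own.
    ok, message = run_check(["node1", "node2"])
    print("OK" if ok else "FAILED")
    print(message)
```

If this prints both host names, remote launch works and the next thing to test is an MPI hello world across the two nodes. If it times out, the problem is in the launch path (ssh, firewall, interface selection) rather than in MPI communication or NCCL.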