Can you run nvidia-smi through mpirun, i.e.:
mpirun -mca coll_hcoll_enable 0 -H 10.34.0.37:2,10.34.0.38:2 -np 2 nvidia-smi
to make sure you do see 2 GPUs on the node within the MPI environment?
Thanks, yes I can see two GPUs:
mpirun -mca coll_hcoll_enable 0 -H 10.34.0.37:2,10.34.0.38:2 -np 2 -x UCX_NET_DEVICES=eth0 nvidia-smi
Fri Aug 18 15:00:55 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
Fri Aug 18 15:00:55 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla V100-PCIE-16GB Off | 00000001:00:00.0 Off | Off |
| N/A 28C P0 23W / 250W | 0MiB / 16384MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 Tesla V100-PCIE-16GB Off | 00000001:00:00.0 Off | Off |
| N/A 28C P0 23W / 250W | 0MiB / 16384MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
| No running processes found |
+---------------------------------------------------------------------------------------+
You're only seeing one GPU. That's not going to work if you launch two tasks per node, as each one will want to use a different GPU.
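(Background: multi-GPU MPI codes commonly bind each rank to a GPU by its node-local rank, which is why the rank count per node must not exceed the GPU count. A minimal sketch of that pattern, assuming Open MPI's OMPI_COMM_WORLD_LOCAL_RANK environment variable; illustrative only, not necessarily what the benchmark here does:

/* Bind each MPI rank to a GPU by node-local rank. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    const char *lr = getenv("OMPI_COMM_WORLD_LOCAL_RANK"); /* Open MPI-specific */
    int local_rank = lr ? atoi(lr) : 0;
    if (ndev > 0) {
        /* With 2 ranks per node but only 1 GPU, both ranks map to device 0. */
        cudaSetDevice(local_rank % ndev);
        printf("local rank %d -> device %d of %d\n", local_rank, local_rank % ndev, ndev);
    }
    MPI_Finalize();
    return 0;
}
)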
Thanks for the reply. What am I missing? My MPI setup works fine on two nodes, and nvidia-smi works fine on each node as well:
majid@majidnc24MPTXTH:~$ mpirun -mca coll_hcoll_enable 0 -H 10.34.0.37:2,10.34.0.38:2 -np 4 -x UCX_NET_DEVICES=eth0 /shared/majid/hello.exe
Hello from 1. 4 majidnc24MPTXTH
Hello from 0. 4 majidnc24MPTXTH
Hello from 3. 4 majidnc24OTIQT2
Hello from 2. 4 majidnc24OTIQT2
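For reference, a minimal MPI hello-world producing output of this shape; the source of hello.exe isn't shown in the thread, so this is a hypothetical reconstruction:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);
    /* Matches the "Hello from <rank>. <size> <hostname>" lines above. */
    printf("Hello from %d. %d %s\n", rank, size, host);
    MPI_Finalize();
    return 0;
}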
Perhaps you're running through an allocation system (e.g. SLURM) that restricts which GPUs you can see unless you explicitly request more?
Can you paste the output of nvidia-smi outside of mpirun?
I am not using any scheduler. This is the output (I added hostname); it looks like it does not communicate with the second node:
mpirun -mca coll_hcoll_enable 0 -H 10.34.0.37:2,10.34.0.38:2 -np 2 -x UCX_NET_DEVICES=eth0 nvidia-smi;hostname
Fri Aug 18 15:43:36 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
Fri Aug 18 15:43:36 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla V100-PCIE-16GB Off | 00000001:00:00.0 Off | Off |
| N/A 28C P0 23W / 250W | 0MiB / 16384MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 Tesla V100-PCIE-16GB Off | 00000001:00:00.0 Off | Off |
| N/A 28C P0 23W / 250W | 0MiB / 16384MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
| No running processes found |
+---------------------------------------------------------------------------------------+
majidnc24MPTXTH
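(Note: written as nvidia-smi;hostname, the hostname runs once on the launch node after mpirun returns, so it always prints the local hostname regardless of where the ranks ran. To see which node each rank actually landed on, route both commands through mpirun, for example:

mpirun -mca coll_hcoll_enable 0 -H 10.34.0.37:2,10.34.0.38:2 -np 2 -x UCX_NET_DEVICES=eth0 bash -c 'hostname; nvidia-smi'

Also, with -H 10.34.0.37:2,10.34.0.38:2 and -np 2, Open MPI's default by-slot mapping will likely place both ranks on the first host, which is consistent with the identical timestamps in the output above.)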
Can you confirm how many GPUs you have on each node? If you only have one, then you should only launch one task per node.
sure :) one GPU per node, thanks!
majid@majidnc24MPTXTH:~$ nvidia-smi
Fri Aug 18 15:47:01 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla V100-PCIE-16GB Off | 00000001:00:00.0 Off | Off |
| N/A 28C P0 23W / 250W | 0MiB / 16384MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
majid@majidnc24OTIQT2:~$ nvidia-smi
Fri Aug 18 15:47:06 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla V100-PCIE-16GB Off | 00000001:00:00.0 Off | Off |
| N/A 26C P0 24W / 250W | 0MiB / 16384MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
Then your mpirun command line should be:
mpirun -mca coll_hcoll_enable 0 -H 10.34.0.37:1,10.34.0.38:1 -np 2
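For example, to repeat the earlier test with one rank per node:

mpirun -mca coll_hcoll_enable 0 -H 10.34.0.37:1,10.34.0.38:1 -np 2 -x UCX_NET_DEVICES=eth0 /shared/majid/hello.exe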
Thanks so much for the help. It works :)
@marabgol Hello, could I ask one question? For a two-node test, do I need to configure SSH for passwordless login? I am trying to run a test between two nodes (V100) and I got this error:
Appreciate any hints.
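(For reference: Open MPI's default ssh launcher does need non-interactive, passwordless login from the launch node to every other node. A typical key-based setup, assuming the same user account exists on both nodes:

ssh-keygen -t ed25519            # generate a key pair, accept the default path
ssh-copy-id majid@10.34.0.38     # install the public key on the other node
ssh majid@10.34.0.38 hostname    # verify login works without a password prompt
)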