NVIDIA / nccl-tests

NCCL Tests

Test CUDA failure common.cu:892 'invalid device ordinal' #165

Closed marabgol closed 1 year ago

marabgol commented 1 year ago

I am trying to run a test between two nodes (V100) and I get this error:

mpirun  -mca coll_hcoll_enable 0   -H 10.34.0.37:2,10.34.0.38:2 -np 2  -x UCX_NET_DEVICES=eth0    /opt/nccl-tests/build/all_reduce_perf  -b 8 -e 128M -f 2 -g 1
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
majidnc24MPTXTH: Test CUDA failure common.cu:892 'invalid device ordinal'
 .. majidnc24MPTXTH pid 9972: Test failure common.cu:842
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[25735,1],1]
  Exit code:    2
--------------------------------------------------------------------------

I'd appreciate any hints.

sjeaugey commented 1 year ago

Can you run nvidia-smi through mpirun, i.e.:

mpirun -mca coll_hcoll_enable 0   -H 10.34.0.37:2,10.34.0.38:2 -np 2 nvidia-smi

to make sure you do see 2 GPUs on the node within the MPI environment?
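
For reference, a quick standalone check could show the same thing per rank (a sketch only, not part of nccl-tests; the file name and build line are assumptions, adjust the CUDA paths for your system):

// gpu_check.c -- hypothetical sanity check, not part of nccl-tests.
// Each MPI rank reports its host, how many CUDA devices it can see,
// and the value of CUDA_VISIBLE_DEVICES.
// Build (paths are assumptions):
//   mpicc gpu_check.c -o gpu_check -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -lcudart
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(int argc, char* argv[]) {
  MPI_Init(&argc, &argv);
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  char host[MPI_MAX_PROCESSOR_NAME];
  int len = 0;
  MPI_Get_processor_name(host, &len);

  int ndev = 0;
  cudaError_t err = cudaGetDeviceCount(&ndev);

  const char* visible = getenv("CUDA_VISIBLE_DEVICES");
  printf("rank %d on %s: %d visible GPU(s) (cudaGetDeviceCount: %s), CUDA_VISIBLE_DEVICES=%s\n",
         rank, host, ndev, cudaGetErrorString(err), visible ? visible : "(unset)");

  MPI_Finalize();
  return 0;
}

Launching it with the same mpirun -H ... -np 2 line should print one line per rank; if a rank reports 0 GPUs or an unexpected CUDA_VISIBLE_DEVICES, the MPI environment is hiding devices from that rank.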

marabgol commented 1 year ago

Thanks, yes, I can see two GPUs:

mpirun  -mca coll_hcoll_enable 0   -H 10.34.0.37:2,10.34.0.38:2 -np 2  -x UCX_NET_DEVICES=eth0    nvidia-smi
Fri Aug 18 15:00:55 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
Fri Aug 18 15:00:55 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-PCIE-16GB           Off | 00000001:00:00.0 Off |                  Off |
| N/A   28C    P0              23W / 250W |      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|   0  Tesla V100-PCIE-16GB           Off | 00000001:00:00.0 Off |                  Off |
| N/A   28C    P0              23W / 250W |      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
sjeaugey commented 1 year ago

You're only seeing one GPU. That's not going to work if you launch two tasks per node, as each one will want to use a different GPU.
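
For context on where the error comes from: each test task selects a CUDA device by ordinal based on its rank on the node. A simplified, hypothetical sketch of that selection logic (an illustration, not the actual common.cu code) is:

// device_pick_sketch.cu -- simplified, hypothetical sketch of how a test task
// might pick its CUDA device by ordinal; NOT the actual nccl-tests code.
#include <stdio.h>
#include <cuda_runtime.h>

// localRank: this task's rank on its node (normally derived from MPI).
// gpusPerTask: GPUs used per task (-g in the tests; 1 here).
cudaError_t pickDevice(int localRank, int gpusPerTask) {
  int ndev = 0;
  cudaError_t err = cudaGetDeviceCount(&ndev);
  if (err != cudaSuccess) return err;
  int dev = localRank * gpusPerTask;   // rank 0 -> device 0, rank 1 -> device 1, ...
  if (dev >= ndev)
    fprintf(stderr, "task wants device %d but only %d device(s) visible\n", dev, ndev);
  return cudaSetDevice(dev);           // out-of-range ordinal fails
}

int main(void) {
  // Simulate the second task on a node: with a single visible GPU this fails.
  cudaError_t err = pickDevice(1, 1);
  printf("pickDevice(1, 1): %s\n", cudaGetErrorString(err));
  return err == cudaSuccess ? 0 : 2;
}

With two tasks on a node that exposes only one GPU, the second task asks for ordinal 1, which does not exist, so cudaSetDevice returns cudaErrorInvalidDevice, i.e. the "invalid device ordinal" failure reported at common.cu:892.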

marabgol commented 1 year ago

Thanks for the reply. What am I missing? My MPI works fine across the two nodes, and nvidia-smi works fine on each node as well:

majid@majidnc24MPTXTH:~$ mpirun  -mca coll_hcoll_enable 0   -H 10.34.0.37:2,10.34.0.38:2 -np 4  -x UCX_NET_DEVICES=eth0  /shared/majid/hello.exe
Hello from  1. 4 majidnc24MPTXTH
Hello from  0. 4 majidnc24MPTXTH
Hello from  3. 4 majidnc24OTIQT2
Hello from  2. 4 majidnc24OTIQT2
sjeaugey commented 1 year ago

Perhaps you're running through an allocation system (e.g. SLURM) which restricts the GPUs you can see unless you explicitly request more GPUs?

Can you paste the output of nvidia-smi outside of mpirun?

marabgol commented 1 year ago

I am not using any scheduler. This is the output (I added hostname); it looks like it does not reach the second node:

mpirun  -mca coll_hcoll_enable 0   -H 10.34.0.37:2,10.34.0.38:2 -np 2  -x UCX_NET_DEVICES=eth0     nvidia-smi;hostname
Fri Aug 18 15:43:36 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
Fri Aug 18 15:43:36 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-PCIE-16GB           Off | 00000001:00:00.0 Off |                  Off |
| N/A   28C    P0              23W / 250W |      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|   0  Tesla V100-PCIE-16GB           Off | 00000001:00:00.0 Off |                  Off |
| N/A   28C    P0              23W / 250W |      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
majidnc24MPTXTH
sjeaugey commented 1 year ago

Can you confirm how many GPUs you have on each node? If you only have one, then you should only launch one task per node.

marabgol commented 1 year ago

sure :) one GPU per node, thanks!

majid@majidnc24MPTXTH:~$ nvidia-smi
Fri Aug 18 15:47:01 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-PCIE-16GB           Off | 00000001:00:00.0 Off |                  Off |
| N/A   28C    P0              23W / 250W |      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

majid@majidnc24OTIQT2:~$ nvidia-smi
Fri Aug 18 15:47:06 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-PCIE-16GB           Off | 00000001:00:00.0 Off |                  Off |
| N/A   26C    P0              24W / 250W |      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
sjeaugey commented 1 year ago

Then your mpirun command line should use one slot per host, so each rank runs on a different node and uses its single local GPU:

mpirun  -mca coll_hcoll_enable 0   -H 10.34.0.37:1,10.34.0.38:1 -np 2
marabgol commented 1 year ago

Thanks so much for the help. It works :)

jeffreyyjp commented 1 month ago

@marabgol Hello, could I ask one question? For a two-node test, do I need to configure SSH for passwordless login?