Closed: zhangyongsdu closed this issue 3 years ago
TensorFlow cannot see GPUs across multiple nodes; I think this is expected behavior, since the distribution across nodes is handled by MPI. To check whether your GPUs are really being used, you can execute nvidia-smi
on each node.
@njzjz, I can see 4 GPUs via nvidia-smi from node 1; nvidia-smi cannot find GPUs from node 2. The support staff of the supercomputer told me nvidia-smi is only executed on node 1, which is why only 4 GPUs are found. I guess TensorFlow likewise only detects the 4 GPUs on node 1. The support staff were not able to make TensorFlow detect the GPUs on node 2. Do you have any suggestions?
Please see the outlog from nvidia-smi:
Mon Aug 23 12:55:43 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01 Driver Version: 460.73.01 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:3D:00.0 Off | 0 |
| N/A 35C P0 42W / 300W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000000:3E:00.0 Off | 0 |
| N/A 33C P0 43W / 300W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... On | 00000000:B1:00.0 Off | 0 |
| N/A 35C P0 41W / 300W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... On | 00000000:B2:00.0 Off | 0 |
| N/A 36C P0 42W / 300W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Can't you manually log in to node 2? If not, you may consider using MPI to execute nvidia-smi, i.e.:
mpirun -np $PBS_NGPUS --map-by ppr:1:numa nvidia-smi
@njzjz I can see 8 GPUs (0-3 on node 1 and 0-3 on node 2), which have different UUIDs. GPU 0 (or 1, 2, 3) on node 1 and node 2 have the same PCI bus ID.
[yxz565@gadi-gpu-v100-0094 ~]$ mpirun -np 2 --map-by ppr:1:node nvidia-smi --list-gpus
GPU 0: Tesla V100-SXM2-32GB (UUID: GPU-1df5ae88-6af9-9c27-2165-67d5cddba117)
GPU 0: Tesla V100-SXM2-32GB (UUID: GPU-eb9c8869-b124-5234-7a2f-1c9bdef3de9f)
GPU 1: Tesla V100-SXM2-32GB (UUID: GPU-66573665-9436-2d47-cb45-3243e436c51f)
GPU 1: Tesla V100-SXM2-32GB (UUID: GPU-11b6f0dc-d53e-5293-7eaf-9489d6d27c36)
GPU 2: Tesla V100-SXM2-32GB (UUID: GPU-86f4b34d-526b-2c6e-c6e1-d627d863358e)
GPU 2: Tesla V100-SXM2-32GB (UUID: GPU-74a33df5-45e0-7c31-dd18-44324f58c46c)
GPU 3: Tesla V100-SXM2-32GB (UUID: GPU-bf88d568-c503-1564-b971-3414392b3748)
GPU 3: Tesla V100-SXM2-32GB (UUID: GPU-3c45135b-df72-fad7-1f3b-ce3d3b47d835)
[yxz565@gadi-gpu-v100-0094 ~]$ mpirun -np 2 --map-by ppr:1:node nvidia-smi
Mon Aug 23 14:32:32 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01 Driver Version: 460.73.01 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
Mon Aug 23 14:32:32 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01 Driver Version: 460.73.01 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:3D:00.0 Off | 0 |
| N/A 37C P0 41W / 300W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 0 Tesla V100-SXM2... On | 00000000:3D:00.0 Off | 0 |
| N/A 36C P0 40W / 300W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000000:3E:00.0 Off | 0 |
| N/A 35C P0 42W / 300W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000000:3E:00.0 Off | 0 |
| N/A 35C P0 43W / 300W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... On | 00000000:B1:00.0 Off | 0 |
| N/A 35C P0 42W / 300W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... On | 00000000:B1:00.0 Off | 0 |
| N/A 37C P0 43W / 300W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... On | 00000000:B2:00.0 Off | 0 |
| N/A 37C P0 43W / 300W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
| 3 Tesla V100-SXM2... On | 00000000:B2:00.0 Off | 0 |
| N/A 38C P0 43W / 300W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
Can you check the GPU-Util on the two nodes while LAMMPS is running? The expected behavior is that all GPUs are used.
@njzjz I use the dp2.00b4 offline package and request 8 GPUs from 2 nodes. The GPUs on node 2 are not used at all. There are 8 processes on each GPU of node 1, while there are no processes on the GPUs of node 2. The outlog also indicates that TensorFlow only detects and utilizes the 4 GPUs on one node.
PBS script:
export PATH=/scratch/qf9/yxz565/softwares/dp200b4-cuda11.3-gpu-offline/bin:$PATH
mpirun -np 8 lmp -in alsi
GPU utilization:
Node 0 (gadi-gpu-v100-0142):
GPU_ID %GPU GPU_MEM PID GPU_POWER(W)
0 39 411.0MiB 2398534 63.184
0 39 379.0MiB 2398541 63.184
0 39 357.0MiB 2398537 63.184
0 39 349.0MiB 2398538 63.184
0 39 397.0MiB 2398536 63.184
0 39 379.0MiB 2398535 63.184
0 39 351.0MiB 2398540 63.184
0 39 363.0MiB 2398539 63.184
1 0 305.0MiB 2398534 57.489
1 0 305.0MiB 2398541 57.489
1 0 305.0MiB 2398537 57.489
1 0 305.0MiB 2398538 57.489
1 0 305.0MiB 2398536 57.489
1 0 305.0MiB 2398535 57.489
1 0 305.0MiB 2398540 57.489
1 0 305.0MiB 2398539 57.489
2 0 461.0MiB 2398534 67.14
2 0 461.0MiB 2398541 67.14
2 0 461.0MiB 2398537 67.14
2 0 461.0MiB 2398538 67.14
2 0 461.0MiB 2398536 67.14
2 0 461.0MiB 2398535 67.14
2 0 461.0MiB 2398540 67.14
2 0 461.0MiB 2398539 67.14
3 0 461.0MiB 2398534 68.052
3 0 461.0MiB 2398541 68.052
3 0 461.0MiB 2398537 68.052
3 0 461.0MiB 2398538 68.052
3 0 461.0MiB 2398536 68.052
3 0 461.0MiB 2398535 68.052
3 0 461.0MiB 2398540 68.052
3 0 461.0MiB 2398539 68.052
PID S RSS VSZ %MEM TIME %CPU COMMAND
Node 1 (gadi-gpu-v100-0149):
GPU_ID %GPU GPU_MEM PID GPU_POWER(W)
0 0 0 40.893
1 0 0 42.36
2 0 0 40.467
3 0 0 42.811
PID S RSS VSZ %MEM TIME %CPU COMMAND
outlog:
pciBusID: 0000:3d:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
2021-08-25 12:43:08.221869: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 2 with properties:
pciBusID: 0000:b1:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
2021-08-25 12:43:08.237019: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 1 with properties:
pciBusID: 0000:3e:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
2021-08-25 12:43:08.241326: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 3 with properties:
pciBusID: 0000:b2:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
2021-08-25 12:43:08.424851: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0, 1, 2, 3
2021-08-25 12:43:08.424928: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-08-25 12:43:10.247933: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-08-25 12:43:10.247969: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264]      0 1 2 3
2021-08-25 12:43:10.247977: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0:   N Y Y Y
2021-08-25 12:43:10.247979: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 1:   Y N Y Y
2021-08-25 12:43:10.247981: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 2:   Y Y N Y
2021-08-25 12:43:10.247983: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 3:   Y Y Y N
As the MPI in the offline package is not built against PBS, you may try mpirun -machinefile $PBS_NODEFILE
to manually tell MPI the list of nodes.
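As a quick sanity check, you can verify that the machinefile actually lists both nodes before handing it to mpirun. This is a sketch with a made-up nodefile standing in for $PBS_NODEFILE (the real file typically lists one hostname per allocated slot):

```shell
# Simulated nodefile (hypothetical contents; in the job it is $PBS_NODEFILE).
cat > nodefile.example <<'EOF'
gadi-gpu-v100-0142
gadi-gpu-v100-0142
gadi-gpu-v100-0142
gadi-gpu-v100-0142
gadi-gpu-v100-0149
gadi-gpu-v100-0149
gadi-gpu-v100-0149
gadi-gpu-v100-0149
EOF
# Count distinct hosts; ranks can only land on node 2 if it appears here.
sort -u nodefile.example | wc -l   # prints 2 for this example
# In the real job this would become:
#   mpirun -machinefile $PBS_NODEFILE -n $PBS_NGPUS lmp -in alsi
```

If the count is 1, the MPI launcher never learned about the second node, which would match the symptom of all ranks piling onto node 1.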
I have checked the utilization of the GPUs when LAMMPS is run with mpirun -machinefile $PBS_NODEFILE. GPU utilization on both nodes is 0, and an out-of-memory error pops up for a very small cell (~100,000 atoms).
Node 1
[yxz565@gadi-gpu-v100-0056 ~]$ nvidia-smi
Thu Aug 26 09:41:07 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01 Driver Version: 460.73.01 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:3D:00.0 Off | 0 |
| N/A 36C P0 69W / 300W | 14657MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000000:3E:00.0 Off | 0 |
| N/A 33C P0 66W / 300W | 14657MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... On | 00000000:B1:00.0 Off | 0 |
| N/A 35C P0 62W / 300W | 14659MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... On | 00000000:B2:00.0 Off | 0 |
| N/A 36C P0 68W / 300W | 14659MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
Node 2
[yxz565@gadi-gpu-v100-0055 ~]$ nvidia-smi
Thu Aug 26 09:41:52 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01 Driver Version: 460.73.01 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:3D:00.0 Off | 0 |
| N/A 34C P0 56W / 300W | 32461MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000000:3E:00.0 Off | 0 |
| N/A 33C P0 56W / 300W | 20419MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... On | 00000000:B1:00.0 Off | 0 |
| N/A 33C P0 56W / 300W | 20419MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... On | 00000000:B2:00.0 Off | 0 |
| N/A 34C P0 59W / 300W | 20419MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
Have you also added -n $PBS_NGPUS?
Yes, I have added -n $PBS_NGPUS.
The current GPU usage of deepmd-kit relies on CPU processes to initiate work on the GPUs. In general, the recommended setting is one process per GPU. When using DP across two nodes, users need to request resources with #PBS -l select=2:ncpus=4:ngpus=4.
An example pbs job script is given below:
#!/bin/sh
#PBS -q P100_4_30
#PBS -N Al-Cu
#PBS -o out
#PBS -e err
#PBS -l select=2:ncpus=4:ngpus=4
#source /home/wanrun/denghui/dp/env.sh
module load cuda/10.1
module load cuDNN/7.6.0-cuda10.1
source /opt/intel/parallel_studio_xe_2019/psxevars.sh
mpirun -n 8 /opt/deepmd-kit-2.0.0.b3/bin/lmp -in set_MC.in > log.lammps 2>&1
These directives make the number of CPU processes launched on each node match the GPU resources on that node.
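If ranks still pile onto one node's GPUs, a common workaround (a sketch, not something from this thread; the wrapper name bind_gpu.sh is made up) is to pin each local MPI rank to a single GPU via CUDA_VISIBLE_DEVICES. Open MPI exports OMPI_COMM_WORLD_LOCAL_RANK, the rank index local to each node:

```shell
# Hypothetical wrapper: restrict each local MPI rank to one GPU.
cat > bind_gpu.sh <<'EOF'
#!/bin/sh
# OMPI_COMM_WORLD_LOCAL_RANK is set by Open MPI per node; default to 0.
export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK:-0}
exec "$@"
EOF
chmod +x bind_gpu.sh
# Simulate local rank 3; the wrapped command sees only GPU 3:
OMPI_COMM_WORLD_LOCAL_RANK=3 ./bind_gpu.sh printenv CUDA_VISIBLE_DEVICES   # prints: 3
```

In the job script this would wrap the LAMMPS command, e.g. mpirun -n 8 ./bind_gpu.sh lmp -in set_MC.in, so each of the 4 ranks per node gets its own GPU instead of all ranks opening GPU 0.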
The issue can be avoided by building deepmd-kit from source with the CMake flag -DUSE_CUDA_TOOLKIT=true.
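For reference, such a build might look like the following. This is only a sketch under assumptions: the directory layout, the $tensorflow_root and $deepmd_root variables, and any flags other than -DUSE_CUDA_TOOLKIT are placeholders that may differ between deepmd-kit versions:

```shell
# Placeholder paths; adjust to your environment.
cd deepmd-kit/source
mkdir -p build && cd build
cmake -DUSE_CUDA_TOOLKIT=true \
      -DTENSORFLOW_ROOT=$tensorflow_root \
      -DCMAKE_INSTALL_PREFIX=$deepmd_root ..
make -j8 && make install
```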
On a supercomputer using PBS scripts, I requested 8 GPUs on 2 nodes, but TensorFlow can only detect the 4 GPUs on node 1; node 2 is ignored. I have noticed this issue from 2.0.0b0 to 2.0.0b4.
The structure of a node is:
2 x 24-core Intel Xeon Platinum 8268 (Cascade Lake) 2.9 GHz CPUs per node
384 GB RAM per node
2 CPU sockets per node, each with 2 NUMA nodes
12 CPU cores per NUMA node
96 GB local RAM per NUMA node
4 x Nvidia Tesla Volta V100-SXM2-32GB per node
480 GB local SSD disk per node
Max request of 960 CPU cores (80 GPUs)
deepmd-kit version: v2.0.0.b4-43-gd42ca99
PBS script:
#!/bin/bash
#PBS -q gpuvolta
#PBS -l ngpus=8
#PBS -l ncpus=96
#PBS -l walltime=0:01:00
#PBS -l mem=128GB
#PBS -l jobfs=10GB
# For licensed software, you have to specify it to get the job running. For unlicensed software, you should also specify it to help us analyse the software usage on our system.
#PBS -l software=tensorflow
#PBS -l wd
module load cudnn/8.2.2-cuda11.4
module load cuda/11.4.1
module load python3/3.7.4
module load nccl/2.8.4-cuda11.0
module load openmpi/4.1.1
module load cmake/3.18.2
nvidia-smi
mpirun -np $PBS_NGPUS --map-by ppr:1:numa /scratch/qf9/yxz565/softwares/LAMMPS-2020OCT/lmp_mpi_dev-0822 -in input
module load cudnn/8.2.2-cuda11.4
module load cuda/11.4.1
module load openmpi/4.1.1
error outlog:
2021-08-23 12:55:54.094383: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-08-23 12:55:54.094392: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-08-23 12:56:00.821390: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1c7f180 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2021-08-23 12:56:00.821435: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Tesla V100-SXM2-32GB, Compute Capability 7.0
2021-08-23 12:56:00.821442: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (1): Tesla V100-SXM2-32GB, Compute Capability 7.0
2021-08-23 12:56:00.821448: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (2): Tesla V100-SXM2-32GB, Compute Capability 7.0
2021-08-23 12:56:00.821452: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (3): Tesla V100-SXM2-32GB, Compute Capability 7.0
2021-08-23 12:56:01.066430: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:3d:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
2021-08-23 12:56:01.104617: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 1 with properties:
pciBusID: 0000:3e:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
2021-08-23 12:56:01.155986: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 2 with properties:
pciBusID: 0000:b1:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s
2021-08-23 12:56:01.214623: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 3 with properties:
pciBusID: 0000:b2:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0