zhangyongsdu commented 3 years ago

On a supercomputer with PBS script, I required 8 GPUs on 2 nodes, but tensorflow can only detect 4 GPUs on node 1 with node 2 ignored. I noticed this issue from 2.00b0 to 2.00b4.

The structure of a node is: 2 x 24-core Intel Xeon Platinum 8268 (Cascade Lake) 2.9 GHz CPUs per node 384 GB RAM per node 2 CPU sockets per node, each with 2 NUMA nodes 12 CPU cores per NUMA node 96 GB local RAM per NUMA node 4 x Nvidia Tesla Volta V100-SXM2-32GB per node 480 GB local SSD disk per node Max request of 960 CPU cores (80 GPUs)

deepmd-kit version: v2.0.0.b4-43-gd42ca99

PBS script:

!/bin/bash

PBS -q gpuvolta

PBS -l ngpus=8

PBS -l ncpus=96

PBS -l walltime=0:01:00

PBS -l mem=128GB

PBS -l jobfs=10GB

For licensed software, you have to specify it to get the job running. For unlicensed software, you should also specify it to help us analyse the software usage on our system.

PBS -l software=tensorflow

PBS -l wd

module load cudnn/8.2.2-cuda11.4 module load cuda/11.4.1

module load python3/3.7.4

module load nccl/2.8.4-cuda11.0

module load openmpi/4.1.1

module load cmake/3.18.2

nvidia-smi

mpirun -np $PBS_NGPUS --map-by ppr:1:numa /scratch/qf9/yxz565/softwares/LAMMPS-2020OCT/lmp_mpi_dev-0822 -in input

module load cudnn/8.2.2-cuda11.4 module load cuda/11.4.1
module load openmpi/4.1.1

error outlog:

2021-08-23 12:55:54.094383: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1 2021-08-23 12:55:54.094392: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1 2021-08-23 12:56:00.821390: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1c7f180 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices: 2021-08-23 12:56:00.821435: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Tesla V100-SXM2-32GB, Compute Capability 7.0 2021-08-23 12:56:00.821442: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (1): Tesla V100-SXM2-32GB, Compute Capability 7.0 2021-08-23 12:56:00.821448: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (2): Tesla V100-SXM2-32GB, Compute Capability 7.0 2021-08-23 12:56:00.821452: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (3): Tesla V100-SXM2-32GB, Compute Capability 7.0 2021-08-23 12:56:01.066430: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: pciBusID: 0000:3d:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0 coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s 2021-08-23 12:56:01.104617: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 1 with properties: pciBusID: 0000:3e:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0 coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s 2021-08-23 12:56:01.155986: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 2 with properties: pciBusID: 0000:b1:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0 coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s 2021-08-23 12:56:01.214623: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 3 with properties: pciBusID: 0000:b2:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0 coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s

njzjz commented 3 years ago

TensorFlow cannot find GPUs on the multiple nodes. I think it's an expected behavior. Such thing is conducted by MPI. To check if your GPUs are really used, you may execute nvidia-smi in each node.

zhangyongsdu commented 3 years ago

@njzjz, I can see 4 GPUs by nvidia-smi from node 1. nvidia-smi can not find GPUs from node 2. The supporting sfaff of the supercomputer tell me nvidia-smi is only executed on node 1, therefore only 4 GPUs are found. I guess that tensorflow also only detects the 4 GPUs on node 1. The supporting staff does not have to make tensorflow to detect GPUs on node 2. Do you have any suggestions?

Please see outlog from nvidia-smi

+-----------------------------------------------------------------------------+ | Processes: | | GPU GI CI PID Type Process name GPU Memory | | ID ID Usage | |=============================================================================| | No running processes found | +-----------------------------------------------------------------------------+

njzjz commented 3 years ago

Can't you manually login to node 2? If not, you may consider to use MPI to execute nvidia-smi. That is

mpirun -np $PBS_NGPUS --map-by ppr:1:numa nvidia-smi

zhangyongsdu commented 3 years ago

@njzjz I can see 8 GPU (0-3 on node 1 and 0-3 on node 2) which have different UUID. GPU 0 (or 1,2,3) on node1 and node2 have the same PCI-bus id [yxz565@gadi-gpu-v100-0094 ~]$ mpirun -np 2 --map-by ppr:1:node nvidia-smi --list-gpus GPU 0: Tesla V100-SXM2-32GB (UUID: GPU-1df5ae88-6af9-9c27-2165-67d5cddba117) GPU 0: Tesla V100-SXM2-32GB (UUID: GPU-eb9c8869-b124-5234-7a2f-1c9bdef3de9f) GPU 1: Tesla V100-SXM2-32GB (UUID: GPU-66573665-9436-2d47-cb45-3243e436c51f) GPU 1: Tesla V100-SXM2-32GB (UUID: GPU-11b6f0dc-d53e-5293-7eaf-9489d6d27c36) GPU 2: Tesla V100-SXM2-32GB (UUID: GPU-86f4b34d-526b-2c6e-c6e1-d627d863358e) GPU 2: Tesla V100-SXM2-32GB (UUID: GPU-74a33df5-45e0-7c31-dd18-44324f58c46c) GPU 3: Tesla V100-SXM2-32GB (UUID: GPU-bf88d568-c503-1564-b971-3414392b3748) GPU 3: Tesla V100-SXM2-32GB (UUID: GPU-3c45135b-df72-fad7-1f3b-ce3d3b47d835)

njzjz commented 3 years ago

Can you check the GPU-Util on two nodes when running LAMMPS? The expected behavior is that all GPUs have been used.

zhangyongsdu commented 3 years ago

@njzjz I use the dp2.00b4 offline package and require 8 GPUs from 2 nodes. GPUs on node 2 are not used at all. There are 8 processes on each GPU of node 1, while there are no prcessess on GPUs of node 2. outlog also indicates that tensorflow only detects and utilizes 4 GPUs on one node.

PBS script:

PBS -q gpuvolta

PBS -l ngpus=8

PBS -l ncpus=96

PBS -l walltime=48:00:00

PBS -l mem=128GB

PBS -l jobfs=10GB

For licensed software, you have to specify it to get the job running. For unlicensed software, you should also specify it to help us analyse the software usage on our system.

PBS -l software=tensorflow

PBS -l wd

export PATH=/scratch/qf9/yxz565/softwares/dp200b4-cuda11.3-gpu-offline/bin:$PATH mpirun -np 8 lmp -in alsi

GPU utlization

Node 0 (gadi-gpu-v100-0142):

GPU_ID %GPU GPU_MEM PID GPU_POWER(W) 0 39 411.0MiB 2398534 63.184 0 39 379.0MiB 2398541 63.184 0 39 357.0MiB 2398537 63.184 0 39 349.0MiB 2398538 63.184 0 39 397.0MiB 2398536 63.184 0 39 379.0MiB 2398535 63.184 0 39 351.0MiB 2398540 63.184 0 39 363.0MiB 2398539 63.184 1 0 305.0MiB 2398534 57.489 1 0 305.0MiB 2398541 57.489 1 0 305.0MiB 2398537 57.489 1 0 305.0MiB 2398538 57.489 1 0 305.0MiB 2398536 57.489 1 0 305.0MiB 2398535 57.489 1 0 305.0MiB 2398540 57.489 1 0 305.0MiB 2398539 57.489 2 0 461.0MiB 2398534 67.14 2 0 461.0MiB 2398541 67.14 2 0 461.0MiB 2398537 67.14 2 0 461.0MiB 2398538 67.14 2 0 461.0MiB 2398536 67.14 2 0 461.0MiB 2398535 67.14 2 0 461.0MiB 2398540 67.14 2 0 461.0MiB 2398539 67.14 3 0 461.0MiB 2398534 68.052 3 0 461.0MiB 2398541 68.052 3 0 461.0MiB 2398537 68.052 3 0 461.0MiB 2398538 68.052 3 0 461.0MiB 2398536 68.052 3 0 461.0MiB 2398535 68.052 3 0 461.0MiB 2398540 68.052 3 0 461.0MiB 2398539 68.052

PID S   RSS    VSZ %MEM     TIME %CPU COMMAND

Node 1 (gadi-gpu-v100-0149):

GPU_ID %GPU GPU_MEM PID GPU_POWER(W) 0 0 0 40.893 1 0 0 42.36 2 0 0 40.467 3 0 0 42.811

PID S   RSS    VSZ %MEM     TIME %CPU COMMAND

outlog:

pciBusID: 0000:3d:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0 coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s 2021-08-25 12:43:08.221869: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 2 with properties: pciBusID: 0000:b1:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0 coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s 2021-08-25 12:43:08.237019: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 1 with properties: pciBusID: 0000:3e:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0 coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 836.37GiB/s 2021-08-25 12:43:08.241326: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 3 with properties: pciBusID: 0000:b2:00.0 name: Tesla V100-SXM2-32GB computeCapability: 7.0

2021-08-25 12:43:08.424851: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0, 1, 2, 3 2021-08-25 12:43:08.424928: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0 2021-08-25 12:43:10.247933: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix: 2021-08-25 12:43:10.247969: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0 1 2 3 2021-08-25 12:43:10.247977: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N Y Y Y 2021-08-25 12:43:10.247979: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 1: Y N Y Y 2021-08-25 12:43:10.247981: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 2: Y Y N Y 2021-08-25 12:43:10.247983: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 3: Y Y Y N

njzjz commented 3 years ago

As MPI in the offline package is not built against PBS, I think you may try mpirun -machinefile $PBS_NODEFILE to manually tell the MPI the list of nodes.

zhangyongsdu commented 3 years ago

I have checked the utilizaiton of GPU when LAMMPS is run by mpirun -machinefile $PBS_NODEFILE. GPU-utilization on two nodes are all 0, and an error about our of memory pops out for a very small cell (~100000 atoms).

+-----------------------------------------------------------------------------+ | Processes: | | GPU GI CI PID Type Process name GPU Memory | | ID ID Usage | |=============================================================================| +-----------------------------------------------------------------------------+

Node 2

+-----------------------------------------------------------------------------+ | Processes: | | GPU GI CI PID Type Process name GPU Memory | | ID ID Usage | |=============================================================================| +-----------------------------------------------------------------------------+

njzjz commented 3 years ago

Have you also added -n $PBS_NGPUS?

zhangyongsdu commented 3 years ago

Yes. I have added -n $PBS_NGPUS

denghuilu commented 3 years ago

The current GPU usage of deepmd-kit relies on the CPU process to initiate. Generally, the setting of using one process with one GPU is the most recommended. When using DP across two nodes, users need to use #PBS -l select=2:ncpus=4:ngpus= 4.

An example pbs job script is given below:

#!/bin/sh
#PBS -q P100_4_30
#PBS -N Al-Cu
#PBS -o out
#PBS -e err
#PBS -l select=2:ncpus=4:ngpus=4
#source /home/wanrun/denghui/dp/env.sh
module load cuda/10.1
module load cuDNN/7.6.0-cuda10.1

source /opt/intel/parallel_studio_xe_2019/psxevars.sh

mpirun -n 8 /opt/deepmd-kit-2.0.0.b3/bin/lmp -in set_MC.in > log.lammps 2>&1

Such instructions specify the number of CPU processes launched on each node to match the GPU resources.

zhangyongsdu commented 3 years ago

The issue can be avoid by built deepmd-kit from source with the tag: -DUSE_CUDA_TOOLKIT=true.

deepmodeling / deepmd-kit

GPUs on node 2 can not be detected by tensorflow #1018

!/bin/bash

PBS -q gpuvolta

PBS -l ngpus=8

PBS -l ncpus=96

PBS -l walltime=0:01:00

PBS -l mem=128GB

PBS -l jobfs=10GB

For licensed software, you have to specify it to get the job running. For unlicensed software, you should also specify it to help us analyse the software usage on our system.

PBS -l software=tensorflow

PBS -l wd

module load python3/3.7.4

module load nccl/2.8.4-cuda11.0

module load cmake/3.18.2

PBS -q gpuvolta

PBS -l ngpus=8

PBS -l ncpus=96

PBS -l walltime=48:00:00

PBS -l mem=128GB

PBS -l jobfs=10GB

For licensed software, you have to specify it to get the job running. For unlicensed software, you should also specify it to help us analyse the software usage on our system.

PBS -l software=tensorflow

PBS -l wd