@jirikraus can you take a look at this issue?
Thanks for making me aware, Mark. I would have missed this. I need to wrap up a few other things and will take a look at this later.
I found that the cause is the local domain size: with the same hardware setup (4 nodes, 1 A100 GPU per node), a local domain size of 4096 gives a bandwidth of around 800 GB/s, but a local domain size of 20480 gives around 2.4 TB/s. Is there a problem with the bandwidth calculation?
Hi Mountain-ql, sorry for following up late. I did not have the time to deep dive into this yet. I agree that something is off with the bandwidth calculation. Regarding the performance difference between CUDA-aware MPI and regular MPI, can you provide a few more details on your system: which MPI are you using exactly (version and how it was built), and what is the output of nvidia-smi topo -m on the system you are running on?
Sorry for the late reply. The MPI I used was OpenMPI/4.0.5; it is a preinstalled module on the HPC system, so I don't know how it was built. The output of "nvidia-smi topo -m" is:
        GPU0    GPU1    mlx5_0  mlx5_1  CPU Affinity  NUMA Affinity
GPU0     X      NV12    SYS     SYS     0             0-7
GPU1    NV12     X      SYS     SYS     0             0-7
mlx5_0  SYS     SYS      X      SYS
mlx5_1  SYS     SYS     SYS      X
Legend:
X    = Self
SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX  = Connection traversing at most a single PCIe bridge
NV#  = Connection traversing a bonded set of # NVLinks
Thanks. Can you attach the output of ompi_info -c and ucx_info -b? That will provide the missing information about the MPI you are using.
Sorry for the late reply! Here is the output of "ompi_info -c":

Configured by: hpcglrun
Configured on: Wed Feb 17 12:42:06 CET 2021
Configure host: taurusi6395.taurus.hrsk.tu-dresden.de
Configure command line: '--prefix=/sw/installed/OpenMPI/4.0.5-gcccuda-2020b' '--build=x86_64-pc-linux-gnu' '--host=x86_64-pc-linux-gnu' '--with-slurm' '--with-pmi=/usr' '--with-pmi-libdir=/usr/lib64' '--with-knem=/opt/knem-1.1.3.90mlnx1' '--enable-mpirun-prefix-by-default' '--enable-shared' '--with-cuda=/sw/installed/CUDAcore/11.1.1' '--with-hwloc=/sw/installed/hwloc/2.2.0-GCCcore-10.2.0' '--with-libevent=/sw/installed/libevent/2.1.12-GCCcore-10.2.0' '--with-ofi=/sw/installed/libfabric/1.11.0-GCCcore-10.2.0' '--with-pmix=/sw/installed/PMIx/3.1.5-GCCcore-10.2.0' '--with-ucx=/sw/installed/UCX/1.9.0-GCCcore-10.2.0-CUDA-11.1.1' '--without-verbs'
Built by: hpcglrun
Built on: Wed Feb 17 12:50:42 CET 2021
Built host: taurusi6395.taurus.hrsk.tu-dresden.de
C bindings: yes
C++ bindings: no
Fort mpif.h: yes (all)
Fort use mpi: yes (full: ignore TKR)
Fort use mpi size: deprecated-ompi-info-value
Fort use mpi_f08: yes
Fort mpi_f08 compliance: The mpi_f08 module is available, but due to limitations in the gfortran compiler and/or Open MPI, does not support the following: array subsections, direct passthru (where possible) to underlying Open MPI's C functionality
Fort mpi_f08 subarrays: no
Java bindings: no
Wrapper compiler rpath: runpath
C compiler: gcc
C compiler absolute: /sw/installed/GCCcore/10.2.0/bin/gcc
C compiler family name: GNU
C compiler version: 10.2.0
C char size: 1
C bool size: 1
C short size: 2
C int size: 4
C long size: 8
C float size: 4
C double size: 8
C pointer size: 8
C char align: 1
C bool align: skipped
C int align: 4
C float align: 4
C double align: 8
C++ compiler: g++
C++ compiler absolute: /sw/installed/GCCcore/10.2.0/bin/g++
Fort compiler: gfortran
Fort compiler abs: /sw/installed/GCCcore/10.2.0/bin/gfortran
Fort ignore TKR: yes (!GCC$ ATTRIBUTES NO_ARG_CHECK ::)
Fort 08 assumed shape: yes
Fort optional args: yes
Fort INTERFACE: yes
Fort ISO_FORTRAN_ENV: yes
Fort STORAGE_SIZE: yes
Fort BIND(C) (all): yes
Fort ISO_C_BINDING: yes
Fort SUBROUTINE BIND(C): yes
Fort TYPE,BIND(C): yes
Fort T,BIND(C,name="a"): yes
Fort PRIVATE: yes
Fort PROTECTED: yes
Fort ABSTRACT: yes
Fort ASYNCHRONOUS: yes
Fort PROCEDURE: yes
Fort USE...ONLY: yes
Fort C_FUNLOC: yes
Fort f08 using wrappers: yes
Fort MPI_SIZEOF: yes
Fort integer size: 4
Fort logical size: 4
Fort logical value true: 1
Fort have integer1: yes
Fort have integer2: yes
Fort have integer4: yes
Fort have integer8: yes
Fort have integer16: no
Fort have real4: yes
Fort have real8: yes
Fort have real16: yes
Fort have complex8: yes
Fort have complex16: yes
Fort have complex32: yes
Fort integer1 size: 1
Fort integer2 size: 2
Fort integer4 size: 4
Fort integer8 size: 8
Fort integer16 size: -1
Fort real size: 4
Fort real4 size: 4
Fort real8 size: 8
Fort real16 size: 16
Fort dbl prec size: 8
Fort cplx size: 8
Fort dbl cplx size: 16
Fort cplx8 size: 8
Fort cplx16 size: 16
Fort cplx32 size: 32
Fort integer align: 4
Fort integer1 align: 1
Fort integer2 align: 2
Fort integer4 align: 4
Fort integer8 align: 8
Fort integer16 align: -1
Fort real align: 4
Fort real4 align: 4
Fort real8 align: 8
Fort real16 align: 16
Fort dbl prec align: 8
Fort cplx align: 4
Fort dbl cplx align: 8
Fort cplx8 align: 4
Fort cplx16 align: 8
Fort cplx32 align: 16
C profiling: yes
C++ profiling: no
Fort mpif.h profiling: yes
Fort use mpi profiling: yes
Fort use mpi_f08 prof: yes
C++ exceptions: no
Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes, OMPI progress: no, ORTE progress: yes, Event lib: yes)
Sparse Groups: no
Build CFLAGS: -DNDEBUG -O3 -march=native -fno-math-errno -finline-functions -fno-strict-aliasing
Build CXXFLAGS: -DNDEBUG -O3 -march=native -fno-math-errno -finline-functions
Build FCFLAGS: -O3 -march=native -fno-math-errno
Build LDFLAGS: -L/sw/installed/PMIx/3.1.5-GCCcore-10.2.0/lib64 -L/sw/installed/PMIx/3.1.5-GCCcore-10.2.0/lib -L/sw/installed/libfabric/1.11.0-GCCcore-10.2.0/lib64 -L/sw/installed/libfabric/1.11.0-GCCcore-10.2.0/lib -L/sw/installed/UCX/1.9.0-GCCcore-10.2.0-CUDA-11.1.1/lib64 -L/sw/installed/UCX/1.9.0-GCCcore-10.2.0-CUDA-11.1.1/lib -L/sw/installed/libevent/2.1.12-GCCcore-10.2.0/lib64 -L/sw/installed/libevent/2.1.12-GCCcore-10.2.0/lib -L/sw/installed/hwloc/2.2.0-GCCcore-10.2.0/lib64 -L/sw/installed/hwloc/2.2.0-GCCcore-10.2.0/lib -L/sw/installed/zlib/1.2.11-GCCcore-10.2.0/lib64 -L/sw/installed/zlib/1.2.11-GCCcore-10.2.0/lib -L/sw/installed/GCCcore/10.2.0/lib64 -L/sw/installed/GCCcore/10.2.0/lib -L/sw/installed/CUDAcore/11.1.1/lib64 -L/sw/installed/hwloc/2.2.0-GCCcore-10.2.0/lib -L/sw/installed/libevent/2.1.12-GCCcore-10.2.0/lib64
Build LIBS: -lutil -lm -lrt -lcudart -lpthread -lz -lhwloc -levent_core -levent_pthreads
Wrapper extra CFLAGS:
Wrapper extra CXXFLAGS:
Wrapper extra FCFLAGS: -I${libdir}
Wrapper extra LDFLAGS: -L/sw/installed/hwloc/2.2.0-GCCcore-10.2.0/lib -L/sw/installed/libevent/2.1.12-GCCcore-10.2.0/lib64 -Wl,-rpath -Wl,/sw/installed/hwloc/2.2.0-GCCcore-10.2.0/lib -Wl,-rpath -Wl,/sw/installed/libevent/2.1.12-GCCcore-10.2.0/lib64 -Wl,-rpath -Wl,@{libdir} -Wl,--enable-new-dtags
Wrapper extra LIBS: -lhwloc -ldl -levent_core -levent_pthreads -lutil -lm -lrt -lcudart -lpthread -lz
Internal debug support: no
MPI interface warnings: yes
MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
dl support: yes
Heterogeneous support: no
mpirun default --prefix: yes
MPI_WTIME support: native
Symbol vis. support: yes
Host topology support: yes
IPv6 support: no
MPI1 compatibility: no
MPI extensions: affinity, cuda, pcollreq
FT Checkpoint support: no (checkpoint thread: no)
C/R Enabled Debugging: no
MPI_MAX_PROCESSOR_NAME: 256
MPI_MAX_ERROR_STRING: 256
MPI_MAX_OBJECT_NAME: 64
MPI_MAX_INFO_KEY: 36
MPI_MAX_INFO_VAL: 256
MPI_MAX_PORT_NAME: 1024
MPI_MAX_DATAREP_STRING: 128
Here is the output of "ucx_info -b":
Thanks. I can't spot anything wrong with your software setup. As the performance difference between CUDA-aware MPI and regular MPI on a single node is about 2x, and CUDA-aware MPI is faster for 2 processes on two nodes, I suspect there is an issue with the GPU affinity handling, i.e. ENV_LOCAL_RANK is defined the wrong way (but you seem to have that right), or CUDA_VISIBLE_DEVICES is set in a funky way on the system you are using. As this code has not been updated for quite some time, can you try https://github.com/NVIDIA/multi-gpu-programming-models (also a Jacobi solver, but a simpler code that I regularly use in tutorials)?
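For reference, here is a minimal sketch of the rank-to-GPU mapping described above, selecting the device from the node-local-rank environment variable before any other CUDA call. The variable name OMPI_COMM_WORLD_LOCAL_RANK is an assumption for an Open MPI launch (under srun the equivalent would be SLURM_LOCALID); this illustrates the intended behavior and is not the blog sample's code verbatim.

/* Sketch: bind each MPI rank to one GPU based on its node-local rank.
 * Assumes the launcher exports OMPI_COMM_WORLD_LOCAL_RANK (Open MPI);
 * under srun the equivalent would be SLURM_LOCALID. */
#include <cuda_runtime.h>
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    const char *lrank = getenv("OMPI_COMM_WORLD_LOCAL_RANK");
    int local_rank = lrank ? atoi(lrank) : 0;

    int num_devices = 0;
    cudaGetDeviceCount(&num_devices);
    if (num_devices > 0) {
        /* If CUDA_VISIBLE_DEVICES already restricts each rank to one GPU,
         * num_devices is 1 and every rank correctly ends up on device 0. */
        cudaSetDevice(local_rank % num_devices);
    }

    int dev = -1;
    cudaGetDevice(&dev);
    printf("local rank %d -> CUDA device %d\n", local_rank, dev);

    MPI_Finalize();
    return 0;
}

If nvidia-smi shows both ranks landing on the same GPU while the other one stays idle, the affinity handling is the likely culprit.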
I also checked the math for the bandwidth: the formula used does not account for caches, see https://github.com/NVIDIA-developer-blog/code-samples/blob/master/posts/cuda-aware-mpi-example/src/Host.c#L291, which explains why you are seeing memory bandwidths larger than the hardware peak.
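To make the cache effect concrete, here is a rough illustration of how counting every stencil access as DRAM traffic inflates the reported number. The per-point access counts, domain size, and timing below are assumptions for the sake of the example, not the exact constants used in Host.c.

/* Illustration only: a naive bandwidth estimate that charges every stencil
 * access to DRAM, versus one that accounts for cache reuse.
 * All constants here are assumed for the example. */
#include <stdio.h>

int main(void)
{
    const double nx = 20480.0, ny = 20480.0;  /* local domain per process */
    const double iterations = 2000.0;
    const double elapsed_s  = 10.0;           /* assumed solver time      */
    const double word       = 8.0;            /* bytes per double         */

    /* Naive count: 5 loads (stencil) + 1 store per point, all charged
     * to DRAM.                                                           */
    double naive_bytes  = nx * ny * iterations * (5.0 + 1.0) * word;
    /* With reuse out of L1/L2, DRAM sees roughly 1 load + 1 store per
     * point per sweep, so the real traffic is about a third of that.     */
    double cached_bytes = nx * ny * iterations * (1.0 + 1.0) * word;

    printf("naive estimate      : %.2f TB/s\n", naive_bytes  / elapsed_s / 1e12);
    printf("cache-aware estimate: %.2f TB/s\n", cached_bytes / elapsed_s / 1e12);
    return 0;
}

Either way, the printed figure is an estimate of useful lattice throughput rather than measured DRAM traffic, which is why it can exceed the 1,555 GB/s peak of an A100-40GB.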
Thanks a lot!!
Thanks for the feedback. Closing this as it does not seem to be an issue with the code.
I want to add a data point: I am also finding the CUDA-aware version to be slightly slower than the normal version.
Similar setup: single node, 2 A100 GPUs, 2 ranks. Results are similar for a 2x2 topology.
Normal:
Measured lattice updates: 48.04 GLU/s (total), 24.02 GLU/s (per process)
Measured FLOPS: 240.19 GFLOPS (total), 120.10 GFLOPS (per process)
Measured device bandwidth: 3.07 TB/s (total), 1.54 TB/s (per process)
CUDA-aware:
Measured lattice updates: 46.83 GLU/s (total), 23.41 GLU/s (per process)
Measured FLOPS: 234.13 GFLOPS (total), 117.07 GFLOPS (per process)
Measured device bandwidth: 3.00 TB/s (total), 1.50 TB/s (per process)
Please note that to make this code work, I had to change the device code to initialize after MPI init (I am using OpenMPI 5.0.3); a sketch of that ordering follows below. The MPI has been compiled with CUDA support and OpenFabrics (OFI). I set CUDA_VISIBLE_DEVICES=0,1 and run watch -n .5 nvidia-smi while the job is running; it is using the first two devices only.
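As an aside on the initialization order mentioned above, below is a minimal sketch of selecting the GPU only after MPI_Init, deriving the node-local rank via MPI_Comm_split_type instead of a launcher environment variable. The structure is illustrative and not the commenter's actual change.

/* Sketch: derive the node-local rank from MPI itself and only then touch
 * the CUDA runtime, so that device selection happens after MPI_Init. */
#include <cuda_runtime.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Split the world communicator per shared-memory node; the rank within
     * the resulting communicator is the node-local rank. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    int local_rank = 0;
    MPI_Comm_rank(node_comm, &local_rank);

    int num_devices = 0;
    cudaGetDeviceCount(&num_devices);
    if (num_devices > 0)
        cudaSetDevice(local_rank % num_devices);

    /* ... allocate device buffers and run the Jacobi iterations here ... */

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}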
@Shihab-Shahriar have you profiled the code with Nsight Systems and checked what happens during the MPI calls? I am wondering whether, for the communication between the two A100s in the node, NVLink (i.e. GPUDirect P2P via CUDA IPC) is for some reason not used. If that is the case, CUDA-aware MPI would stage through CPU memory, in which case it could be expected that normal MPI is faster.
An example of how to do this can be found here: https://github.com/NVIDIA/multi-gpu-programming-models/blob/master/mpi/Makefile#L45
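Independent of the profile, a quick way to sanity-check whether the two GPUs can reach each other's memory directly at all (a prerequisite for GPUDirect P2P via CUDA IPC) is the standalone peer-access query sketched below; it is not part of the Jacobi sample.

/* Standalone sketch: query whether device 0 and device 1 can access each
 * other's memory directly (peer-to-peer over NVLink or PCIe). */
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);
    if (count < 2) {
        printf("fewer than 2 visible devices\n");
        return 0;
    }
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    printf("peer access 0->1: %s, 1->0: %s\n",
           can01 ? "yes" : "no", can10 ? "yes" : "no");
    return 0;
}

If this reports yes for both directions but the profile still shows device-to-host copies around the MPI calls, the MPI/UCX transport selection would be the next thing to look at.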
I tried to run jacobi_cuda_aware_mpi and jacobi_cuda_normal_mpi on an HPC system, using 2 A100 GPUs with 40 GB of memory each as devices. The maximum GPU memory bandwidth is 1,555 GB/s, but in the benchmark I got 2.52 TB/s. Also, when the GPUs are on the same node, the CUDA-aware run is slower than the normal one...
This is the normal MPI result, from 2 NVIDIA A100s on the same node:

Topology size: 2 x 1
Local domain size (current node): 20480 x 20480
Global domain size (all nodes): 40960 x 20480
normal-ID= 0
normal-ID= 1
Starting Jacobi run with 2 processes using "A100-SXM4-40GB" GPUs (ECC enabled: 2 / 2):
Iteration: 0 - Residue: 0.250000
Iteration: 100 - Residue: 0.002397
Iteration: 200 - Residue: 0.001204
Iteration: 300 - Residue: 0.000804
Iteration: 400 - Residue: 0.000603
Iteration: 500 - Residue: 0.000483
Iteration: 600 - Residue: 0.000403
Iteration: 700 - Residue: 0.000345
Iteration: 800 - Residue: 0.000302
Iteration: 900 - Residue: 0.000269
Iteration: 1000 - Residue: 0.000242
Iteration: 1100 - Residue: 0.000220
Iteration: 1200 - Residue: 0.000201
Iteration: 1300 - Residue: 0.000186
Iteration: 1400 - Residue: 0.000173
Iteration: 1500 - Residue: 0.000161
Iteration: 1600 - Residue: 0.000151
Iteration: 1700 - Residue: 0.000142
Iteration: 1800 - Residue: 0.000134
Iteration: 1900 - Residue: 0.000127
Stopped after 2000 iterations with residue 0.000121
Total Jacobi run time: 21.3250 sec.
Average per-process communication time: 0.2794 sec.
Measured lattice updates: 78.66 GLU/s (total), 39.33 GLU/s (per process)
Measured FLOPS: 393.31 GFLOPS (total), 196.66 GFLOPS (per process)
Measured device bandwidth: 5.03 TB/s (total), 2.52 TB/s (per process)
This is the CUDA-aware MPI result, from 2 NVIDIA A100s on the same node:

Topology size: 2 x 1
Local domain size (current node): 20480 x 20480
Global domain size (all nodes): 40960 x 20480
Starting Jacobi run with 2 processes using "A100-SXM4-40GB" GPUs (ECC enabled: 2 / 2):
Iteration: 0 - Residue: 0.250000
Iteration: 100 - Residue: 0.002397
Iteration: 200 - Residue: 0.001204
Iteration: 300 - Residue: 0.000804
Iteration: 400 - Residue: 0.000603
Iteration: 500 - Residue: 0.000483
Iteration: 600 - Residue: 0.000403
Iteration: 700 - Residue: 0.000345
Iteration: 800 - Residue: 0.000302
Iteration: 900 - Residue: 0.000269
Iteration: 1000 - Residue: 0.000242
Iteration: 1100 - Residue: 0.000220
Iteration: 1200 - Residue: 0.000201
Iteration: 1300 - Residue: 0.000186
Iteration: 1400 - Residue: 0.000173
Iteration: 1500 - Residue: 0.000161
Iteration: 1600 - Residue: 0.000151
Iteration: 1700 - Residue: 0.000142
Iteration: 1800 - Residue: 0.000134
Iteration: 1900 - Residue: 0.000127
Stopped after 2000 iterations with residue 0.000121
Total Jacobi run time: 51.8048 sec.
Average per-process communication time: 4.4083 sec.
Measured lattice updates: 32.38 GLU/s (total), 16.19 GLU/s (per process)
Measured FLOPS: 161.90 GFLOPS (total), 80.95 GFLOPS (per process)
Measured device bandwidth: 2.07 TB/s (total), 1.04 TB/s (per process)
I ran both on the same node with the same GPUs. Because I submit jobs with sbatch, I changed the flag "ENV_LOCAL_RANK" to "SLURM_LOCALID"; I also tried "OMPI_COMM_WORLD_LOCAL_RANK" since I am using OpenMPI. Either way, CUDA-aware MPI was much slower than the normal one when the GPUs are on the same node (but when each GPU is on a different node, CUDA-aware MPI is a little faster than the normal one). Maybe I didn't activate CUDA-aware support?
Does anyone have an idea about this? Thanks a lot!
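On the question of whether CUDA-aware support was actually active: with Open MPI the build-time flag can be checked with "ompi_info --parsable --all | grep mpi_built_with_cuda_support", and the runtime can be queried with the Open MPI extension sketched below (MPIX_Query_cuda_support from mpi-ext.h). This is a generic check, not part of the Jacobi sample.

/* Sketch: report whether the Open MPI library in use advertises CUDA-aware
 * support, using the MPIX extension declared in <mpi-ext.h>. */
#include <mpi.h>
#include <mpi-ext.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
#if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
        printf("compile-time CUDA-aware support: yes\n");
        printf("run-time CUDA-aware support: %s\n",
               MPIX_Query_cuda_support() ? "yes" : "no");
#else
        printf("this Open MPI does not advertise CUDA-aware support\n");
#endif
    }

    MPI_Finalize();
    return 0;
}

Note that even when CUDA-aware support is active, an MPI may still stage intra-node traffic through host memory, which is what the profiling suggestion above is meant to reveal.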