NVIDIA-developer-blog / code-samples

Source code examples from the Parallel Forall Blog
BSD 3-Clause "New" or "Revised" License

Error when using cuda-aware-mpi-example: bandwidth was wrong #41

Closed: Mountain-ql closed this issue 2 years ago

Mountain-ql commented 2 years ago

I tried to run jacobi_cuda_aware_mpi and jacobi_cuda_normal_mpi on an HPC system, using 2 A100 GPUs with 40 GB of memory each. The maximum GPU memory bandwidth is 1,555 GB/s, but the benchmark reported 2.52 TB/s. In addition, when I used GPUs on the same node, the bandwidth of the CUDA-aware run was lower than that of the normal one...

This is the normal MPI result, from 2 NVIDIA A100s on the same node:

```
Topology size: 2 x 1
Local domain size (current node): 20480 x 20480
Global domain size (all nodes): 40960 x 20480
normal-ID= 0
normal-ID= 1
Starting Jacobi run with 2 processes using "A100-SXM4-40GB" GPUs (ECC enabled: 2 / 2):
Iteration: 0 - Residue: 0.250000
Iteration: 100 - Residue: 0.002397
Iteration: 200 - Residue: 0.001204
Iteration: 300 - Residue: 0.000804
Iteration: 400 - Residue: 0.000603
Iteration: 500 - Residue: 0.000483
Iteration: 600 - Residue: 0.000403
Iteration: 700 - Residue: 0.000345
Iteration: 800 - Residue: 0.000302
Iteration: 900 - Residue: 0.000269
Iteration: 1000 - Residue: 0.000242
Iteration: 1100 - Residue: 0.000220
Iteration: 1200 - Residue: 0.000201
Iteration: 1300 - Residue: 0.000186
Iteration: 1400 - Residue: 0.000173
Iteration: 1500 - Residue: 0.000161
Iteration: 1600 - Residue: 0.000151
Iteration: 1700 - Residue: 0.000142
Iteration: 1800 - Residue: 0.000134
Iteration: 1900 - Residue: 0.000127
Stopped after 2000 iterations with residue 0.000121
Total Jacobi run time: 21.3250 sec.
Average per-process communication time: 0.2794 sec.
Measured lattice updates: 78.66 GLU/s (total), 39.33 GLU/s (per process)
Measured FLOPS: 393.31 GFLOPS (total), 196.66 GFLOPS (per process)
Measured device bandwidth: 5.03 TB/s (total), 2.52 TB/s (per process)
```

This is the CUDA-aware MPI result, from 2 NVIDIA A100s on the same node:

```
Topology size: 2 x 1
Local domain size (current node): 20480 x 20480
Global domain size (all nodes): 40960 x 20480
Starting Jacobi run with 2 processes using "A100-SXM4-40GB" GPUs (ECC enabled: 2 / 2):
Iteration: 0 - Residue: 0.250000
Iteration: 100 - Residue: 0.002397
Iteration: 200 - Residue: 0.001204
Iteration: 300 - Residue: 0.000804
Iteration: 400 - Residue: 0.000603
Iteration: 500 - Residue: 0.000483
Iteration: 600 - Residue: 0.000403
Iteration: 700 - Residue: 0.000345
Iteration: 800 - Residue: 0.000302
Iteration: 900 - Residue: 0.000269
Iteration: 1000 - Residue: 0.000242
Iteration: 1100 - Residue: 0.000220
Iteration: 1200 - Residue: 0.000201
Iteration: 1300 - Residue: 0.000186
Iteration: 1400 - Residue: 0.000173
Iteration: 1500 - Residue: 0.000161
Iteration: 1600 - Residue: 0.000151
Iteration: 1700 - Residue: 0.000142
Iteration: 1800 - Residue: 0.000134
Iteration: 1900 - Residue: 0.000127
Stopped after 2000 iterations with residue 0.000121
Total Jacobi run time: 51.8048 sec.
Average per-process communication time: 4.4083 sec.
Measured lattice updates: 32.38 GLU/s (total), 16.19 GLU/s (per process)
Measured FLOPS: 161.90 GFLOPS (total), 80.95 GFLOPS (per process)
Measured device bandwidth: 2.07 TB/s (total), 1.04 TB/s (per process)
```

I ran them on the same node with the same GPUs. Because I submit jobs through sbatch, I changed the flag "ENV_LOCAL_RANK" to "SLURM_LOCALID"; I also tried "OMPI_COMM_WORLD_LOCAL_RANK" because I use OpenMPI. Either way, the CUDA-aware MPI result was much slower than the normal one when the GPUs are on the same node (but when each GPU is on a different node, CUDA-aware MPI is slightly faster than the normal one). Maybe I didn't activate CUDA awareness?

Does anyone have an idea about this? Thanks a lot!
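For context, the local-rank device selection that ENV_LOCAL_RANK controls boils down to something like the sketch below (illustrative only, not the sample's exact code; the helper name is made up, and SLURM_LOCALID is just the variable I pass):

```c
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Sketch only: pick the GPU from a node-local rank environment variable
 * (e.g. SLURM_LOCALID or OMPI_COMM_WORLD_LOCAL_RANK). The variable name
 * must match what the launcher actually exports, otherwise every rank
 * ends up on GPU 0. */
static int select_gpu_from_local_rank(const char *env_name)
{
    const char *val = getenv(env_name);
    int local_rank = val ? atoi(val) : 0;

    int num_devices = 0;
    cudaGetDeviceCount(&num_devices);

    int device = local_rank % num_devices;  /* wrap if more ranks than GPUs */
    cudaSetDevice(device);
    printf("local rank %d -> GPU %d of %d\n", local_rank, device, num_devices);
    return device;
}

int main(void)
{
    select_gpu_from_local_rank("SLURM_LOCALID");  /* or OMPI_COMM_WORLD_LOCAL_RANK */
    return 0;
}
```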

harrism commented 2 years ago

@jirikraus can you take a look at this issue?

jirikraus commented 2 years ago

Thanks for making me aware, Mark. I would have missed this. I need to wrap up a few other things and will take a look at this later.

Mountain-ql commented 2 years ago

I found that the cause is the local domain size. Using the same hardware layout (4 nodes, 1 A100 GPU per node), a local domain size of 4096 gives a bandwidth of around 800 GB/s, while a local domain size of 20480 gives around 2.4 TB/s. Is there a problem with the bandwidth calculation?

jirikraus commented 2 years ago

Hi Mountain-ql, sorry for following up late. I have not had time to dig into this yet. I agree that something is off with the bandwidth calculation. Regarding the performance difference between CUDA-aware MPI and regular MPI, can you provide a few more details about your system? Which exact MPI are you using (exact version and how it was built), and what is the output of nvidia-smi topo -m on the system you are running on?

Mountain-ql commented 2 years ago

Sorry for the late reply. The MPI I used is OpenMPI/4.0.5; it is provided as a module on the HPC system, so I don't know how it was built. The output of "nvidia-smi topo -m" is:

```
        GPU0    GPU1    mlx5_0  mlx5_1  CPU Affinity    NUMA Affinity
GPU0     X      NV12    SYS     SYS     0               0-7
GPU1    NV12     X      SYS     SYS     0               0-7
mlx5_0  SYS     SYS      X      SYS
mlx5_1  SYS     SYS     SYS      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```

jirikraus commented 2 years ago

Thanks. Can you attach the output of ompi_info -c and ucx_info -b? That will provide the missing information about the MPI you are using.

Mountain-ql commented 2 years ago

Sorry for the late reply! Here is the output of "ompi_info -c":

```
Configured by: hpcglrun
Configured on: Wed Feb 17 12:42:06 CET 2021
Configure host: taurusi6395.taurus.hrsk.tu-dresden.de
Configure command line: '--prefix=/sw/installed/OpenMPI/4.0.5-gcccuda-2020b' '--build=x86_64-pc-linux-gnu' '--host=x86_64-pc-linux-gnu' '--with-slurm' '--with-pmi=/usr' '--with-pmi-libdir=/usr/lib64' '--with-knem=/opt/knem-1.1.3.90mlnx1' '--enable-mpirun-prefix-by-default' '--enable-shared' '--with-cuda=/sw/installed/CUDAcore/11.1.1' '--with-hwloc=/sw/installed/hwloc/2.2.0-GCCcore-10.2.0' '--with-libevent=/sw/installed/libevent/2.1.12-GCCcore-10.2.0' '--with-ofi=/sw/installed/libfabric/1.11.0-GCCcore-10.2.0' '--with-pmix=/sw/installed/PMIx/3.1.5-GCCcore-10.2.0' '--with-ucx=/sw/installed/UCX/1.9.0-GCCcore-10.2.0-CUDA-11.1.1' '--without-verbs'
Built by: hpcglrun
Built on: Wed Feb 17 12:50:42 CET 2021
Built host: taurusi6395.taurus.hrsk.tu-dresden.de
C bindings: yes
C++ bindings: no
Fort mpif.h: yes (all)
Fort use mpi: yes (full: ignore TKR)
Fort use mpi size: deprecated-ompi-info-value
Fort use mpi_f08: yes
Fort mpi_f08 compliance: The mpi_f08 module is available, but due to limitations in the gfortran compiler and/or Open MPI, does not support the following: array subsections, direct passthru (where possible) to underlying Open MPI's C functionality
Fort mpi_f08 subarrays: no
Java bindings: no
Wrapper compiler rpath: runpath
C compiler: gcc
C compiler absolute: /sw/installed/GCCcore/10.2.0/bin/gcc
C compiler family name: GNU
C compiler version: 10.2.0
C char size: 1
C bool size: 1
C short size: 2
C int size: 4
C long size: 8
C float size: 4
C double size: 8
C pointer size: 8
C char align: 1
C bool align: skipped
C int align: 4
C float align: 4
C double align: 8
C++ compiler: g++
C++ compiler absolute: /sw/installed/GCCcore/10.2.0/bin/g++
Fort compiler: gfortran
Fort compiler abs: /sw/installed/GCCcore/10.2.0/bin/gfortran
Fort ignore TKR: yes (!GCC$ ATTRIBUTES NO_ARG_CHECK ::)
Fort 08 assumed shape: yes
Fort optional args: yes
Fort INTERFACE: yes
Fort ISO_FORTRAN_ENV: yes
Fort STORAGE_SIZE: yes
Fort BIND(C) (all): yes
Fort ISO_C_BINDING: yes
Fort SUBROUTINE BIND(C): yes
Fort TYPE,BIND(C): yes
Fort T,BIND(C,name="a"): yes
Fort PRIVATE: yes
Fort PROTECTED: yes
Fort ABSTRACT: yes
Fort ASYNCHRONOUS: yes
Fort PROCEDURE: yes
Fort USE...ONLY: yes
Fort C_FUNLOC: yes
Fort f08 using wrappers: yes
Fort MPI_SIZEOF: yes
Fort integer size: 4
Fort logical size: 4
Fort logical value true: 1
Fort have integer1: yes
Fort have integer2: yes
Fort have integer4: yes
Fort have integer8: yes
Fort have integer16: no
Fort have real4: yes
Fort have real8: yes
Fort have real16: yes
Fort have complex8: yes
Fort have complex16: yes
Fort have complex32: yes
Fort integer1 size: 1
Fort integer2 size: 2
Fort integer4 size: 4
Fort integer8 size: 8
Fort integer16 size: -1
Fort real size: 4
Fort real4 size: 4
Fort real8 size: 8
Fort real16 size: 16
Fort dbl prec size: 8
Fort cplx size: 8
Fort dbl cplx size: 16
Fort cplx8 size: 8
Fort cplx16 size: 16
Fort cplx32 size: 32
Fort integer align: 4
Fort integer1 align: 1
Fort integer2 align: 2
Fort integer4 align: 4
Fort integer8 align: 8
Fort integer16 align: -1
Fort real align: 4
Fort real4 align: 4
Fort real8 align: 8
Fort real16 align: 16
Fort dbl prec align: 8
Fort cplx align: 4
Fort dbl cplx align: 8
Fort cplx8 align: 4
Fort cplx16 align: 8
Fort cplx32 align: 16
C profiling: yes
C++ profiling: no
Fort mpif.h profiling: yes
Fort use mpi profiling: yes
Fort use mpi_f08 prof: yes
C++ exceptions: no
Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes, OMPI progress: no, ORTE progress: yes, Event lib: yes)
Sparse Groups: no
Build CFLAGS: -DNDEBUG -O3 -march=native -fno-math-errno -finline-functions -fno-strict-aliasing
Build CXXFLAGS: -DNDEBUG -O3 -march=native -fno-math-errno -finline-functions
Build FCFLAGS: -O3 -march=native -fno-math-errno
Build LDFLAGS: -L/sw/installed/PMIx/3.1.5-GCCcore-10.2.0/lib64 -L/sw/installed/PMIx/3.1.5-GCCcore-10.2.0/lib -L/sw/installed/libfabric/1.11.0-GCCcore-10.2.0/lib64 -L/sw/installed/libfabric/1.11.0-GCCcore-10.2.0/lib -L/sw/installed/UCX/1.9.0-GCCcore-10.2.0-CUDA-11.1.1/lib64 -L/sw/installed/UCX/1.9.0-GCCcore-10.2.0-CUDA-11.1.1/lib -L/sw/installed/libevent/2.1.12-GCCcore-10.2.0/lib64 -L/sw/installed/libevent/2.1.12-GCCcore-10.2.0/lib -L/sw/installed/hwloc/2.2.0-GCCcore-10.2.0/lib64 -L/sw/installed/hwloc/2.2.0-GCCcore-10.2.0/lib -L/sw/installed/zlib/1.2.11-GCCcore-10.2.0/lib64 -L/sw/installed/zlib/1.2.11-GCCcore-10.2.0/lib -L/sw/installed/GCCcore/10.2.0/lib64 -L/sw/installed/GCCcore/10.2.0/lib -L/sw/installed/CUDAcore/11.1.1/lib64 -L/sw/installed/hwloc/2.2.0-GCCcore-10.2.0/lib -L/sw/installed/libevent/2.1.12-GCCcore-10.2.0/lib64
Build LIBS: -lutil -lm -lrt -lcudart -lpthread -lz -lhwloc -levent_core -levent_pthreads
Wrapper extra CFLAGS:
Wrapper extra CXXFLAGS:
Wrapper extra FCFLAGS: -I${libdir}
Wrapper extra LDFLAGS: -L/sw/installed/hwloc/2.2.0-GCCcore-10.2.0/lib -L/sw/installed/libevent/2.1.12-GCCcore-10.2.0/lib64 -Wl,-rpath -Wl,/sw/installed/hwloc/2.2.0-GCCcore-10.2.0/lib -Wl,-rpath -Wl,/sw/installed/libevent/2.1.12-GCCcore-10.2.0/lib64 -Wl,-rpath -Wl,@{libdir} -Wl,--enable-new-dtags
Wrapper extra LIBS: -lhwloc -ldl -levent_core -levent_pthreads -lutil -lm -lrt -lcudart -lpthread -lz
Internal debug support: no
MPI interface warnings: yes
MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
dl support: yes
Heterogeneous support: no
mpirun default --prefix: yes
MPI_WTIME support: native
Symbol vis. support: yes
Host topology support: yes
IPv6 support: no
MPI1 compatibility: no
MPI extensions: affinity, cuda, pcollreq
FT Checkpoint support: no (checkpoint thread: no)
C/R Enabled Debugging: no
MPI_MAX_PROCESSOR_NAME: 256
MPI_MAX_ERROR_STRING: 256
MPI_MAX_OBJECT_NAME: 64
MPI_MAX_INFO_KEY: 36
MPI_MAX_INFO_VAL: 256
MPI_MAX_PORT_NAME: 1024
MPI_MAX_DATAREP_STRING: 128
```

Here is the output of "ucx_info -b":

```
#define UCX_CONFIG_H
#define ENABLE_BUILTIN_MEMCPY 1
#define ENABLE_DEBUG_DATA 0
#define ENABLE_MT 1
#define ENABLE_PARAMS_CHECK 0
#define ENABLE_SYMBOL_OVERRIDE 1
#define HAVE_1_ARG_BFD_SECTION_SIZE 1
#define HAVE_ALLOCA 1
#define HAVE_ALLOCA_H 1
#define HAVE_ATTRIBUTE_NOOPTIMIZE 1
#define HAVE_CLEARENV 1
#define HAVE_CPLUS_DEMANGLE 1
#define HAVE_CPU_SET_T 1
#define HAVE_CUDA 1
#define HAVE_CUDA_H 1
#define HAVE_CUDA_RUNTIME_H 1
#define HAVE_DC_EXP 1
#define HAVE_DECL_ASPRINTF 1
#define HAVE_DECL_BASENAME 1
#define HAVE_DECL_BFD_GET_SECTION_FLAGS 0
#define HAVE_DECL_BFD_GET_SECTION_VMA 0
#define HAVE_DECL_BFD_SECTION_FLAGS 1
#define HAVE_DECL_BFD_SECTION_VMA 1
#define HAVE_DECL_CPU_ISSET 1
#define HAVE_DECL_CPU_ZERO 1
#define HAVE_DECL_ETHTOOL_CMD_SPEED 1
#define HAVE_DECL_FMEMOPEN 1
#define HAVE_DECL_F_SETOWN_EX 1
#define HAVE_DECL_GDR_COPY_TO_MAPPING 1
#define HAVE_DECL_IBV_ACCESS_ON_DEMAND 1
#define HAVE_DECL_IBV_ACCESS_RELAXED_ORDERING 0
#define HAVE_DECL_IBV_ADVISE_MR 0
#define HAVE_DECL_IBV_ALLOC_DM 0
#define HAVE_DECL_IBV_ALLOC_TD 0
#define HAVE_DECL_IBV_CMD_MODIFY_QP 1
#define HAVE_DECL_IBV_CREATE_CQ_ATTR_IGNORE_OVERRUN 0
#define HAVE_DECL_IBV_CREATE_QP_EX 1
#define HAVE_DECL_IBV_CREATE_SRQ 1
#define HAVE_DECL_IBV_CREATE_SRQ_EX 1
#define HAVE_DECL_IBV_EVENT_GID_CHANGE 1
#define HAVE_DECL_IBV_EVENT_TYPE_STR 1
#define HAVE_DECL_IBV_EXP_ACCESS_ALLOCATE_MR 1
#define HAVE_DECL_IBV_EXP_ACCESS_ON_DEMAND 1
#define HAVE_DECL_IBV_EXP_ALLOC_DM 1
#define HAVE_DECL_IBV_EXP_ATOMIC_HCA_REPLY_BE 1
#define HAVE_DECL_IBV_EXP_CQ_IGNORE_OVERRUN 1
#define HAVE_DECL_IBV_EXP_CQ_MODERATION 1
#define HAVE_DECL_IBV_EXP_CREATE_QP 1
#define HAVE_DECL_IBV_EXP_CREATE_RES_DOMAIN 1
#define HAVE_DECL_IBV_EXP_CREATE_SRQ 1
#define HAVE_DECL_IBV_EXP_DCT_OOO_RW_DATA_PLACEMENT 1
#define HAVE_DECL_IBV_EXP_DESTROY_RES_DOMAIN 1
#define HAVE_DECL_IBV_EXP_DEVICE_ATTR_PCI_ATOMIC_CAPS 1
#define HAVE_DECL_IBV_EXP_DEVICE_ATTR_RESERVED_2 1
#define HAVE_DECL_IBV_EXP_DEVICE_DC_TRANSPORT 1
#define HAVE_DECL_IBV_EXP_DEVICE_MR_ALLOCATE 1
#define HAVE_DECL_IBV_EXP_MR_FIXED_BUFFER_SIZE 1
#define HAVE_DECL_IBV_EXP_MR_INDIRECT_KLMS 1
#define HAVE_DECL_IBV_EXP_ODP_SUPPORT_IMPLICIT 1
#define HAVE_DECL_IBV_EXP_POST_SEND 1
#define HAVE_DECL_IBV_EXP_PREFETCH_MR 1
#define HAVE_DECL_IBV_EXP_PREFETCH_WRITE_ACCESS 1
#define HAVE_DECL_IBV_EXP_QPT_DC_INI 1
#define HAVE_DECL_IBV_EXP_QP_CREATE_UMR 1
#define HAVE_DECL_IBV_EXP_QP_INIT_ATTR_ATOMICS_ARG 1
#define HAVE_DECL_IBV_EXP_QP_INIT_ATTR_RES_DOMAIN 1
#define HAVE_DECL_IBV_EXP_QP_OOO_RW_DATA_PLACEMENT 1
#define HAVE_DECL_IBV_EXP_QUERY_DEVICE 1
#define HAVE_DECL_IBV_EXP_QUERY_GID_ATTR 1
#define HAVE_DECL_IBV_EXP_REG_MR 1
#define HAVE_DECL_IBV_EXP_RES_DOMAIN_THREAD_MODEL 1
#define HAVE_DECL_IBV_EXP_SEND_EXT_ATOMIC_INLINE 1
#define HAVE_DECL_IBV_EXP_SETENV 1
#define HAVE_DECL_IBV_EXP_WR_EXT_MASKED_ATOMIC_CMP_AND_SWP 1
#define HAVE_DECL_IBV_EXP_WR_EXT_MASKED_ATOMIC_FETCH_AND_ADD 1
#define HAVE_DECL_IBV_EXP_WR_NOP 1
#define HAVE_DECL_IBV_GET_ASYNC_EVENT 1
#define HAVE_DECL_IBV_GET_DEVICE_NAME 1
#define HAVE_DECL_IBV_LINK_LAYER_ETHERNET 1
#define HAVE_DECL_IBV_LINK_LAYER_INFINIBAND 1
#define HAVE_DECL_IBV_MLX5_EXP_GET_CQ_INFO 1
#define HAVE_DECL_IBV_MLX5_EXP_GET_QP_INFO 1
#define HAVE_DECL_IBV_MLX5_EXP_GET_SRQ_INFO 1
#define HAVE_DECL_IBV_MLX5_EXP_UPDATE_CQ_CI 1
#define HAVE_DECL_IBV_ODP_SUPPORT_IMPLICIT 0
#define HAVE_DECL_IBV_QPF_GRH_REQUIRED 0
#define HAVE_DECL_IBV_QUERY_DEVICE_EX 1
#define HAVE_DECL_IBV_QUERY_GID 1
#define HAVE_DECL_IBV_WC_STATUS_STR 1
#define HAVE_DECL_MADV_FREE 0
#define HAVE_DECL_MADV_REMOVE 1
#define HAVE_DECL_MLX5DV_CQ_INIT_ATTR_MASK_CQE_SIZE 0
#define HAVE_DECL_MLX5DV_CREATE_QP 0
#define HAVE_DECL_MLX5DV_DCTYPE_DCT 0
#define HAVE_DECL_MLX5DV_DEVX_SUBSCRIBE_DEVX_EVENT 0
#define HAVE_DECL_MLX5DV_INIT_OBJ 1
#define HAVE_DECL_MLX5DV_IS_SUPPORTED 0
#define HAVE_DECL_MLX5DV_OBJ_AH 0
#define HAVE_DECL_MLX5DV_QP_CREATE_ALLOW_SCATTER_TO_CQE 0
#define HAVE_DECL_MLX5_WQE_CTRL_SOLICITED 1
#define HAVE_DECL_POSIX_MADV_DONTNEED 1
#define HAVE_DECL_PR_SET_PTRACER 1
#define HAVE_DECL_RDMA_ESTABLISH 1
#define HAVE_DECL_RDMA_INIT_QP_ATTR 1
#define HAVE_DECL_SPEED_UNKNOWN 1
#define HAVE_DECL_STRERROR_R 1
#define HAVE_DECL_SYS_BRK 1
#define HAVE_DECL_SYS_IPC 0
#define HAVE_DECL_SYS_MADVISE 1
#define HAVE_DECL_SYS_MMAP 1
#define HAVE_DECL_SYS_MREMAP 1
#define HAVE_DECL_SYS_MUNMAP 1
#define HAVE_DECL_SYS_SHMAT 1
#define HAVE_DECL_SYS_SHMDT 1
#define HAVE_DECL___PPC_GET_TIMEBASE_FREQ 0
#define HAVE_DETAILED_BACKTRACE 1
#define HAVE_DLFCN_H 1
#define HAVE_EXP_UMR 1
#define HAVE_EXP_UMR_KSM 1
#define HAVE_GDRAPI_H 1
#define HAVE_HW_TIMER 1
#define HAVE_IB 1
#define HAVE_IBV_DM 1
#define HAVE_IBV_EXP_DM 1
#define HAVE_IBV_EXP_QP_CREATE_UMR 1
#define HAVE_IBV_EXP_RES_DOMAIN 1
#define HAVE_IB_EXT_ATOMICS 1
#define HAVE_IN6_ADDR_S6_ADDR32 1
#define HAVE_INFINIBAND_MLX5DV_H 1
#define HAVE_INFINIBAND_MLX5_HW_H 1
#define HAVE_INTTYPES_H 1
#define HAVE_IP_IP_DST 1
#define HAVE_LIBGEN_H 1
#define HAVE_LIBRT 1
#define HAVE_LINUX_FUTEX_H 1
#define HAVE_LINUX_IP_H 1
#define HAVE_LINUX_MMAN_H 1
#define HAVE_MALLOC_GET_STATE 1
#define HAVE_MALLOC_H 1
#define HAVE_MALLOC_HOOK 1
#define HAVE_MALLOC_SET_STATE 1
#define HAVE_MALLOC_TRIM 1
#define HAVE_MASKED_ATOMICS_ENDIANNESS 1
#define HAVE_MEMALIGN 1
#define HAVE_MEMORY_H 1
#define HAVE_MLX5_HW 1
#define HAVE_MLX5_HW_UD 1
#define HAVE_MREMAP 1
#define HAVE_NETINET_IP_H 1
#define HAVE_NET_ETHERNET_H 1
#define HAVE_NUMA 1
#define HAVE_NUMAIF_H 1
#define HAVE_NUMA_H 1
#define HAVE_ODP 1
#define HAVE_ODP_IMPLICIT 1
#define HAVE_POSIX_MEMALIGN 1
#define HAVE_PREFETCH 1
#define HAVE_RDMACM_QP_LESS 1
#define HAVE_SCHED_GETAFFINITY 1
#define HAVE_SCHED_SETAFFINITY 1
#define HAVE_SIGACTION_SA_RESTORER 1
#define HAVE_SIGEVENT_SIGEV_UN_TID 1
#define HAVE_SIGHANDLER_T 1
#define HAVE_STDINT_H 1
#define HAVE_STDLIB_H 1
#define HAVE_STRERROR_R 1
#define HAVE_STRINGS_H 1
#define HAVE_STRING_H 1
#define HAVE_STRUCT_BITMASK 1
#define HAVE_STRUCT_DL_PHDR_INFO 1
#define HAVE_STRUCT_IBV_ASYNC_EVENT_ELEMENT_DCT 1
#define HAVE_STRUCT_IBV_EXP_CREATE_SRQ_ATTR_DC_OFFLOAD_PARAMS 1
#define HAVE_STRUCT_IBV_EXP_DEVICE_ATTR_EXP_DEVICE_CAP_FLAGS 1
#define HAVE_STRUCT_IBV_EXP_DEVICE_ATTR_ODP_CAPS 1
#define HAVE_STRUCT_IBV_EXP_DEVICE_ATTR_ODP_CAPS_PER_TRANSPORT_CAPS_DC_ODP_CAPS 1
#define HAVE_STRUCT_IBV_EXP_DEVICE_ATTR_ODP_MR_MAX_SIZE 1
#define HAVE_STRUCT_IBV_EXP_QP_INIT_ATTR_MAX_INL_RECV 1
#define HAVE_STRUCT_IBV_MLX5_QP_INFO_BF_NEED_LOCK 1
#define HAVE_STRUCT_MLX5DV_CQ_CQ_UAR 1
#define HAVE_STRUCT_MLX5_AH_IBV_AH 1
#define HAVE_STRUCT_MLX5_CQE64_IB_STRIDE_INDEX 1
#define HAVE_STRUCT_MLX5_GRH_AV_RMAC 1
#define HAVE_STRUCT_MLX5_SRQ_CMD_QP 1
#define HAVE_STRUCT_MLX5_WQE_AV_BASE 1
#define HAVE_SYS_EPOLL_H 1
#define HAVE_SYS_EVENTFD_H 1
#define HAVE_SYS_STAT_H 1
#define HAVE_SYS_TYPES_H 1
#define HAVE_SYS_UIO_H 1
#define HAVE_TL_DC 1
#define HAVE_TL_RC 1
#define HAVE_TL_UD 1
#define HAVE_UCM_PTMALLOC286 1
#define HAVE_UNISTD_H 1
#define HAVE_VERBS_EXP_H 1
#define HAVE___CLEAR_CACHE 1
#define HAVE___CURBRK 1
#define HAVE___SIGHANDLER_T 1
#define IBV_HW_TM 1
#define LT_OBJDIR ".libs/"
#define NVALGRIND 1
#define PACKAGE "ucx"
#define PACKAGE_BUGREPORT ""
#define PACKAGE_NAME "ucx"
#define PACKAGE_STRING "ucx 1.9"
#define PACKAGE_TARNAME "ucx"
#define PACKAGE_URL ""
#define PACKAGE_VERSION "1.9"
#define STDC_HEADERS 1
#define STRERROR_R_CHAR_P 1
#define UCM_BISTRO_HOOKS 1
#define UCS_MAX_LOG_LEVEL UCS_LOG_LEVEL_INFO
#define UCT_UD_EP_DEBUG_HOOKS 0
#define UCX_CONFIGURE_FLAGS "--disable-logging --disable-debug --disable-assertions --disable-params-check --prefix=/sw/installed/UCX/1.9.0-GCCcore-10.2.0-CUDA-11.1.1 --enable-optimizations --enable-cma --enable-mt --with-verbs --without-java --disable-doxygen-doc --with-cuda=/sw/installed/CUDAcore/11.1.1 --with-gdrcopy=/sw/installed/GDRCopy/2.1-GCCcore-10.2.0-CUDA-11.1.1"
#define UCX_MODULE_SUBDIR "ucx"
#define VERSION "1.9"
#define restrict __restrict
#define test_MODULES ":module"
#define ucm_MODULES ":cuda"
#define uct_MODULES ":cuda:ib:rdmacm:cma"
#define uct_cuda_MODULES ":gdrcopy"
#define uct_ib_MODULES ":cm"
#define uct_rocm_MODULES ""
#define ucx_perftest_MODULES ":cuda"
```

jirikraus commented 2 years ago

Thanks. I can't spot anything wrong with your software setup. Since the performance difference between CUDA-aware MPI and regular MPI on a single node is about 2x, while CUDA-aware MPI is faster for 2 processes on two nodes, I suspect an issue with the GPU affinity handling, i.e. ENV_LOCAL_RANK defined the wrong way (but you seem to have that right) or CUDA_VISIBLE_DEVICES being set in an unusual way on the system you are using. As this code has not been updated for quite some time, can you try https://github.com/NVIDIA/multi-gpu-programming-models (also a Jacobi solver, but a simpler code that I regularly use in tutorials)? I also checked the math for the bandwidth: the formula used does not consider caches, see https://github.com/NVIDIA-developer-blog/code-samples/blob/master/posts/cuda-aware-mpi-example/src/Host.c#L291, which explains why you are seeing memory bandwidths that are too large.
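To illustrate what that means: the reported "device bandwidth" is a nominal byte count per lattice update multiplied by the measured update rate. Working backwards from the log above (39.33 GLU/s and 2.52 TB/s per process) suggests roughly 64 bytes, i.e. 8 double-precision accesses, are counted per update; with a 20480 x 20480 local domain most of those accesses are actually served from the caches, so the nominal figure can exceed the A100's ~1.55 TB/s HBM peak. A minimal sketch of this kind of estimate (not the literal code in Host.c):

```c
#include <stdio.h>

/* Illustrative only: a nominal-traffic bandwidth estimate of the kind used by
 * stencil benchmarks. The byte count assumes every counted access goes to
 * DRAM; when neighbouring values are reused from cache, the real HBM traffic
 * is much lower, so the estimate overshoots the hardware peak. */
static double nominal_bandwidth_tbs(double glups, double bytes_per_update)
{
    return glups * 1e9 * bytes_per_update / 1e12;
}

int main(void)
{
    /* 39.33 GLU/s per process (from the log above), ~64 bytes per update */
    printf("%.2f TB/s\n", nominal_bandwidth_tbs(39.33, 64.0));  /* ~2.52 TB/s */
    return 0;
}
```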

Mountain-ql commented 2 years ago

Thanks a lot!!

jirikraus commented 2 years ago

Thanks for the feedback. Closing this as it does not seem to be an issue with the code.

Shihab-Shahriar commented 3 months ago

I want to add a data point: I am also finding the CUDA-aware version to be slightly slower than the normal version.

Similar setup: a single node with 2 A100 GPUs and 2 ranks. The results are similar for a 2x2 topology.

Normal:

```
Measured lattice updates: 48.04 GLU/s (total), 24.02 GLU/s (per process)
Measured FLOPS: 240.19 GFLOPS (total), 120.10 GFLOPS (per process)
Measured device bandwidth: 3.07 TB/s (total), 1.54 TB/s (per process)
```

CUDA-aware:

```
Measured lattice updates: 46.83 GLU/s (total), 23.41 GLU/s (per process)
Measured FLOPS: 234.13 GFLOPS (total), 117.07 GFLOPS (per process)
Measured device bandwidth: 3.00 TB/s (total), 1.50 TB/s (per process)
```

Please note that to make this code work, I had to change it so that the device is initialized after MPI_Init (I am using OpenMPI 5.0.3). The MPI has been compiled with CUDA support and OpenFabrics (OFI).

I set CUDA_VISIBLE_DEVICES=0,1. I run "watch -n .5 nvidia-smi" while the job is running, and it uses only the first two devices.
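For reference, the initialization order I mean is roughly the following sketch (my own illustration, not the sample's code): derive the node-local rank from a shared-memory communicator after MPI_Init and bind the GPU from that, assuming one rank per GPU:

```c
#include <mpi.h>
#include <cuda_runtime.h>

/* Sketch of "initialize the device after MPI_Init": get a node-local rank
 * from MPI_COMM_TYPE_SHARED instead of an environment variable, then bind
 * the GPU. Assumes one rank per GPU per node. */
int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);

    int local_rank = 0, num_devices = 0;
    MPI_Comm_rank(node_comm, &local_rank);
    cudaGetDeviceCount(&num_devices);
    cudaSetDevice(local_rank % num_devices);

    /* ... allocate device buffers and run the Jacobi iterations here ... */

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```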

jirikraus commented 3 months ago

@Shihab-Shahriar have you profiled the code with Nsight Systems and checked what happens during the MPI calls? I am wondering whether, for the communication between the two A100s in the node, NVLink (i.e. GPUDirect P2P via CUDA IPC) is not being used for some reason. If that is the case, CUDA-aware MPI would stage through CPU memory, in which case it could be expected that normal MPI is faster.

An example of how to do this can be found here: https://github.com/NVIDIA/multi-gpu-programming-models/blob/master/mpi/Makefile#L45
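As a quick sanity check alongside the profile, you could also verify directly that peer access between the two devices is reported as available; a minimal, standalone sketch using the CUDA runtime P2P query (not part of the sample):

```c
#include <stdio.h>
#include <cuda_runtime.h>

/* Minimal check, independent of the Jacobi sample: if GPU 0 and GPU 1 cannot
 * access each other's memory, GPUDirect P2P (and hence NVLink transfers
 * inside CUDA-aware MPI) is unavailable in this configuration, and the MPI
 * library will stage the halo exchange through host memory instead. */
int main(void)
{
    int num_devices = 0;
    cudaGetDeviceCount(&num_devices);
    if (num_devices < 2) {
        printf("fewer than 2 visible devices\n");
        return 1;
    }

    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    printf("P2P access 0->1: %s, 1->0: %s\n",
           can01 ? "yes" : "no", can10 ? "yes" : "no");
    return 0;
}
```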