The MPI examples use a very simple and naive distribution of the matrix among the processes. This can lead to a very bad communication pattern and thus to a significant increase in runtime. I highly suggest doing the matrix partitioning yourself rather than relying on the example code.
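To illustrate, here is a minimal sketch of a balanced 1D row partition (plain C++, independent of the rocALUTION API; global_nrow, rank and num_procs are placeholders for your own values). A real partitioner should also take the matrix structure into account to reduce communication volume:

// Balanced contiguous row blocks: the remainder rows are spread over the first ranks.
void balanced_row_range(long long global_nrow, int rank, int num_procs,
                        long long* row_begin, long long* row_end)
{
    long long chunk = global_nrow / num_procs;
    long long rest  = global_nrow % num_procs;

    // Ranks below 'rest' each own one extra row.
    *row_begin = rank * chunk + (rank < rest ? rank : rest);
    *row_end   = *row_begin + chunk + (rank < rest ? 1 : 0);
}

Rank r then owns the rows [row_begin, row_end) of the global matrix.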
Closing as there is nothing to fix in rocALUTION
Ok, thanks for the info. Our code uses a balanced partition, and for the moment, in our test case, the rocALUTION solver's performance doesn't vary much with the number of GPUs used on the node (but it doesn't degrade the way cg_mpi does). Still investigating...
Hello,
I'm sending you a modified test (cg_mpi.cpp) and data to investigate a performance issue we have here on a node with an AMD EPYC 7452 and 8 MI100 GPUs.
On 1 GPU, the performance is good, with a strong speedup (x30) compared to 1 CPU:
HIP_VISIBLE_DEVICES=1 srun --gres=gpu:2 --threads-per-core=1 -n 1 ./cg_mpi cg_mpi.mtx
Number of HIP devices in the system: 1
No OpenMP support
rocALUTION ver 2.1.2-985838b
rocALUTION platform is initialized
Accelerator backend: HIP
No OpenMP support
rocBLAS ver 2.41.0.
rocSPARSE ver 1.22.2-
Selected HIP device: 0
Device number: 0
Device name:
totalGlobalMem: 32752 MByte
clockRate: 1502000
compute capability: 9.0
MPI rank:0
MPI size:1
ReadFileMTX: filename=cg_mpi.mtx; reading...
ReadFileMTX: filename=cg_mpi.mtx; done
GlobalMatrix name=mat; rows=3068917; cols=3068917; nnz=35209033; prec=64bit; format=CSR/CSR; subdomains=1; host backend={CPU}; accelerator backend={HIP}; current=HIP
PCG solver starts, with preconditioner:
Jacobi preconditioner
IterationControl criteria: abs tol=1e-15; rel tol=1e-06; div tol=1e+08; min iter=100; max iter=1000000
IterationControl initial residual = 6.69104e+13
IterationControl iter=1; residual=0.0168304
IterationControl iter=2; residual=0.00471349
...
IterationControl iter=99; residual=0.000182203
IterationControl iter=100; residual=0.000192498
IterationControl RELATIVE criteria has been reached: res norm=0.000192498; rel val=2.87696e-18; iter=100
PCG ends
Solving: 0.158445 sec
||e - x||_2 = 1593.63
On 2 GPUs, the performance is not there (whereas with a C++ code using OpenMP offload, we get a ~2x speedup):
HIP_VISIBLE_DEVICES=1,2 srun --gres=gpu:3 --threads-per-core=1 -n 2 ./cg_mpi cg_mpi.mtx
Number of HIP devices in the system: 2
No OpenMP support
rocALUTION ver 2.1.2-985838b
rocALUTION platform is initialized
Accelerator backend: HIP
No OpenMP support
rocBLAS ver 2.41.0.
rocSPARSE ver 1.22.2-
Selected HIP device: 0
Device number: 0
Device name:
totalGlobalMem: 32752 MByte
clockRate: 1502000
compute capability: 9.0
Device number: 1
Device name:
totalGlobalMem: 32752 MByte
clockRate: 1502000
compute capability: 9.0
MPI rank:0
MPI size:2
ReadFileMTX: filename=cg_mpi.mtx; reading...
ReadFileMTX: filename=cg_mpi.mtx; done
GlobalMatrix name=mat; rows=3068917; cols=3068917; nnz=35209033; prec=64bit; format=CSR/COO; subdomains=2; host backend={CPU}; accelerator backend={HIP}; current=HIP
PCG solver starts, with preconditioner:
Jacobi preconditioner
IterationControl criteria: abs tol=1e-15; rel tol=1e-06; div tol=1e+08; min iter=100; max iter=1000000
IterationControl initial residual = 6.69104e+13
IterationControl iter=1; residual=0.0193062
IterationControl iter=2; residual=0.00469472
...
IterationControl iter=99; residual=0.000188159
IterationControl iter=100; residual=0.000191187
IterationControl RELATIVE criteria has been reached: res norm=0.000191187; rel val=2.85736e-18; iter=100
PCG ends
Solving: 0.562395 sec
||e - x||_2 = 1593.63
Is there something I'm missing to run/configure cg_mpi on several GPUs?
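(For context, this is roughly the per-rank device selection I would expect to need. It is only a sketch built from the rocALUTION initialization calls I know of, not the actual content of the modified cg_mpi.cpp, and devices_per_node is a placeholder that has to match what HIP_VISIBLE_DEVICES exposes.)

#include <mpi.h>
#include <rocalution/rocalution.hpp>

using namespace rocalution;

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Bind each MPI rank to its own GPU before initializing rocALUTION.
    int devices_per_node = 2; // placeholder for this 2-GPU run
    set_device_rocalution(rank % devices_per_node);
    init_rocalution();

    // ... read the matrix, distribute it and run the PCG solve as in cg_mpi.cpp ...

    stop_rocalution();
    MPI_Finalize();
    return 0;
}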
I can't provide you with the cg_mpi.mtx file (the file size is too big). Could I send it to you another way?
Thanks,