The MPI examples use a very simple and naive distribution of the matrix among the processes. This can lead to a very bad communication pattern and thus to a significant increase in runtime. I highly suggest doing the matrix partitioning yourself rather than relying on the example code.
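To illustrate, here is a minimal sketch of a balanced 1D row partition (plain C++, independent of the rocALUTION API; global_nrow, rank and num_procs are placeholders for your own values). A real partitioner should also take the matrix structure into account to reduce communication volume:

// Balanced contiguous row blocks: the remainder rows are spread over the first ranks.
void balanced_row_range(long long global_nrow, int rank, int num_procs,
                        long long* row_begin, long long* row_end)
{
    long long chunk = global_nrow / num_procs;
    long long rest  = global_nrow % num_procs;

    // Ranks below 'rest' each own one extra row.
    *row_begin = rank * chunk + (rank < rest ? rank : rest);
    *row_end   = *row_begin + chunk + (rank < rest ? 1 : 0);
}

Rank r then owns the rows [row_begin, row_end) of the global matrix.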
Closing as there is nothing to fix in rocALUTION
Ok, thanks for the info. Our code uses a balanced partition, and for the moment, in our test case, the rocALUTION solver's performance doesn't vary much with the number of GPUs used on the node (but it doesn't degrade the way cg_mpi does). Still investigating...
Hello,
I'm sending you a modified test (cg_mpi.cpp) and data to investigate a performance issue we have here on a node with an AMD EPYC 7452 and 8 MI100 GPUs.
On 1 GPU, the performance is good, with a strong speedup (x30) compared to 1 CPU:
HIP_VISIBLE_DEVICES=1 srun --gres=gpu:2 --threads-per-core=1 -n 1 ./cg_mpi cg_mpi.mtx
Number of HIP devices in the system: 1
No OpenMP support
rocALUTION ver 2.1.2-985838b
rocALUTION platform is initialized
Accelerator backend: HIP
No OpenMP support
rocBLAS ver 2.41.0.
rocSPARSE ver 1.22.2-
Selected HIP device: 0
Device number: 0
Device name:
totalGlobalMem: 32752 MByte
clockRate: 1502000
compute capability: 9.0
MPI rank:0
MPI size:1
ReadFileMTX: filename=cg_mpi.mtx; reading...
ReadFileMTX: filename=cg_mpi.mtx; done
GlobalMatrix name=mat; rows=3068917; cols=3068917; nnz=35209033; prec=64bit; format=CSR/CSR; subdomains=1; host backend={CPU}; accelerator backend={HIP}; current=HIP
PCG solver starts, with preconditioner:
Jacobi preconditioner
IterationControl criteria: abs tol=1e-15; rel tol=1e-06; div tol=1e+08; min iter=100; max iter=1000000
IterationControl initial residual = 6.69104e+13
IterationControl iter=1; residual=0.0168304
IterationControl iter=2; residual=0.00471349
...
IterationControl iter=99; residual=0.000182203
IterationControl iter=100; residual=0.000192498
IterationControl RELATIVE criteria has been reached: res norm=0.000192498; rel val=2.87696e-18; iter=100
PCG ends
Solving: 0.158445 sec
||e - x||_2 = 1593.63
On 2 GPUs, the performance is not there (whereas with a C++ code using OpenMP offload, we get a ~2x speedup):
HIP_VISIBLE_DEVICES=1,2 srun --gres=gpu:3 --threads-per-core=1 -n 2 ./cg_mpi cg_mpi.mtx
Number of HIP devices in the system: 2
No OpenMP support
rocALUTION ver 2.1.2-985838b
rocALUTION platform is initialized
Accelerator backend: HIP
No OpenMP support
rocBLAS ver 2.41.0.
rocSPARSE ver 1.22.2-
Selected HIP device: 0
Device number: 0
Device name:
totalGlobalMem: 32752 MByte
clockRate: 1502000
compute capability: 9.0
Device number: 1
Device name:
totalGlobalMem: 32752 MByte
clockRate: 1502000
compute capability: 9.0
MPI rank:0
MPI size:2
ReadFileMTX: filename=cg_mpi.mtx; reading...
ReadFileMTX: filename=cg_mpi.mtx; done
GlobalMatrix name=mat; rows=3068917; cols=3068917; nnz=35209033; prec=64bit; format=CSR/COO; subdomains=2; host backend={CPU}; accelerator backend={HIP}; current=HIP
PCG solver starts, with preconditioner:
Jacobi preconditioner
IterationControl criteria: abs tol=1e-15; rel tol=1e-06; div tol=1e+08; min iter=100; max iter=1000000
IterationControl initial residual = 6.69104e+13
IterationControl iter=1; residual=0.0193062
IterationControl iter=2; residual=0.00469472
...
IterationControl iter=99; residual=0.000188159
IterationControl iter=100; residual=0.000191187
IterationControl RELATIVE criteria has been reached: res norm=0.000191187; rel val=2.85736e-18; iter=100
PCG ends
Solving: 0.562395 sec
||e - x||_2 = 1593.63
Is there something I'm missing to run/configure cg_mpi on several GPUs?
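(For context, this is roughly the per-rank device selection I would expect to need. It is only a sketch built from the rocALUTION initialization calls I know of, not the actual content of the modified cg_mpi.cpp, and devices_per_node is a placeholder that has to match what HIP_VISIBLE_DEVICES exposes.)

#include <mpi.h>
#include <rocalution/rocalution.hpp>

using namespace rocalution;

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Bind each MPI rank to its own GPU before initializing rocALUTION.
    int devices_per_node = 2; // placeholder for this 2-GPU run
    set_device_rocalution(rank % devices_per_node);
    init_rocalution();

    // ... read the matrix, distribute it and run the PCG solve as in cg_mpi.cpp ...

    stop_rocalution();
    MPI_Finalize();
    return 0;
}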
I can't provide you with the cg_mpi.mtx file (the file size is too big). Could I send it to you another way?
Thanks,