ROCm / rocALUTION

Next-generation library of iterative sparse solvers for the ROCm platform
https://rocm.docs.amd.com/projects/rocALUTION/en/latest/
MIT License

PairwiseAMG crash in parallel #195

Closed pledac closed 6 months ago

pledac commented 9 months ago

Hello, I am experiencing a crash with PairwiseAMG when used as a CG preconditioner in parallel (I am using rocALUTION version f72a3919b52b on CPU).

I reproduced the crash with the cg-amg_mpi sample and the gr_30_30.mtx matrix (it works with fewer MPI ranks):

$ mpirun -np 31 /export/home/catA/pl254994/trust/amgx_openmp/lib/src/LIBROCALUTION/clients/staging/cg-amg_mpi gr_30_30.mtx
No OpenMP support
rocALUTION ver 3.0.3-59debfadc-dirty
rocALUTION platform is initialized
Accelerator backend: None
No OpenMP support
MPI rank: 0
MPI size: 31
ReadFileMTX: filename=gr_30_30.mtx; reading...
ReadFileMTX: filename=gr_30_30.mtx; done
double free or corruption (out)
double free or corruption (out)
double free or corruption (out)
double free or corruption (out)
double free or corruption (out)

On my own matrix (2,592,000 rows) in my code, it crashes above 7 MPI ranks (C-AMG and SA-AMG work fine):

....
[rocALUTION] Time to convert TRUST matrix: 0.509774
[rocALUTION] Build a matrix with local N= 324001 and local nnz=1598762
[rocALUTION] Time to build matrix: 0.046605
GlobalMatrix name=mat; rows=2592000; cols=2592000; nnz=12867840; prec=64bit; format=CSR(32,32)/COO; subdomains=8; host backend={CPU}; accelerator backend={None}; current=CPU
[rocALUTION] Time to copy matrix on device: 1.4e-05
munmap_chunk(): invalid pointer

Thanks for your help.

ntrost57 commented 9 months ago

I can reproduce it. We will look into it, thanks!

ntrost57 commented 6 months ago

With the recent release of multi-node / multi-GPU support of the superior (un)smoothed aggregation and classic AMG, we are going to remove PairwiseAMG in one of the future releases.
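For anyone migrating away from PairwiseAMG, the change on the caller's side is essentially swapping the preconditioner class. Below is a minimal sketch of a CG solve preconditioned with smoothed-aggregation AMG, modeled on the library's cg-amg sample; the matrix filename, vector setup, and exact template parameters are illustrative assumptions, so check them against the rocALUTION headers of your release:

```cpp
#include <rocalution/rocalution.hpp>

using namespace rocalution;

int main(int argc, char* argv[])
{
    // Initialize the rocALUTION platform (falls back to the host
    // backend when no accelerator is available).
    init_rocalution();

    LocalMatrix<double> mat;
    LocalVector<double> rhs, x;

    // Read the system matrix in MatrixMarket format,
    // e.g. gr_30_30.mtx from the issue above.
    mat.ReadFileMTX(argv[1]);

    rhs.Allocate("rhs", mat.GetM());
    x.Allocate("x", mat.GetN());
    rhs.Ones();
    x.Zeros();

    // CG solver preconditioned with smoothed-aggregation AMG
    // (SAAMG) in place of the PairwiseAMG being removed.
    CG<LocalMatrix<double>, LocalVector<double>, double> ls;
    SAAMG<LocalMatrix<double>, LocalVector<double>, double> p;

    ls.SetOperator(mat);
    ls.SetPreconditioner(p);
    ls.Build();

    ls.Solve(rhs, &x);

    ls.Clear();
    stop_rocalution();

    return 0;
}
```

The multi-node case follows the same pattern with the global (MPI-distributed) matrix and vector types, as in the cg-amg_mpi sample used to reproduce the crash.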