NVIDIA / AMGX

Distributed multigrid linear solver library on GPU

[Issue] Recovering from out of memory error #289

Open Samev opened 10 months ago

Samev commented 10 months ago

Describe the issue

When running AMGX on a case that is too large for the GPU, it reports the following error:

Thrust failure: transform: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
File and line number are not available for this exception.

when calling AMGX_solver_setup. Following this, we try to reset the AMGX solver, but when AMGX_solver_destroy is called it crashes the application (despite being wrapped in a try-catch block) with the following:

terminate called after throwing an instance of 'amgx::amgx_exception'
  what():  Cuda failure: 'an illegal memory access was encountered'

 /<censored>/lib/libamgxsh.so : amgx::handle_signals(int)+0xa2
 /lib/x86_64-linux-gnu/libc.so.6 : ()+0x42520
 /lib/x86_64-linux-gnu/libc.so.6 : pthread_kill()+0x12c
 /lib/x86_64-linux-gnu/libc.so.6 : raise()+0x16
 /lib/x86_64-linux-gnu/libc.so.6 : abort()+0xd3
 /lib/x86_64-linux-gnu/libstdc++.so.6 : ()+0xa2b9e
 /lib/x86_64-linux-gnu/libstdc++.so.6 : ()+0xae20c
 /lib/x86_64-linux-gnu/libstdc++.so.6 : ()+0xad1e9
 /lib/x86_64-linux-gnu/libstdc++.so.6 : __gxx_personality_v0()+0x99
 /lib/x86_64-linux-gnu/libgcc_s.so.1 : ()+0x16884
 /lib/x86_64-linux-gnu/libgcc_s.so.1 : _Unwind_RaiseException()+0x311
 /lib/x86_64-linux-gnu/libstdc++.so.6 : __cxa_throw()+0x3b
 /<censored>/lib/libamgxsh.so : amgx::dense_lu_solver::DenseLUSolver<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::~DenseLUSolver()+0x998
 /<censored>/lib/libamgxsh.so : amgx::dense_lu_solver::DenseLUSolver<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::~DenseLUSolver()+0xd
 /<censored>/lib/libamgxsh.so : amgx::AMG<(AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2>::~AMG()+0x42
 /<censored>/lib/libamgxsh.so : amgx::AlgebraicMultigrid_Solver<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::~AlgebraicMultigrid_Solver()+0x26
 /<censored>/lib/libamgxsh.so : amgx::AlgebraicMultigrid_Solver<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::~AlgebraicMultigrid_Solver()+0xd
 /<censored>/lib/libamgxsh.so : amgx::PBiCGStab_Solver<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::~PBiCGStab_Solver()+0x35
 /<censored>/lib/libamgxsh.so : amgx::PBiCGStab_Solver<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::~PBiCGStab_Solver()+0xd
 /<censored>/lib/libamgxsh.so : amgx::AMG_Solver<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::~AMG_Solver()+0x180
 /<censored>/lib/libamgxsh.so : std::_Sp_counted_ptr<amgx::AMG_Solver<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >*, (__gnu_cxx::_Lock_policy)2>::_M_dispose()+0x16
 /<censored>/lib/libamgxsh.so : std::_Sp_counted_ptr<amgx::CWrapHandle<AMGX_solver_handle_struct*, amgx::AMG_Solver<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> > >*, (__gnu_cxx::_Lock_policy)2>::_M_dispose()+0x56
 /<censored>/lib/libamgxsh.so : ()+0x1394590
 /<censored>/lib/libamgxsh.so : AMGX_solver_destroy()+0xe24

I'm wondering whether it is intended to be possible to recover from out-of-memory errors like this. I looked in the documentation and couldn't find anything indicating that a failure in AMGX_solver_setup needs special handling.

Obviously AMGX won't be able to handle this specific matrix+solver combination on this specific GPU, but the crash currently prevents us from destroying our AMGX solver object when we hit this limit, which is a problem since it brings the whole application down.

I tried skipping the call to AMGX_solver_destroy (proceeding with the rest of the *destroy and finalize commands), but then I run into the !!! detected some memory leaks in the code: trying to free non-empty temporary device pool !!! error, which makes sense since the solver object isn't destroyed in the intended order.
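For illustration, here is a stripped-down sketch of the call order we use (handle names, the abbreviated config string, and the omitted matrix upload are placeholders, not our actual code; in our application the calls are additionally wrapped in try-catch, error handling is reduced here to the one return code that matters for this issue):

/* Minimal sketch of the call order; placeholders only, not production code. */
#include "amgx_c.h"

int main(void)
{
    /* Abbreviated config; the full string is listed under "AMGX solver configuration". */
    const char *config_string =
        "config_version=2, solver(mainSolver)=PBICGSTAB, mainSolver:preconditioner(precon)=AMG";

    AMGX_config_handle    cfg;
    AMGX_resources_handle rsrc;
    AMGX_matrix_handle    A;
    AMGX_solver_handle    solver;

    AMGX_initialize();
    AMGX_config_create(&cfg, config_string);
    AMGX_resources_create_simple(&rsrc, cfg);
    AMGX_matrix_create(&A, rsrc, AMGX_mode_dDDI);
    AMGX_solver_create(&solver, rsrc, AMGX_mode_dDDI, cfg);
    /* ... upload the system matrix into A with AMGX_matrix_upload_all ... */

    /* For the too-large case, this is the call that reports the Thrust
       cudaErrorIllegalAddress failure. */
    AMGX_RC rc = AMGX_solver_setup(solver, A);
    if (rc != AMGX_RC_OK)
    {
        /* Attempted recovery: tear everything down in the usual order and
           fall back to a CPU solver. AMGX_solver_destroy is the call that
           throws amgx::amgx_exception and aborts the process. */
        AMGX_solver_destroy(solver);
        AMGX_matrix_destroy(A);
        AMGX_resources_destroy(rsrc);
        AMGX_config_destroy(cfg);
        AMGX_finalize();
        return 1;
    }

    /* ... AMGX_solver_solve, then the same teardown on success ... */
    AMGX_solver_destroy(solver);
    AMGX_matrix_destroy(A);
    AMGX_resources_destroy(rsrc);
    AMGX_config_destroy(cfg);
    AMGX_finalize();
    return 0;
}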

Environment information:

The same problem has been reported with the same build, but on at least an RTX 3090 card as well.

AMGX solver configuration

config_version=2,
determinism_flag=0,
solver(mainSolver)=PBICGSTAB,
mainSolver:preconditioner(precon)=AMG,
precon:cycle=V,
precon:max_levels=15,
precon:selector=PMIS,
precon:smoother(smooth)=BLOCK_JACOBI,
precon:presweeps=1,
precon:postsweeps=1,
precon:max_iters=1,
precon:interpolator=D2,
precon:interp_max_elements=6,
mainSolver:monitor_residual=1,
mainSolver:store_res_history=1,
mainSolver:norm=L2,
mainSolver:print_vis_data=1,
mainSolver:max_iters=10000,
mainSolver:tolerance=1e-09,
mainSolver:gmres_n_restart=30,
mainSolver:convergence=RELATIVE_INI_CORE

Matrix Data

I'm not able to share the matrix I'm currently using. If needed, I can see whether I can recreate this crash with a matrix that isn't sensitive.

Reproduction steps

Call order:

Additional context

-

hamsteri15 commented 1 month ago

I'm getting the same error and it is quite confusing. While the illegal memory access is probably why the solver crashes, the error message should ideally say that the illegal access happens because the GPU ran out of memory. The error can be reproduced using the amgx_mpi_poisson7 example:

mpirun -np 1 ./amgx_mpi_poisson7 -mode dDDI -p 600 600 600 1 1 1 -c ./../configs/PCG_AGGREGATION_JACOBI.json

log500.txt log600.txt

For a 500³ grid on an A100-80GB the solver passes, but for the 600³ grid it crashes. We use AMGX as part of a flow solver that has other GPU memory requirements of its own, so in practice the error already occurs at cell counts around 20M cells.

marsaev commented 3 weeks ago

@hamsteri15 Classical multigrid is quite memory hungry. I suggest adding aggressive_levels and/or max_row_sum to the AMG configuration (see the examples https://github.com/NVIDIA/AMGX/blob/main/src/configs/AMG_CLASSICAL_AGGRESSIVE_L1_TRUNC.json or https://github.com/NVIDIA/AMGX/blob/main/src/configs/FGMRES_CLASSICAL_AGGRESSIVE_PMIS.json) to reduce memory usage.
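For example, in the config string from the issue description the additions would look roughly like the lines below; the values are only illustrative starting points, see the linked JSON configs for tested combinations:

precon:aggressive_levels=2,
precon:max_row_sum=0.9,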

Samev commented 2 weeks ago

@marsaev Do you have any input on the original issue? I.e., should it be possible to gracefully destroy the AMGX solver if one runs into an out-of-memory error?

Or maybe this isn't an out-of-memory error at all and we are simply misinterpreting it as such?