NVIDIA / AMGX

Distributed multigrid linear solver library on GPU

[Issue] Recovering from out of memory error #289

Open Samev opened 6 months ago

Samev commented 6 months ago

Describe the issue

When AMGX is run on a case that is too large for the GPU, it reports the following error:

Thrust failure: transform: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
File and line number are not available for this exception.

This happens when calling AMGX_solver_setup. After this we try to reset the AMGX solver, but the call to AMGX_solver_destroy crashes the application (despite being inside a try-catch block) with the following:

terminate called after throwing an instance of 'amgx::amgx_exception'
  what():  Cuda failure: 'an illegal memory access was encountered'

 /<censored>/lib/libamgxsh.so : amgx::handle_signals(int)+0xa2
 /lib/x86_64-linux-gnu/libc.so.6 : ()+0x42520
 /lib/x86_64-linux-gnu/libc.so.6 : pthread_kill()+0x12c
 /lib/x86_64-linux-gnu/libc.so.6 : raise()+0x16
 /lib/x86_64-linux-gnu/libc.so.6 : abort()+0xd3
 /lib/x86_64-linux-gnu/libstdc++.so.6 : ()+0xa2b9e
 /lib/x86_64-linux-gnu/libstdc++.so.6 : ()+0xae20c
 /lib/x86_64-linux-gnu/libstdc++.so.6 : ()+0xad1e9
 /lib/x86_64-linux-gnu/libstdc++.so.6 : __gxx_personality_v0()+0x99
 /lib/x86_64-linux-gnu/libgcc_s.so.1 : ()+0x16884
 /lib/x86_64-linux-gnu/libgcc_s.so.1 : _Unwind_RaiseException()+0x311
 /lib/x86_64-linux-gnu/libstdc++.so.6 : __cxa_throw()+0x3b
 /<censored>/lib/libamgxsh.so : amgx::dense_lu_solver::DenseLUSolver<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::~DenseLUSolver()+0x998
 /<censored>/lib/libamgxsh.so : amgx::dense_lu_solver::DenseLUSolver<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::~DenseLUSolver()+0xd
 /<censored>/lib/libamgxsh.so : amgx::AMG<(AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2>::~AMG()+0x42
 /<censored>/lib/libamgxsh.so : amgx::AlgebraicMultigrid_Solver<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::~AlgebraicMultigrid_Solver()+0x26
 /<censored>/lib/libamgxsh.so : amgx::AlgebraicMultigrid_Solver<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::~AlgebraicMultigrid_Solver()+0xd
 /<censored>/lib/libamgxsh.so : amgx::PBiCGStab_Solver<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::~PBiCGStab_Solver()+0x35
 /<censored>/lib/libamgxsh.so : amgx::PBiCGStab_Solver<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::~PBiCGStab_Solver()+0xd
 /<censored>/lib/libamgxsh.so : amgx::AMG_Solver<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::~AMG_Solver()+0x180
 /<censored>/lib/libamgxsh.so : std::_Sp_counted_ptr<amgx::AMG_Solver<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >*, (__gnu_cxx::_Lock_policy)2>::_M_dispose()+0x16
 /<censored>/lib/libamgxsh.so : std::_Sp_counted_ptr<amgx::CWrapHandle<AMGX_solver_handle_struct*, amgx::AMG_Solver<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> > >*, (__gnu_cxx::_Lock_policy)2>::_M_dispose()+0x56
 /<censored>/lib/libamgxsh.so : ()+0x1394590
 /<censored>/lib/libamgxsh.so : AMGX_solver_destroy()+0xe24

Is it intended to be possible to recover from out-of-memory errors like this? I looked in the documentation and couldn't find anything indicating that a failure in AMGX_solver_setup needs special handling.
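One plausible reading of the backtrace, not confirmed against the AMGX source, is that the exception is thrown from inside a destructor (~DenseLUSolver). Since C++11, destructors are implicitly noexcept, so an exception escaping one calls std::terminate and never reaches an enclosing try-catch. The sketch below illustrates that language rule only. The class and function names are hypothetical stand-ins and are not AMGX code.

```cpp
#include <stdexcept>
#include <type_traits>

// Hypothetical stand-in for a cleanup call that reports a CUDA error by throwing.
inline void release_device_buffer() {
    throw std::runtime_error("Cuda failure: 'an illegal memory access was encountered'");
}

// Destructor without an exception specification: implicitly noexcept since C++11.
// If release_device_buffer() throws while this destructor runs, std::terminate
// is called and no enclosing try/catch ever sees the exception.
struct ImplicitNoexceptSolver {
    ~ImplicitNoexceptSolver() { release_device_buffer(); }
};

// Destructor explicitly allowed to throw, for comparison.
struct ThrowingDtorSolver {
    ~ThrowingDtorSolver() noexcept(false) { release_device_buffer(); }
};

// Returns true if the exception escaping the destructor was caught.
inline bool destroy_and_catch() {
    try {
        ThrowingDtorSolver solver;  // destroyed when the try block is left
    } catch (const std::runtime_error&) {
        return true;
    }
    return false;
}

// Compile-time view of the same rule.
static_assert(std::is_nothrow_destructible<ImplicitNoexceptSolver>::value,
              "default destructors are noexcept");
static_assert(!std::is_nothrow_destructible<ThrowingDtorSolver>::value,
              "noexcept(false) destructors may throw");
```

Calling destroy_and_catch() returns true, whereas wrapping an ImplicitNoexceptSolver in the same try-catch would abort the process with "terminate called after throwing an instance of ...", similar to the message in the report above.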

Obviously AMGX won't be able to handle this particular matrix and solver combination on this particular GPU. However, the crash currently prevents us from destroying our AMGX solver object when we hit this limit, which is a problem because it takes the whole application down.

I tried skipping the call to AMGX_solver_destroy and proceeding with the rest of the *destroy and finalize calls, but then I run into the "!!! detected some memory leaks in the code: trying to free non-empty temporary device pool !!!" error, which makes sense since the solver object isn't destroyed in the intended order.
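For context, the CUDA runtime documentation describes cudaErrorIllegalAddress as leaving the process in an inconsistent state: further CUDA work returns the same error until the process is terminated and relaunched. Under that constraint, one possible workaround is to run the GPU solve in a child process so that a crash cannot take the host application down. The following is a minimal POSIX-only sketch of that idea. The helper run_isolated and the two worker functions are hypothetical, and it assumes the parent has not initialised CUDA before forking.

```cpp
#include <cstdlib>
#include <functional>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Runs `work` in a forked child process. Returns the child's exit code on a
// normal exit, or -1 if fork/waitpid failed or the child was killed by a
// signal (for example SIGABRT raised via std::terminate).
inline int run_isolated(const std::function<int()>& work) {
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        // Child: all GPU work (create, setup, solve, destroy) would live here.
        try {
            _exit(work());
        } catch (...) {
            _exit(EXIT_FAILURE);  // never unwind back into the parent's code path
        }
    }
    int status = 0;
    if (waitpid(pid, &status, 0) != pid) return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

// Hypothetical workers standing in for a successful and a crashing solve.
inline int solve_ok() { return 0; }
inline int solve_crashes() { std::abort(); }
```

With these definitions, run_isolated(solve_ok) returns 0 and run_isolated(solve_crashes) returns -1 while the parent keeps running. A real application would also need some form of inter-process communication (a pipe, shared memory, or files) to pass the matrix in and the solution back, which is omitted here.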

Environment information:

The same problem has also been reported with the same build on at least an RTX 3090 card.

AMGX solver configuration

config_version=2,
determinism_flag=0,
solver(mainSolver)=PBICGSTAB,
mainSolver:preconditioner(precon)=AMG,
precon:cycle=V,
precon:max_levels=15,
precon:selector=PMIS,
precon:smoother(smooth)=BLOCK_JACOBI,
precon:presweeps=1,
precon:postsweeps=1,
precon:max_iters=1,
precon:interpolator=D2,
precon:interp_max_elements=6,
mainSolver:monitor_residual=1,
mainSolver:store_res_history=1,
mainSolver:norm=L2,
mainSolver:print_vis_data=1,
mainSolver:max_iters=10000,
mainSolver:tolerance=1e-09,
mainSolver:gmres_n_restart=30,
mainSolver:convergence=RELATIVE_INI_CORE

Matrix Data

I'm not able to share the matrix I'm currently using. If needed, I can try to reproduce the crash with a matrix that isn't sensitive.

Reproduction steps

Call order:

Additional context

-