Open Samev opened 10 months ago
I'm getting the same error and it is quite confusing. While "Illegal memory access" is probably why the solver crashes, the error message should probably say that the illegal access happens because the GPU ran out of memory. The error can be reproduced using the amgx_mpi_poisson7 example:
mpirun -np 1 ./amgx_mpi_poisson7 -mode dDDI -p 600 600 600 1 1 1 -c ./../configs/PCG_AGGREGATION_JACOBI.json
For a 500³ grid on an A100 80 GB the solver passes, but for a 600³ grid it crashes. We use AMGX as part of a flow solver that has other GPU memory requirements, so in practice the error already occurs at around 20M cells.
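For a rough sense of scale, here is a minimal sketch in C that estimates the storage of the fine-level matrix alone. It assumes a plain CSR layout for mode dDDI (8-byte values, 4-byte column indices, 4-byte row offsets) and a 7-point stencil whose boundary rows simply drop the out-of-domain neighbours. The function names are made up for illustration and are not part of AMGX, and the matrix generated by the example may differ slightly at the boundaries.

```c
#include <assert.h>
#include <stdint.h>

/* Nonzeros of a 7-point stencil on an n x n x n grid.
   Assumption: each of the 6 faces removes n*n neighbour entries. */
uint64_t poisson7_nnz(uint64_t n)
{
    assert(n > 0);
    return 7 * n * n * n - 6 * n * n;
}

/* Bytes for one CSR matrix with double values and 32-bit indices
   (assumed layout for mode dDDI): values + column indices + row offsets. */
uint64_t csr_bytes_dDDI(uint64_t rows, uint64_t nnz)
{
    return nnz * (8 + 4) + (rows + 1) * 4;
}
```

Under these assumptions, `csr_bytes_dDDI(n*n*n, poisson7_nnz(n))` gives about 11 GB for n = 500 and about 19 GB for n = 600. That is the fine-level matrix only; the multigrid hierarchy, vectors, and solver workspace come on top of it.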
@hamsteri15 Classical multigrid is quite memory hungry. I suggest adding aggressive_levels and/or max_row_sum to the AMG configuration to reduce memory usage (see the examples https://github.com/NVIDIA/AMGX/blob/main/src/configs/AMG_CLASSICAL_AGGRESSIVE_L1_TRUNC.json or https://github.com/NVIDIA/AMGX/blob/main/src/configs/FGMRES_CLASSICAL_AGGRESSIVE_PMIS.json ).
@marsaev Do you have any input on the original issue? I.e. should it be possible to gracefully destruct the AMGX solver if one runs into an out of memory error?
Or maybe this isn't an out of memory error at all and we are simply misinterpreting it as such?
Describe the issue
When running AMGX on a case that is too large for the GPU, it reports the following error when calling AMGX_solver_setup. Following this we try to reset the AMGX solver, but when AMGX_solver_destroy is called it crashes the application (despite being done within a try-catch block) with the following:
I'm wondering whether it is intended to be possible to recover from out of memory errors like this. I looked in the documentation and couldn't find anything specific indicating that a failure in AMGX_solver_setup needs special handling.
Obviously AMGX won't be able to handle the specific matrix + solver combination in question on this GPU, but the crash currently prevents us from destructing our AMGX solver object when we run into this limit. That is a problem, since it results in the application crashing completely.
I tried skipping the call to AMGX_solver_destroy (proceeding with the rest of the *destroy and finalize commands), but then I run into the "!!! detected some memory leaks in the code: trying to free non-empty temporary device pool !!!" error, which makes sense since the solver object isn't destroyed in the intended order.
Environment information:
Ubuntu 22.04 (through WSL on Windows 11)
CUDA 11.7.1
v2.3.0 + cherry-picked 8bb693b42acc64c1893835d95858cad350c790c1
(nvidia-smi reports the same version in Windows + WSL)
The same problem has also been reported on the same build with at least an RTX 3090 card.
AMGX solver configuration
Matrix Data
My currently used matrix I'm not able to share. If you need me to I can see if I can recreate this crash with a matrix that isn't sensitive.
Reproduction steps
Call order:
AMGX_solver_register_print_callback
AMGX_initialize
AMGX_initialize_plugins
AMGX_install_signal_handler
AMGX_config_create (global config)
AMGX_resources_create_simple
AMGX_config_create (for the specific solver)
AMGX_matrix_create
AMGX_vector_create (both rhs and solution)
AMGX_solver_create
AMGX_matrix_upload_all
AMGX_solver_setup
AMGX_solver_destroy
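The intended teardown for the call order above is the reverse of creation, and AMGX C functions report failure through an AMGX_RC return value. A minimal sketch in C of that pattern follows: check every return code, and on failure unwind only what was actually created. Everything prefixed with mock_ is a hypothetical stand-in so that the example runs without a GPU; none of it is AMGX API, and the sketch does not address a crash that happens inside the destroy call itself.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical status type mimicking a C API that returns error codes. */
typedef enum { MOCK_RC_OK = 0, MOCK_RC_FAIL = 1 } mock_rc;

typedef struct {
    int config, resources, matrix, solver; /* 1 while the object exists */
    int destroy_calls;                     /* teardown steps executed   */
} mock_state;

/* Stand-in for a *_create call: marks the object as existing. */
mock_rc mock_create(int *exists) { *exists = 1; return MOCK_RC_OK; }

/* Stand-in for a *_destroy call: marks the object as gone. */
mock_rc mock_destroy(mock_state *s, int *exists)
{
    *exists = 0;
    s->destroy_calls++;
    return MOCK_RC_OK;
}

/* Stand-in for the setup step; fails on request to simulate the OOM case. */
mock_rc mock_solver_setup(int simulate_failure)
{
    return simulate_failure ? MOCK_RC_FAIL : MOCK_RC_OK;
}

/* Create, set up, then always unwind in reverse creation order.
   Returns the status of the first step that failed, or MOCK_RC_OK. */
mock_rc run_case(mock_state *s, int simulate_failure)
{
    mock_rc rc;
    assert(s != NULL);

    if ((rc = mock_create(&s->config)) != MOCK_RC_OK) goto cleanup;
    if ((rc = mock_create(&s->resources)) != MOCK_RC_OK) goto cleanup;
    if ((rc = mock_create(&s->matrix)) != MOCK_RC_OK) goto cleanup;
    if ((rc = mock_create(&s->solver)) != MOCK_RC_OK) goto cleanup;

    rc = mock_solver_setup(simulate_failure);
    if (rc != MOCK_RC_OK)
        fprintf(stderr, "setup failed with rc=%d, unwinding\n", (int)rc);

cleanup:
    /* Destroy only objects that exist, newest first. */
    if (s->solver)    mock_destroy(s, &s->solver);
    if (s->matrix)    mock_destroy(s, &s->matrix);
    if (s->resources) mock_destroy(s, &s->resources);
    if (s->config)    mock_destroy(s, &s->config);
    return rc;
}
```

Calling `run_case(&state, 1)` on a zero-initialized `mock_state` returns the failure status and leaves no object behind, which is the behaviour the application needs after a failed setup.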