When running AMGX on a case that is too large for the GPU, it reports the following error:
Thrust failure: transform: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
File and line number are not available for this exception.
This happens when calling AMGX_solver_setup. After that we try to reset the AMGX solver, but when AMGX_solver_destroy is called it crashes the application (even though the call is made inside a try-catch block) with the following:
Is it intended to be possible to recover from out-of-memory errors like this? I looked through the documentation and couldn't find anything indicating that a failure in AMGX_solver_setup needs special handling.
Obviously AMGX won't be able to handle this particular matrix and solver combination on this particular GPU, but the crash currently prevents us from destroying our AMGX solver object when we run into this limit. That is a problem, since it brings down the whole application.
I tried skipping the call to AMGX_solver_destroy (proceeding with the rest of the *destroy and finalize calls), but then I run into the "!!! detected some memory leaks in the code: trying to free non-empty temporary device pool !!!" error, which makes sense since the solver object isn't destroyed in the intended order.
Environment information:
OS: Ubuntu 22.04 (through WSL on Windows 11)
CUDA runtime: CUDA 11.7.1
MPI version (if applicable): Not applicable
AMGX version or commit hash: v2.3.0 + cherry-picked 8bb693b42acc64c1893835d95858cad350c790c1
NVIDIA driver: 528.24 (probably the Windows driver version, since nvidia-smi reports the same version in Windows and in WSL)
NVIDIA GPU: RTX4080
Any related environment variables information: Not applicable
The same problem has also been reported with the same build on at least one RTX3090 card.
AMGX solver configuration
Matrix Data
I'm not able to share the matrix I'm currently using. If needed, I can try to reproduce the crash with a matrix that isn't sensitive.
Reproduction steps
Call order:
AMGX_solver_register_print_callback
AMGX_initialize
AMGX_initialize_plugins
AMGX_install_signal_handler
AMGX_config_create (global config)
AMGX_resources_create_simple
AMGX_config_create (for the specific solver)
AMGX_matrix_create
AMGX_vector_create (both rhs and solution)
AMGX_solver_create
AMGX_matrix_upload_all
AMGX_solver_setup
AMGX_solver_destroy
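
For anyone trying to reproduce this, the call order above corresponds roughly to the C sketch below. It is an illustration under assumptions, not the application code: the matrix is a stand-in 1D Laplacian whose row count is taken from the command line (the real matrix cannot be shared, and this stand-in is not guaranteed to hit the same failure), both option strings are supplied by the caller because the solver configuration is not reproduced here, the mode AMGX_mode_dDDI is assumed, and the helper names check, MUST and print_cb are made up for this example. It registers the callback through AMGX_register_print_callback rather than the AMGX_solver_register_print_callback variant listed above.

```c
/* Sketch of the reported call order; stand-in matrix and caller-supplied
 * options, NOT the reporter's actual code or configuration. */
#include <stdio.h>
#include <stdlib.h>
#include <amgx_c.h>

/* Print the AMGX error string for a failed call and pass the code through. */
static AMGX_RC check(AMGX_RC rc, const char *what)
{
    if (rc != AMGX_RC_OK) {
        char msg[4096];
        AMGX_get_error_string(rc, msg, (int)sizeof msg);
        fprintf(stderr, "%s failed: %s\n", what, msg);
    }
    return rc;
}

/* Give up on the first failing call (used for everything except setup). */
#define MUST(call) do { if (check((call), #call) != AMGX_RC_OK) return 1; } while (0)

static void print_cb(const char *msg, int length)
{
    fwrite(msg, 1, (size_t)length, stdout);
}

int main(int argc, char **argv)
{
    if (argc < 4) {
        fprintf(stderr, "usage: %s <rows> <global-options> <solver-options>\n", argv[0]);
        return 1;
    }
    const int n = atoi(argv[1]);
    if (n < 2) return 1;

    /* Assumed stand-in matrix: tridiagonal 1D Laplacian in CSR format. */
    const int nnz = 3 * n - 2;
    int *row_ptrs = malloc((size_t)(n + 1) * sizeof *row_ptrs);
    int *col_idx = malloc((size_t)nnz * sizeof *col_idx);
    double *vals = malloc((size_t)nnz * sizeof *vals);
    if (!row_ptrs || !col_idx || !vals) return 1;
    int k = 0;
    for (int i = 0; i < n; ++i) {
        row_ptrs[i] = k;
        if (i > 0)     { col_idx[k] = i - 1; vals[k++] = -1.0; }
        col_idx[k] = i; vals[k++] = 2.0;
        if (i < n - 1) { col_idx[k] = i + 1; vals[k++] = -1.0; }
    }
    row_ptrs[n] = k;

    const AMGX_Mode mode = AMGX_mode_dDDI;  /* assumed mode */
    AMGX_config_handle global_cfg, solver_cfg;
    AMGX_resources_handle rsrc;
    AMGX_matrix_handle A;
    AMGX_vector_handle b, x;
    AMGX_solver_handle solver;

    MUST(AMGX_register_print_callback(print_cb));
    MUST(AMGX_initialize());
    MUST(AMGX_initialize_plugins());
    MUST(AMGX_install_signal_handler());
    MUST(AMGX_config_create(&global_cfg, argv[2]));   /* global config */
    MUST(AMGX_resources_create_simple(&rsrc, global_cfg));
    MUST(AMGX_config_create(&solver_cfg, argv[3]));   /* solver config */
    MUST(AMGX_matrix_create(&A, rsrc, mode));
    MUST(AMGX_vector_create(&b, rsrc, mode));         /* rhs */
    MUST(AMGX_vector_create(&x, rsrc, mode));         /* solution */
    MUST(AMGX_solver_create(&solver, rsrc, mode, solver_cfg));
    MUST(AMGX_matrix_upload_all(A, n, nnz, 1, 1, row_ptrs, col_idx, vals, NULL));

    /* Setup is allowed to fail; the question is what happens afterwards. */
    AMGX_RC setup_rc = check(AMGX_solver_setup(solver, A), "AMGX_solver_setup");

    /* Reported crash site when setup failed on a too-large case. */
    check(AMGX_solver_destroy(solver), "AMGX_solver_destroy");
    fprintf(stderr, "survived AMGX_solver_destroy (setup rc = %d)\n", (int)setup_rc);

    /* Remaining teardown, in reverse order of creation. */
    AMGX_vector_destroy(x);
    AMGX_vector_destroy(b);
    AMGX_matrix_destroy(A);
    AMGX_resources_destroy(rsrc);
    AMGX_config_destroy(solver_cfg);
    AMGX_config_destroy(global_cfg);
    AMGX_finalize_plugins();
    AMGX_finalize();
    free(row_ptrs); free(col_idx); free(vals);
    return setup_rc == AMGX_RC_OK ? 0 : 2;
}
```

If the final "survived" line is never printed after a failed setup, the process died inside AMGX_solver_destroy, which is the behaviour described in this report.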
Additional context
-