NVIDIA / AMGX

Distributed multigrid linear solver library on GPU

"Cannot allocate pinned memory" error on a supercomputer #316

Open jooj211 opened 4 months ago

jooj211 commented 4 months ago

I am encountering a "Cannot allocate pinned memory" error while running a program that uses AMGX solvers on a supercomputer managed by the SLURM Workload Manager. The program fails to allocate the pinned memory needed for efficient GPU memory transfers.

Here's the full output file:

Nonlinear Elasticity FEM Solver
Updated Lagrangian Formulation
Setting material and elasticity type
Setting boundary conditions
Setup of non-linear elasticity problem
Reading parameters from XML file
Creating Incompressible Material
Bulk modulus: 300
 Hyperelastic material: Guccione
 Material properties: 10 1 1 1 0 300
 Number of nodal loads: 0
 Number of prescribed displ.: 249
 Number of traction (Neumann) loads: 0
 Number of dirichlet boundary conds:0
 Number of normal pressure loads: 1
 Number of spring boundary conds: 0
Solving problem
Reading XML mesh
Reading XML mesh file: ./prob2_12x27x2_k300.xml
Fiber model: fiber_isotropic
Mesh information
 Number of dimensions: 3
 Number of nodes: 975
 Number of elements: 648
 Number of boundary elements: 324
Output step: 1
Solving nonlinear problem (UL) using NonlinearSolver
Initial inner volume: 3189.27
Initial cavity volume: 0
Size of the problem 2925
Matrices and vectors creation done
 Load increment 1
 1 Newton-LS step
Caught amgx exception: Cannot allocate pinned memory
 at: /prj/hearttwins/jonatas.costa/source/amgx/src/global_thread_handle.cu:351
Stack trace:
 /scratch/hearttwins/jonatas.costa/source/amgx/lib/libamgxsh.so : amgx::memory::PinnedMemoryPool::PinnedMemoryPool()+0x412
 /scratch/hearttwins/jonatas.costa/source/amgx/lib/libamgxsh.so : amgx::allocate_resources(unsigned long, unsigned long, unsigned long, unsigned long, unsigned long)+0x3b
 /scratch/hearttwins/jonatas.costa/source/amgx/lib/libamgxsh.so : amgx::Resources::Resources(amgx::AMG_Configuration*, void*, int, int const*)+0xc24
 /scratch/hearttwins/jonatas.costa/source/amgx/lib/libamgxsh.so : AMGX_resources_create_simple()+0x8b
 ./nonlinearelas : petsc::LinearSolver::solve(petsc::Matrix&, petsc::Vector&, petsc::Vector&, double)+0x25d
 ./nonlinearelas : NewtonLineSearch::solve()+0x222
 ./nonlinearelas : UpdatedLagrangian::solve()+0x3c4
 ./nonlinearelas : NonlinearElasticity::run()+0x6b
 ./nonlinearelas : main()+0x479
 /lib64/libc.so.6 : __libc_start_main()+0xf5
 ./nonlinearelas() [0x45336f]

Here's the SLURM output:

AMGX ERROR: file /prj/hearttwins/jonatas.costa/nodal_cardiax/src/linalg/petsc_linear_solver.cpp line    371
AMGX ERROR: Insufficient memory.
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
[unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=1
:
system msg for write_line failure : Bad file descriptor

And these were the SLURM options used in this test:

#SBATCH --nodes=1             
#SBATCH --ntasks-per-node=1
#SBATCH --ntasks=1
#SBATCH -p nvidia_dev
#SBATCH --time=0:01:00
#SBATCH -J cardiax_test
#SBATCH --gres=gpu:0
#SBATCH --mem-bind=local
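
(Not from the thread, just a hedged sketch of a quick sanity check: batch environments sometimes impose a small locked-memory limit, which can make pinned allocations fail. Printing the limit inside the job script shows what the job actually gets.)

```shell
# Hypothetical addition to the job script above: print the locked-memory
# limit inside the allocation, since pinned (page-locked) allocations can
# fail when RLIMIT_MEMLOCK is small.
ulimit -l
# If this prints a small number instead of "unlimited", ask the cluster
# admins about SLURM's PropagateResourceLimits setting or raising the limit.
```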

System info

Operating System: Linux Red Hat 7.9
CUDA Version: 12.3
GCC Version: 9.3
MPI Version: 3.4
AMGX Version: 2.5.0
GPU Model: NVIDIA Tesla K40t
NVIDIA Driver Version: 470.82.01

Any guidance or suggestions on resolving this issue would be greatly appreciated. Thank you!

marsaev commented 4 months ago

@jooj211

GPU Model: NVIDIA Tesla K40t
CUDA Version: 12.3
NVIDIA Driver Version: 470.82.01

CUDA v12.3 doesn't support the Kepler architecture, and it also requires a more recent driver version (see https://docs.nvidia.com/cuda/archive/12.3.0/cuda-toolkit-release-notes/index.html#id4 ). The latest CUDA release that works with Kepler GPUs is 11.4.

jooj211 commented 4 months ago

CUDA v12.3 doesn't support the Kepler architecture, and it also requires a more recent driver version (see https://docs.nvidia.com/cuda/archive/12.3.0/cuda-toolkit-release-notes/index.html#id4 ). The latest CUDA release that works with Kepler GPUs is 11.4.

Thank you for the heads-up! I've now switched to CUDA 11.4. After recompiling both AMGX and my program, however, the same problem persists:

Caught amgx exception: Cannot allocate pinned memory
 at: /prj/hearttwins/jonatas.costa/source/amgx/src/global_thread_handle.cu:351
Stack trace:
 /scratch/hearttwins/jonatas.costa/source/amgx/lib/libamgxsh.so : amgx::memory::PinnedMemoryPoo>
 /scratch/hearttwins/jonatas.costa/source/amgx/lib/libamgxsh.so : amgx::allocate_resources(unsi>
 /scratch/hearttwins/jonatas.costa/source/amgx/lib/libamgxsh.so : amgx::Resources::Resources(am>
 /scratch/hearttwins/jonatas.costa/source/amgx/lib/libamgxsh.so : AMGX_resources_create_simple(>
 ./nonlinearelas : petsc::LinearSolver::solve(petsc::Matrix&, petsc::Vector&, petsc::Vector&, d>
 ./nonlinearelas : NewtonLineSearch::solve()+0x222
 ./nonlinearelas : UpdatedLagrangian::solve()+0x3c4
 ./nonlinearelas : NonlinearElasticity::run()+0x6b
 ./nonlinearelas : main()+0x479
 /lib64/libc.so.6 : __libc_start_main()+0xf5
 ./nonlinearelas() [0x45336f]

And in the SLURM output file:


AMGX ERROR: file /prj/hearttwins/jonatas.costa/nodal_cardiax/src/linalg/petsc_linear_solver.cpp>
AMGX ERROR: Insufficient memory.
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
[unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=1
:
system msg for write_line failure : Bad file descriptor

marsaev commented 4 months ago

@jooj211

I wonder if it's somehow related to what another user reported in this issue: https://github.com/NVIDIA/AMGX/issues/313

First thing: does your environment support locked memory? You can try running an example that allocates the same amount of pinned memory to see whether it's an environment issue, something like this: https://godbolt.org/z/7ab86qc34

Next, I would suggest trying a simple AMGX example with the same config that you use in your application.

jooj211 commented 4 months ago

@marsaev

First thing: does your environment support locked memory? You can try running an example that allocates the same amount of pinned memory to see whether it's an environment issue, something like this: https://godbolt.org/z/7ab86qc34

Interestingly, the example ran without any issues, so it doesn't seem like an environment problem. When trying the AMGX examples, though, I ran into the same problem I had in my application:

AMGX version 2.5.0
Built on Jun 14 2024, 13:49:23
Compiled with CUDA Runtime 12.3, using CUDA driver 11.4
Warning: No mode specified, using dDDI by default.
Caught amgx exception: Cannot allocate pinned memory
 at: /prj/hearttwins/jonatas.costa/source/amgx/src/global_thread_handle.cu:351
Stack trace:
 /prj/hearttwins/jonatas.costa/source/amgx/lib/libamgxsh.so : amgx::memory::PinnedMemoryPool::PinnedMemoryPool()+0x412
 /prj/hearttwins/jonatas.costa/source/amgx/lib/libamgxsh.so : amgx::allocate_resources(unsigned long, unsigned long, unsigned long, unsigned long, unsigned long)+0x3b
 /prj/hearttwins/jonatas.costa/source/amgx/lib/libamgxsh.so : amgx::Resources::Resources(amgx::AMG_Configuration*, void*, int, int const*)+0xc24
 /prj/hearttwins/jonatas.costa/source/amgx/lib/libamgxsh.so : AMGX_resources_create_simple()+0x8b
 ./examples/amgx_capi() [0x40172c]
 /lib64/libc.so.6 : __libc_start_main()+0xf5
 ./examples/amgx_capi() [0x401de3]

Caught signal 11 - SIGSEGV (segmentation violation)
marsaev commented 4 months ago

Can you check one last small thing? There is still this in the output:

Compiled with CUDA Runtime 12.3

Can you check that CUDA 11.4 is actually used at runtime? (You can try running ldd /prj/hearttwins/jonatas.costa/source/amgx/lib/libamgxsh.so)

Sorry for the misleading output; that message actually reports which version is being used at runtime ( https://github.com/NVIDIA/AMGX/blob/main/src/core.cu#L738-L751 )
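
(A sketch of the ldd check suggested above. The stand-in path /bin/sh keeps the snippet runnable anywhere; on the cluster you would point it at the library path from this thread.)

```shell
# Show which runtime libraries a shared object resolves at run time.
# On the cluster, set LIB=/prj/hearttwins/jonatas.costa/source/amgx/lib/libamgxsh.so
LIB=${LIB:-/bin/sh}   # runnable stand-in; replace with libamgxsh.so
ldd "$LIB" | grep -i -E 'cudart|libc'
# Any libcudart line shows which CUDA runtime is actually resolved; no
# libcudart line means the CUDA runtime was linked statically at build time.
```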