jooj211 opened 4 months ago
@jooj211
GPU Model: NVIDIA Tesla K40t
CUDA Version: 12.3
NVIDIA Driver Version: 470.82.01
CUDA v12.3 doesn't support the Kepler architecture. CUDA v12.3 also requires a more recent driver version: https://docs.nvidia.com/cuda/archive/12.3.0/cuda-toolkit-release-notes/index.html#id4 The latest CUDA version that should work for Kepler GPUs is 11.4.
Thank you for the heads up! I've now changed the version of CUDA to 11.4. After recompiling both AMGX and my program, however, the same problem persists:
Caught amgx exception: Cannot allocate pinned memory
at: /prj/hearttwins/jonatas.costa/source/amgx/src/global_thread_handle.cu:351
Stack trace:
/scratch/hearttwins/jonatas.costa/source/amgx/lib/libamgxsh.so : amgx::memory::PinnedMemoryPoo>
/scratch/hearttwins/jonatas.costa/source/amgx/lib/libamgxsh.so : amgx::allocate_resources(unsi>
/scratch/hearttwins/jonatas.costa/source/amgx/lib/libamgxsh.so : amgx::Resources::Resources(am>
/scratch/hearttwins/jonatas.costa/source/amgx/lib/libamgxsh.so : AMGX_resources_create_simple(>
./nonlinearelas : petsc::LinearSolver::solve(petsc::Matrix&, petsc::Vector&, petsc::Vector&, d>
./nonlinearelas : NewtonLineSearch::solve()+0x222
./nonlinearelas : UpdatedLagrangian::solve()+0x3c4
./nonlinearelas : NonlinearElasticity::run()+0x6b
./nonlinearelas : main()+0x479
/lib64/libc.so.6 : __libc_start_main()+0xf5
./nonlinearelas() [0x45336f]
And in the slurm out file:
AMGX ERROR: file /prj/hearttwins/jonatas.costa/nodal_cardiax/src/linalg/petsc_linear_solver.cpp>
AMGX ERROR: Insufficient memory.
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
[unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=1
:
system msg for write_line failure : Bad file descriptor
@jooj211
I wonder if it's somehow related to what another user reported in this issue: https://github.com/NVIDIA/AMGX/issues/313
First thing - does your environment support locked memory? You can try running an example that allocates the same amount of pinned memory to see whether it's an environment issue, something like this: https://godbolt.org/z/7ab86qc34 Next, I would suggest trying a simple AMGX example with the same config that you use in your application.
@marsaev
First thing - does your environment support locked memory? You can try running an example that allocates the same amount of pinned memory to see whether it's an environment issue, something like this: https://godbolt.org/z/7ab86qc34
Interestingly, the example ran without any issues, so it doesn't seem to be an environment problem. When trying the AMGX examples, though, I ran into the same problem I had in my application:
AMGX version 2.5.0
Built on Jun 14 2024, 13:49:23
Compiled with CUDA Runtime 12.3, using CUDA driver 11.4
Warning: No mode specified, using dDDI by default.
Caught amgx exception: Cannot allocate pinned memory
at: /prj/hearttwins/jonatas.costa/source/amgx/src/global_thread_handle.cu:351
Stack trace:
/prj/hearttwins/jonatas.costa/source/amgx/lib/libamgxsh.so : amgx::memory::PinnedMemoryPool::PinnedMemoryPool()+0x412
/prj/hearttwins/jonatas.costa/source/amgx/lib/libamgxsh.so : amgx::allocate_resources(unsigned long, unsigned long, unsigned long, unsigned long, unsigned long)+0x3b
/prj/hearttwins/jonatas.costa/source/amgx/lib/libamgxsh.so : amgx::Resources::Resources(amgx::AMG_Configuration*, void*, int, int const*)+0xc24
/prj/hearttwins/jonatas.costa/source/amgx/lib/libamgxsh.so : AMGX_resources_create_simple()+0x8b
./examples/amgx_capi() [0x40172c]
/lib64/libc.so.6 : __libc_start_main()+0xf5
./examples/amgx_capi() [0x401de3]
Caught signal 11 - SIGSEGV (segmentation violation)
Can you check one last small thing? There is still this in the output:
Compiled with CUDA Runtime 12.3
Can you check that CUDA 11.4 is actually being used at runtime? (You can try running ldd /prj/hearttwins/jonatas.costa/source/amgx/lib/libamgxsh.so)
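Both checks can be done from the shell; the library path below is the one from the stack trace, and the grep pattern assumes libamgxsh.so links the CUDA runtime dynamically:

```shell
# Path taken from the stack trace above; adjust if the build lives elsewhere.
lib=/prj/hearttwins/jonatas.costa/source/amgx/lib/libamgxsh.so

# Which CUDA runtime is the library actually linked against?
if [ -f "$lib" ]; then
    ldd "$lib" | grep -i cudart
else
    echo "library not found at $lib"
fi

# Locked-memory limit for this shell; a small value here can make
# page-locked (pinned) allocations fail in some environments.
ulimit -l
```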
Sorry for the misleading output; that message actually shows which version is being used at runtime ( https://github.com/NVIDIA/AMGX/blob/main/src/core.cu#L738-L751 )
I am encountering a "Cannot allocate pinned memory" error while running a program that uses AMGX solvers on a supercomputer managed by the SLURM Workload Manager. The program fails to allocate the pinned memory needed for efficient GPU memory transfers.
Here's the full output file:
Here's the SLURM output:
And these were the SLURM options used in this test:
System info
Operating System: Linux Red Hat 7.9
CUDA Version: 12.3
GCC Version: 9.3
MPI Version: 3.4
AMGX Version: 2.5.0
GPU Model: NVIDIA Tesla K40t
NVIDIA Driver Version: 470.82.01
Any guidance or suggestions on resolving this issue would be greatly appreciated. Thank you!