NVIDIA / AMGX

Distributed multigrid linear solver library on GPU

Caught amgx exception: Cannot allocate pinned memory #313

Open AnjaliSandip opened 3 weeks ago

AnjaliSandip commented 3 weeks ago

I am using amgx with PETSc (-pc_type amgx) to run multiphysics simulations. I am encountering this error even after having scaled down the problem size significantly.

Caught amgx exception: Cannot allocate pinned memory

I have attached the output and error log files for your reference. Thank you for any feedback you can provide.

outlog.docx errlog.docx

Environment information:

marsaev commented 3 weeks ago

@AnjaliSandip The error seems to indicate that the pinned memory pool could not be allocated:

Caught amgx exception: Cannot allocate pinned memory
 at: /home/anjali.sandip/ISSM/ISSM/externalpackages/petsc/src/arch-linux-c-opt/externalpackages/git.amgx/src/global_thread_handle.cu:374

Its size is currently fixed at 100 MB (https://github.com/NVIDIA/AMGX/blob/v2.4.0/src/global_thread_handle.cu#L51), regardless of the input data. This allocation happens during resource creation, at which point the problem size is not yet known.

Is your process allowed to allocate page-locked memory? For example, Docker containers need the corresponding ulimit flag: --ulimit memlock=-1

AnjaliSandip commented 3 weeks ago

Thank you for your response. I am using PETSc with its AMGX interface. PETSc has an option for setting the minimum data size for which pinned memory will be used for host (CPU) allocations:

#include "petscvec.h"

PetscErrorCode VecSetPinnedMemoryMin(Vec v, size_t mbytes)

Is this what you are referring to?

marsaev commented 6 days ago

@AnjaliSandip Sorry for the delayed reply. I'm not familiar with PETSc internals, but unless the PETSc environment somehow hooks cudaMallocHost, its settings shouldn't affect AMGX, since AMGX calls the CUDA Runtime directly: https://github.com/NVIDIA/AMGX/blob/v2.4.0/src/global_thread_handle.cu#L378

You can try running an example that allocates the same amount of pinned memory to see whether it's an environment issue, something like this: https://godbolt.org/z/7ab86qc34

If there is no obvious or easy fix for the page-locked memory problem, I would suggest opening a ticket with PETSc (https://gitlab.com/petsc/petsc/-/issues), as they know more about the PETSc details that might matter here. You can link this issue for reference, and I can follow up if there are further questions about AMGX.