ECP-WarpX / WarpX

WarpX is an advanced electromagnetic & electrostatic Particle-In-Cell code.
https://ecp-warpx.github.io

Large Number of Grids Causing Insufficient GPU Memory #5366

Open Lucas-Lucas1 opened 6 days ago

Lucas-Lucas1 commented 6 days ago

When performing 3D simulations, I want to use a large grid of 2048 × 64 × 2048 cells, but I encounter the following error:

amrex::Abort::1::Out of gpu memory. Free: 2293760 Asked: 8388608 !!!
SIGABRT
amrex::Abort::0::Out of gpu memory. Free: 2293760 Asked: 8388608 !!!
SIGABRT
amrex::Abort::3::Out of gpu memory. Free: 2293760 Asked: 8388608 !!!
SIGABRT
amrex::Abort::2::Out of gpu memory. Free: 2293760 Asked: 8388608 !!!
SIGABRT
See Backtrace.0 file for details
See Backtrace.1 file for details
See Backtrace.2 file for details
See Backtrace.3 file for details
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI COMMUNICATOR 3 DUP FROM 0
with errorcode 6.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[gpu005.cluster.cn:51980] 3 more processes have sent help message help-mpi-api.txt / mpi-abort
[gpu005.cluster.cn:51980] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

Backtrace.0.txt

What can I do to resolve this? As a new WarpX user, there are many details in the official documentation that I'm still learning.

Additionally, I want to apply a time-varying external electromagnetic field in a specific region. I've reviewed issue #5046, but I noticed that the if(..., ...) parser expressions used there caused problems. Does the latest version of WarpX support setting this up with if() expressions, or is there a better method now?
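Roughly what I have in mind is something like the sketch below (the field class, expressions, amplitude, and region are placeholders I pieced together from the documentation, and I'm not sure they are valid in the current version):

```python
from pywarpx import picmi

# Sketch only: a time-varying Ez that is nonzero only inside |x| < 10 um
# and only for t < 100 fs, using the parser's if(condition, a, b) form.
# Amplitude, region, and time dependence here are placeholder values.
external_field = picmi.AnalyticAppliedField(
    Ex_expression="0.",
    Ey_expression="0.",
    Ez_expression="if((abs(x)<10.e-6)*(t<100.e-15), "
                  "1.e9*sin(2*pi*t/50.e-15), 0.)",
    Bx_expression="0.",
    By_expression="0.",
    Bz_expression="0.",
)

# sim is the picmi.Simulation object defined elsewhere in the script:
# sim.add_applied_field(external_field)
```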

ax3l commented 2 days ago

Hi @Lucas-Lucas1,

Thanks for reaching out. Did you already read https://warpx.readthedocs.io/en/latest/usage/workflows/domain_decomposition.html ?

To guide you a bit more, can you post the inputs and submission scripts you are using?
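As a concrete starting point, the decomposition is usually steered from the grid definition in a PICMI script. A minimal sketch, assuming the WarpX-specific grid keywords below exist in your installed version (the domain extents and boundary conditions are placeholders):

```python
from pywarpx import picmi

# Sketch: split a 2048 x 64 x 2048 domain into smaller boxes so that each
# MPI rank / GPU only holds a manageable share of the field data.
grid = picmi.Cartesian3DGrid(
    number_of_cells=[2048, 64, 2048],
    lower_bound=[-100.e-6, -5.e-6, -100.e-6],   # placeholder extents
    upper_bound=[100.e-6, 5.e-6, 100.e-6],
    lower_boundary_conditions=["periodic", "periodic", "periodic"],
    upper_boundary_conditions=["periodic", "periodic", "periodic"],
    # Assumed WarpX-specific pass-throughs to amr.max_grid_size and
    # amr.blocking_factor, which bound the box sizes of the decomposition.
    warpx_max_grid_size=128,
    warpx_blocking_factor=32,
)
```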

roelof-groenewald commented 2 days ago

Hi @Lucas-Lucas1. It would also be helpful to know how many GPUs (and what kind) you are trying to run this simulation on, and how many particles you have in total. Note that WarpX keeps the particle data resident in GPU memory, since moving it between the GPU and CPU is time-consuming. For this reason you need enough total GPU memory to hold all the particles in your simulation. In my experience a 40 GB A100 GPU can hold about 200 million particles, so a large simulation with, say, 800 million particles needs at least 4 A100 GPUs.
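As a rough illustration of why this matters (the per-cell and per-particle byte counts below are order-of-magnitude guesses for a double-precision run, not WarpX internals):

```python
# Back-of-the-envelope GPU memory estimate for a 2048 x 64 x 2048 grid.
nx, ny, nz = 2048, 64, 2048
cells = nx * ny * nz                     # ~2.7e8 cells

bytes_per_cell = 8 * 12                  # E, B, J, rho components, double precision (rough guess)
field_gb = cells * bytes_per_cell / 1e9  # ~26 GB of field data

particles_per_cell = 8                   # depends on the input script
bytes_per_particle = 80                  # positions, momenta, weight, ids (rough guess)
particle_gb = cells * particles_per_cell * bytes_per_particle / 1e9  # ~170 GB

print(f"fields ~ {field_gb:.0f} GB, particles ~ {particle_gb:.0f} GB")
# Compare the sum against (number of GPUs) x 40 GB for A100-40GB cards,
# keeping in mind that guard cells, diagnostics, and buffers add overhead.
```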

Lucas-Lucas1 commented 2 days ago

Thanks for your responses. In fact, I haven't yet read the Domain Decomposition documentation; I will study it as soon as possible.

Below are my input script test.py and submission script sbatch.sh. test.py.txt sbatch.sh.txt

My cluster consists of 9 NVIDIA DGX A100 high-performance computing servers. Each server is equipped with dual AMD EPYC 7742 (Rome) 64-core/128-thread processors, 1 TB of DDR4 memory, and 8 NVIDIA A100 40 GB SXM4 accelerators.