ECP-WarpX / WarpX

WarpX is an advanced electromagnetic & electrostatic Particle-In-Cell code.
https://ecp-warpx.github.io

Cori: Bus Error instead of OOM #3250

Open n01r opened 2 years ago

n01r commented 2 years ago

I encountered a case on Cori KNL nodes where a WarpX setup that likely ran out of memory reported a bus error instead of an OOM error.

The binary was compiled as warpx.2d.MPI.OMP.DP.PDP.OPMD.PSATD.QED

Input deck: input_sets.zip. There are three setups running the same 2D box size on one, two, or four KNL nodes.

The submit file debug_2nodes.sbatch fails with the following errors on 2 KNL nodes.

The only output in output.txt reads

MPI initialized with 16 MPI processes
MPI initialized with thread support level 3
OMP initialized with 8 OMP threads
AMReX (22.06-39-g2d931f63cb4d) initialized
WarpX (22.06-22-g6be401a3c732)
PICSAR (2becfe066559)
Level 0: dt = 4.338939207e-18 ; dx = 1.302083333e-09 ; dz = 1.302083333e-09

The WarpX.e<JobID> file shows the bus error:

srun: error: nid09797: task 7: Bus error
srun: launch/slurm: _step_signal: Terminating StepId=61191894.0
slurmstepd: error: *** STEP 61191894.0 ON nid09797 CANCELLED AT 2022-07-19T15:34:18 ***
srun: error: nid09797: tasks 3-4: Terminated
srun: error: nid09797: tasks 0,5-6: Terminated
srun: error: nid09797: task 1: Terminated
srun: error: nid09798: tasks 8-9,11,13,15: Terminated
srun: error: nid09797: task 2: Terminated
srun: error: nid09798: tasks 10,12,14: Terminated
srun: Force Terminated StepId=61191894.0

| | 1 Node (8 MPI ranks) | 2 Nodes (16 MPI ranks) | 4 Nodes (32 MPI ranks) |
|---|---|---|---|
| Result | Fails w/ OOM Error | Fails w/ Bus Error | Runs |
| amr.blocking_factor | 128 | 64 | 32 |
| amr.max_grid_size_x | 5760 | 2880 | 1440 |
| amr.max_grid_size_y | 5760 | 5760 | 2880 |

Only the setup asking for a single node with 8 MPI ranks fails with the (expected) OOM error:

WarpX.e<JobID> file:

slurmstepd: error: Detected 1 oom-kill event(s) in StepId=61191824.0. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: nid02519: task 5: Out Of Memory
srun: launch/slurm: _step_signal: Terminating StepId=61191824.0
slurmstepd: error: *** STEP 61191824.0 ON nid02519 CANCELLED AT 2022-07-19T15:24:02 ***
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=61191824.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

ax3l commented 2 years ago

Thanks for the report!

The problem here is that the failure did not happen inside a C++ new allocation, where we could report it ourselves. Instead, the process was stopped by the system (SIGKILL) on Cori, which we cannot handle at that level.

Maybe @kngott or @WeiqunZhang have more thoughts on this; I cannot see an obvious way to handle this in a more user-friendly way right away.