Open n01r opened 2 years ago
Thanks for the report!
The problem here is that we did not fail in a new
but got caught by the system (SIGKILL) on Cori, which we cannot handle at that level.
Maybe @kngott or @WeiqunZhang have more thoughts on this; I cannot see an obvious way to handle this in a more user-friendly way right away.
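To illustrate why a SIGKILL sent by the system cannot be handled inside the application, while a bus error (SIGBUS) can in principle be intercepted, here is a minimal Python sketch. It assumes a POSIX system such as Linux and must run in the main thread. The helper `can_install_handler` is a hypothetical name used only for this illustration; it is not part of WarpX or AMReX.

```python
import signal


def can_install_handler(signum):
    """Return True if the OS lets this process install a handler for signum.

    Hypothetical helper for illustration only; assumes a POSIX system.
    """
    def _noop(received_signum, frame):
        # Placeholder handler; it is never expected to run in this probe.
        pass

    try:
        previous = signal.signal(signum, _noop)
    except OSError:
        # The kernel refuses handlers for SIGKILL (and SIGSTOP) with EINVAL.
        return False

    # Restore the earlier disposition so the probe has no lasting side effects.
    # `previous` is None if the old handler was not installed from Python.
    signal.signal(signum, previous if previous is not None else signal.SIG_DFL)
    return True


if __name__ == "__main__":
    for sig in (signal.SIGBUS, signal.SIGSEGV, signal.SIGKILL):
        print(sig.name, can_install_handler(sig))
```

On Linux this should show that a handler can be registered for SIGBUS and SIGSEGV but not for SIGKILL, which is why a job killed by the batch system or the kernel's OOM killer leaves the application no chance to print a friendlier message.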
I encountered a certain case on Cori KNL nodes where a WarpX setup that likely ran out of memory reported a `bus error` instead.
The binary was compiled as `warpx.2d.MPI.OMP.DP.PDP.OPMD.PSATD.QED`.
Input deck: input_sets.zip
There are 3 setups running the same 2D box size on one, two or four KNL nodes.
The submit file `debug_2nodes.sbatch` fails with the following errors on 2 KNL nodes.
The only output in `output.txt` reads:
The `WarpX.e<JobID>` file shows the bus error:
amr.max_grid_size_x = 5760
amr.max_grid_size_y = 5760
amr.max_grid_size_x = 2880
amr.max_grid_size_y = 5760
amr.max_grid_size_x = 1440
amr.max_grid_size_y = 2880
Only the setup asking for a single node with 8 MPI ranks fails with the (expected) OOM error:
`WarpX.e<JobID>` file: