ECP-WarpX / WarpX

WarpX is an advanced electromagnetic & electrostatic Particle-In-Cell code.
https://ecp-warpx.github.io

WarpX hanging on Lassen #4185

Closed bzdjordje closed 1 year ago

bzdjordje commented 1 year ago

There were frequent hangs with WarpX with the last version, and with the recent release they seem to have become persistent. I checked with a computer scientist at Livermore: analyzing an ongoing run with a stack-trace analysis tool we have, we concluded that most tasks were getting tied up in MPI communications (PAMI?). Attached are example STAT traces as well as the files for the run in question.

output.txt warpx_rz_stat_00.pdf warpx_rz_stat_01.pdf warpx_rz_stat_02.pdf warpx_rz_stat_03.pdf warpx_rz_stat_04.pdf warpx_rz_stat_05.pdf warpx_rz_stat_check.pdf WarpXe.5125616.txt WarpXo.5125616.txt

dpgrote commented 1 year ago

I have access to Lassen and can take a look. What modules are you using when you build WarpX?

joshualudwig8 commented 1 year ago

I have also been experiencing frequent hangs with WarpX with the latest version on Lassen. I use the instructions from here to compile it. The hangs I experience always happen during outputs, but otherwise appear to be random. (I ran the same simulation twice and both times it hung on the same output; I reran the simulation a third time with warpx.random_seed = random and it ran fine.)

ax3l commented 1 year ago

Oh, this looks like they may have updated their OpenMPI and it has new bugs (or we use new functionality that triggers a dormant MPI bug). The IBM-patched OpenMPI on Lassen is very buggy, unfortunately. Check out all these work-arounds...: https://warpx.readthedocs.io/en/latest/install/hpc/lassen.html#v100-gpus-16gb

@bzdjordje did they indicate if there was an OpenMPI change (or libfabric update)?

Do I see correctly that they hang in MPI_Allgather now in your error traces? Try adding export OMPI_MCA_coll_ibm_skip_allgather=true to the job script, alongside the existing very similar line:

# ...
# Work-around for broken IBM "libcollectives" MPI_Allgatherv and MPI_Allgather
#   https://github.com/ECP-WarpX/WarpX/pull/2874
export OMPI_MCA_coll_ibm_skip_allgatherv=true
export OMPI_MCA_coll_ibm_skip_allgather=true  # add this

I suspect MPI_Allgather and MPI_Allgatherv are both broken thanks to IBM patches... and a change in AMReX now triggers another bug in IBM's libcollectives library.

(MPI_Allgather is used in amrex::DistributionMapping::LeastUsedCPUs, it seems. cc @WeiqunZhang @atmyers)
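
For reference, here is a minimal sketch of where these exports sit in a Lassen LSF submission script; node count, walltime, resource-set layout, and the executable/inputs names are placeholders, not taken from this issue:

#!/bin/bash
#BSUB -nnodes 2          # placeholder node count
#BSUB -W 0:30            # placeholder walltime
#BSUB -q pbatch          # Lassen batch queue
#BSUB -J warpx_rz

# Work-around for broken IBM "libcollectives" MPI_Allgatherv and MPI_Allgather
#   https://github.com/ECP-WarpX/WarpX/pull/2874
export OMPI_MCA_coll_ibm_skip_allgatherv=true
export OMPI_MCA_coll_ibm_skip_allgather=true

# placeholder resource-set layout and binary name
jsrun -r 4 -a 1 -g 1 -c 7 ./warpx.rz inputs_rz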

ax3l commented 1 year ago

@bzdjordje @joshualudwig8 does that help?

Please also try disabling the work-arounds altogether as a second test:

# disable work-arounds in your submission scripts
#export OMPI_MCA_coll_ibm_skip_allgatherv=true
#export OMPI_MCA_coll_ibm_skip_allgather=true

ax3l commented 1 year ago

@atmyers I am checking the IBM docs: https://www.ibm.com/docs/en/SSZTET_EOS/eos/guide_101.pdf

Not sure there is an OMPI_MCA_coll_ibm_skip_allgather (without the v). But there is clearly a libcollectives call to AllGather... Let's see if that switches back to the OpenMPI default...

If this does not help, we might need to set another MCA parameter to disable the PAMI interfaces altogether if they are too buggy on Lassen.
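
For reference, MCA parameters can go into the job script as OMPI_MCA_* environment variables, and excluding whole components is the heavier hammer; the component names in the commented lines below are my assumption of how the IBM stack names them and would need to be verified on Lassen:

# environment-variable form, matching the existing work-arounds
export OMPI_MCA_coll_ibm_skip_allgather=true
# excluding whole components (component names are assumptions):
#export OMPI_MCA_coll=^ibm    # skip the IBM collectives component entirely
#export OMPI_MCA_pml=^pami    # avoid the PAMI point-to-point layer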

ax3l commented 1 year ago

We got another possible option from support that we can try:

On examining some of the STAT traces I see the involvement of Shmem. Perhaps you could try:

-M '-x PAMI_IBV_DISABLE_SHMEM=1 -mca coll_ibm_skip_allgatherv 1'

that would go into jsrun.
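
A sketch of how that might look on a full jsrun line; the resource-set layout and the executable/inputs names are placeholders:

jsrun -r 4 -a 1 -g 1 -c 7 \
      -M '-x PAMI_IBV_DISABLE_SHMEM=1 -mca coll_ibm_skip_allgatherv 1' \
      ./warpx.rz inputs_rz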

In parallel, since the system is being upgraded to RHEL-8 right now, I will write a new set of modules for the new OS. It comes with a new MPI version, so maybe this is fixed in those newer versions :)

bzdjordje commented 1 year ago

Hi @ax3l, sorry for the delay. I submitted a job that was stuck in the queue all of yesterday, with and without export OMPI_MCA_coll_ibm_skip_allgather=true, and both runs hung after 7 steps like before. I will set up another batch of test simulations with the additional flags you've listed here. Also, thank you for submitting an LC ticket; I am following that now.

Cheers, Blagoje

atmyers commented 1 year ago

I tried to reproduce this issue on Summit. I used the attached inputs deck and job submission script, and otherwise followed the instructions here. Rather than a hang, I actually get a segfault during the same step. Re-running with assertions enabled, I get:

63::Assertion `amrex::numParticlesOutOfRange(pti, range) == 0' failed, file "/ccs/home/atmyers2/WarpX/Source/Particles/WarpXParticleContainer.cpp", line 358, Msg: "Particles shape does not fit within tile (CPU) or guard cells (GPU) used for current deposition"

This suggests a particle has moved farther than expected during a time step. One reason this could happen is if the fields have gone haywire due to an instability, causing unphysical particle velocities. Is there perhaps something weird going on with the field values prior to this? This out-of-bounds write is probably also responsible for the hang on Lassen.

script.sh.txt inputs.txt
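
For anyone reproducing this, a sketch of rebuilding with assertions active; a Debug build is one way to get the numParticlesOutOfRange check above to fire, and the build directory name is just a placeholder:

# reconfigure and rebuild WarpX with assertions enabled via a Debug build
cmake -S . -B build_debug -DWarpX_DIMS=RZ -DCMAKE_BUILD_TYPE=Debug
cmake --build build_debug -j 8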

ax3l commented 1 year ago

Thank you @atmyers and @bzdjordje !

Andrew, can you rerun with signaling NaNs? That way we would see if some uninitialized fields are being read.
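
For reference, a sketch of the AMReX runtime flags usually meant by this, appended as command-line overrides; treat the launcher layout and the executable/inputs names as placeholders:

# initialize FABs with signaling NaNs and trap invalid floating-point operations
jsrun -r 4 -a 1 -g 1 ./warpx.rz inputs_rz fab.init_snan=1 amrex.fpe_trap_invalid=1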

ax3l commented 1 year ago

@bzdjordje uses PSATD + RZ. I think this is the same issue as #4176: multi-MPI-rank RZ+PSATD builds up invalid fields. A comms issue?

ax3l commented 1 year ago

@joshualudwig8 what is your geometry and number of MPI ranks, regarding your comment above? https://github.com/ECP-WarpX/WarpX/issues/4185#issuecomment-1681195331

ax3l commented 1 year ago

Running @bzdjordje 's input file:

**** WARNINGS ******************************************************************
* GLOBAL warning list  after  [ FIRST STEP ]
*
* --> [!  ] [PML] [raised 4 times]
*     Using PSATD together with PML may lead to instabilities if the plasma
*     touches the PML region. It is recommended to leave enough empty space
*     between the plasma boundary and the PML region.
*     @ Raised by: ALL
*
********************************************************************************

Does the plasma touch the boundary?

Changing the boundary to

boundary.field_hi = none damped

still shows the issue (also does not fix #4176)

joshualudwig8 commented 1 year ago

I was doing LWFA in the boosted frame with ionization and the ckc solver (128 grids on 128 devices). My issue could be different from the one Blagoje is having; Blagoje and I had previously discussed that we were both getting hangs, so I figured I would post here too. I was initially thinking this could be related to the particle output that Reva is working on (but I am not sure). Some input parameters are below:

amr.n_cell = 512 16384
amr.max_grid_size_x = 256
amr.max_grid_size_y = 256
amr.max_grid_size_z = 256
amr.blocking_factor_x = 256
amr.blocking_factor_y = 256
amr.blocking_factor_z = 256
geometry.coord_sys = 0
geometry.dim = 2
boundary.field_lo = pml pml
boundary.field_hi = pml pml
warpx.verbose = 1
algo.current_deposition = esirkepov
algo.charge_deposition = standard
algo.field_gathering = energy-conserving
algo.particle_pusher = vay
algo.maxwell_solver = ckc
warpx.cfl = 0.999
algo.particle_shape = 3
warpx.do_moving_window = 1
warpx.moving_window_dir = z
warpx.moving_window_v = 1
warpx.gamma_boost = 20.
warpx.boost_direction = z
warpx.use_filter = 1
particles.use_fdtd_nci_corr = 1
diagnostics.diags_names = diag1
diag1.buffer_size = 128
diag1.file_min_digits = 8
diag1.diag_type = BackTransformed
diag1.do_back_transformed_fields = 1
diag1.num_snapshots_lab = 401
diag1.dz_snapshots_lab = 8e-3
diag1.format = openpmd
diag1.openpmd_backend = h5
diag1.fields_to_plot = Ex Ey Ez rho

ax3l commented 1 year ago

Thanks, Josh! OK, we can check that with the new RHEL-8 software for you. Btw, do you see warnings about your inputs? We do not use geometry.coord_sys = 0 anymore :)

@bzdjordje for your problem, the issue goes away for me if I comment out the following line or set it to zero:

warpx.do_single_precision_comms = 0

Does this help in your case, too?
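
If editing the input deck is inconvenient, an AMReX-style command-line override should do the same thing; the launcher layout and the executable/inputs names are placeholders:

# override the parameter at launch without touching the inputs file
jsrun -r 4 -a 1 -g 1 ./warpx.rz inputs_rz warpx.do_single_precision_comms=0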

joshualudwig8 commented 1 year ago

I don't see any warnings for geometry.coord_sys = 0, but I do see that geometry.dim = 2 is unused.

ax3l commented 1 year ago

@joshualudwig8 I think it should read geometry.dims = 2 with an s as in the compile option: https://warpx.readthedocs.io/en/latest/usage/parameters.html#setting-up-the-field-mesh

bzdjordje commented 1 year ago

@ax3l I commented out #warpx.do_single_precision_comms = 1 but I still got a hung simulation. A minor difference is that, as opposed to hanging repeatedly at STEP 7, it now hangs at STEP 29. I just did a repeat and got another hang, also at STEP 29. Also, somewhat strangely, I requested 3 hrs of walltime, but the simulation terminated after ~30 minutes. Below are the WarpX and output logs; they should be the same as what I sent you earlier, just with the line commented out.

warpx_hang.zip

RemiLehe commented 1 year ago

Thanks for the info.

I am able to reproduce the SegFault mentioned by @atmyers, but on CPU using only one MPI rank and one OpenMP thread on my local machine.

RemiLehe commented 1 year ago

@bzdjordje Could you remove these lines from your script:

warpx.grid_type = hybrid
warpx.do_current_centering = 1

(These options are not supported in RZ; we should add an error message in WarpX.)

In addition, could you also replace this line:

warpx.filter_npass_each_dir = 0 1

with

warpx.filter_npass_each_dir = 1 1

(This is better adapted for PSATD simulations, which do support filtering along the r direction.)
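
Putting the three suggested edits together, the affected portion of the input deck would read as follows (everything else stays unchanged):

# removed: not supported in RZ
#warpx.grid_type = hybrid
#warpx.do_current_centering = 1

# PSATD supports filtering along r, so filter in both directions
warpx.filter_npass_each_dir = 1 1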

RemiLehe commented 1 year ago

With the above-mentioned changes, the code now crashes at STEP 257, in the BTD code, with the following call stack:

===== TinyProfilers ======
main()
REG::WarpX::Evolve()
WarpX::Evolve()
WarpX::Evolve::step
Diagnostics::FilterComputePackFlush()
FlushFormatOpenPMD::WriteToFile()
WarpXOpenPMDPlot::WriteOpenPMDParticles()
ParticleContainer::copyParticles
ParticleContainer::addParticles

ax3l commented 1 year ago

X-ref:

Both patches will be needed.

RemiLehe commented 1 year ago

OK, after more investigation by @ax3l @WeiqunZhang (thanks for your help!), it seems that the bug I saw 3 days ago might be specific to Summit (where I was running this test).

@bzdjordje Would you be able to retry your simulation with the above-mentioned changes to your input script (i.e., removing warpx.grid_type = hybrid and warpx.do_current_centering = 1, and setting warpx.filter_npass_each_dir = 1 1)?

ax3l commented 1 year ago

The additional Summit bug, a compiler-bug-triggered segfault with NVCC 11.3 during copyParticles in preparation of the BTD diagnostics, will be worked around via https://github.com/AMReX-Codes/amrex/pull/3510

bzdjordje commented 1 year ago

@RemiLehe @ax3l It seems that with the changes to the filter, hybrid grid, and current centering, the simulation is able to run now, thank you!

ax3l commented 1 year ago

That is awesome, thanks a lot for testing and confirming!

Please do not hesitate to open an issue if you run into any other problems in the future.

With the bugs from last week fixed, I will close this and focus on the RHEL8 update and Python support on Lassen next :)