Closed pordyna closed 6 months ago
Thank you for the detailed report!
This looks like the load balance after restart is really off, with collision physics suddenly dominating the time step. With only 37 steps after restart, it looks like you have not yet reached a load-balancing step.
Can you try to add a load balance directly after restart, e.g.,
algo.load_balance_intervals = 1:1:1,5464:5464,100
and see if that helps?
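For reference, an intervals string like the one above is a comma-separated list of `start:stop:period` slices, where a bare number N is shorthand for "every N steps". Here is a rough, illustrative Python sketch of how such a spec can be matched against a step number (`step_matches` is my own helper for illustration, not WarpX code, and edge cases like step 0 may differ from the actual parser):

```python
def step_matches(spec: str, step: int) -> bool:
    """Check whether `step` is selected by an intervals string like
    "1:1:1,5464:5464,100" (comma-separated start:stop:period slices;
    a bare number N means period N starting from 0)."""
    for part in spec.split(","):
        fields = part.split(":")
        if len(fields) == 1:                      # bare "N" -> every N steps
            start, stop, period = 0, None, int(fields[0])
        else:
            start = int(fields[0]) if fields[0] else 0
            stop = int(fields[1]) if fields[1] else None
            period = int(fields[2]) if len(fields) > 2 and fields[2] else 1
        if period <= 0:
            continue
        in_range = step >= start and (stop is None or step <= stop)
        if in_range and (step - start) % period == 0:
            return True
    return False

# With the suggested spec, load balancing fires at step 1, at step 5464
# (right after the restart), and every 100 steps:
spec = "1:1:1,5464:5464,100"
print(step_matches(spec, 5464))  # -> True
print(step_matches(spec, 5463))  # -> False
```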
Hi, thanks for the suggestion, I will try this out. But the same thing happened when I did not have load balancing enabled at all (unless it is enabled by default?). My understanding is that in that case the simulation would simply continue with the initial domain decomposition?
Unfortunately, this did not change anything.
So, after one of the restarts I managed to write another checkpoint together with the diagnostics. Here are some example fields at step 5499. This doesn't look very good...
To be honest, it looks a bit like the subdomains are being swapped or misaligned during restart.
This looks very broken. Which exact version of WarpX are you using?
Thanks for the input files in the original issue; is there anything more we need to reproduce this restart bug?
So it looks like I forgot to check out the latest release tag and was running from the development branch, specifically from the following commit: https://github.com/ECP-WarpX/WarpX/tree/a9d8126b500e1c7197eb0ed1e52fd50bb09cbdf4. Could this be the problem?
Here is the input file once again. The dumped warpx_used_inputs is missing quotes around the analytical expressions and didn't work for resubmitting, so I am attaching the file as generated from PICMI instead:
inputs_from_picmi.txt
And here is my environment:
perlmutter_gpu_warpx.profile.txt
And here is my dependencies install script
install_gpu_dependencies.sh.txt
I suppose the setup could be somewhat simplified and still reproduce the bug.
Did you checkpoint and restart with the exact same version of WarpX?
Could you please try again with the latest development version, using the same build both for writing the checkpoint and for restarting, and see if it still occurs?
Yes, it was all the same version. I was initially just testing automatic restart. OK, I will recompile and check it.
@ax3l This bug is still there when running on the newest development branch. To be exact, on:
commit 9a017a67e5495263223da42db47657693b25bbd2 (HEAD -> development, origin/development, origin/HEAD)
Author: Eya D <81635404+EyaDammak@users.noreply.github.com>
Date: Fri Feb 23 21:55:45 2024 -0800
before checkpoint:
after checkpoint:
Additionally, I have observed that some of my simulations crash just after writing a checkpoint (but only sometimes) with the following error message:
amrex::Abort::6::CUDA error 700 in file /global/homes/p/pordyna/src/warpx/build_pm_gpu/_deps/fetchedamrex-src/Src/Base/AMReX_GpuDevice.cpp line 598: an illegal memory access was encountered !!!
SIGABRT
See Backtrace.6 file for details
MPICH ERROR [Rank 6] [job id 22064439.0] [Fri Feb 23 05:39:47 2024] [nid002864] - Abort(6) (rank 6 in comm 496): application called MPI_Abort(comm=0x84000001, 6) - process 6
(Those simulations run with the previously mentioned https://github.com/ECP-WarpX/WarpX/tree/a9d8126b500e1c7197eb0ed1e52fd50bb09cbdf4)
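A side note for localizing the CUDA error 700: kernel launches are asynchronous, so the "illegal memory access" abort is often reported at a later synchronization point than the kernel that actually faulted. Setting the standard CUDA runtime variable `CUDA_LAUNCH_BLOCKING=1` serializes launches so the backtrace points at the offending kernel. This slows the run down and is for debugging only; the launch command below is a placeholder, not the actual job script from this issue:

```shell
# Serialize CUDA kernel launches so the abort is reported at the kernel
# that actually faulted (debugging only, noticeably slower).
export CUDA_LAUNCH_BLOCKING=1
# Placeholder launch line -- substitute your real srun/WarpX invocation:
#   srun ./warpx inputs_from_picmi.txt
echo "CUDA_LAUNCH_BLOCKING=$CUDA_LAUNCH_BLOCKING"
```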
The stdouts end like this:
STEP 487415 starts ...
STEP 487415 ends. TIME = 9.151162288e-12 DT = 1.877488852e-17
Evolve time = 59370.8985 s; This step = 0.115401544 s; Avg. per step = 0.1218076967 s
STEP 487416 starts ...
--- INFO : re-sorting particles
--- INFO : Writing openPMD file diags/particles00487416
--- INFO : Writing checkpoint diags/checkpoint00487416
STEP 487416 ends. TIME = 9.151181063e-12 DT = 1.877488852e-17
Evolve time = 59376.77744 s; This step = 5.878937033 s; Avg. per step = 0.1218195083 s
STEP 487417 starts ...
It is always crashing in the next step. It is a bit confusing that it is not the same step (do you write the checkpoint asynchronously with the execution?).
I would say this suggests that the simulation is already writing corrupted checkpoints, probably accessing a wrong part of memory, and that this sometimes results in an illegal memory access.
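To make the per-step slowdown easier to see across a long run with restarts, the `This step = ... s` lines in the stdout excerpt above can be scraped into a time series. A minimal sketch, assuming the exact line format shown in the log above:

```python
import re

# Pull per-step wall-clock times out of WarpX stdout; the pattern matches
# lines like:
#   "Evolve time = 59370.8985 s; This step = 0.115401544 s; Avg. per step = ..."
STEP_RE = re.compile(r"This step = ([0-9.eE+-]+) s")

def step_times(stdout_text: str) -> list[float]:
    return [float(m.group(1)) for m in STEP_RE.finditer(stdout_text)]

log = """\
STEP 487415 ends. TIME = 9.151162288e-12 DT = 1.877488852e-17
Evolve time = 59370.8985 s; This step = 0.115401544 s; Avg. per step = 0.1218076967 s
Evolve time = 59376.77744 s; This step = 5.878937033 s; Avg. per step = 0.1218195083 s
"""
print(step_times(log))  # -> [0.115401544, 5.878937033]
```

Plotting that series (or just grepping for outliers) makes the checkpoint-step spikes and any post-restart jump stand out immediately.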
I think this problem was introduced this month. X-ref #4735
Please use WarpX 24.02 or earlier for now until we fix it. We plan to ship a fix in 24.03 with this PR: https://github.com/AMReX-Codes/amrex/pull/3783
@ax3l Did you mean 24.01, or really 23.01?
I actually meant 24.02
:D
Can you confirm that version was still ok?
Haha no problem, I will check it today and let you know.
@ax3l 24.02 works fine, thanks for solving the issue!
Thanks for confirming!
@pordyna the WarpX 24.03 release as of #4759 should also fix this. Thanks for reporting this before the release! :pray:
Please let us know if 24.03 shows any issues for you and we will reopen this.
Problem with restarting
Hi! I was setting up WarpX simulations on Perlmutter and ran into a somewhat weird problem. My simulations run without any problem, but when I restart them from a checkpoint they become extremely slow (the time per step jumps from 200 ms to 40 s!). I don't know what is happening here. Restarting again after a few more (slow) steps doesn't change anything; it stays around 40 s per step. From the verbose output / profiler it looks like most of the time is spent in collisions. I tried switching on load balancing, as well as disabling the sorting for deposition and switching to sorting into cells (probably better for collisions anyway), but this didn't help.
The change in computation time always happens on the first restart, regardless of the time step, so it can't be due to some rapid change in the physics.
last step before checkpoint
first step after restart
Here are my inputs and the full stdout and stderr: input_output.zip