Closed pordyna closed 6 months ago
Thank you for the detailed report!
This looks like the load balance after restart is really off, with collision physics suddenly dominating the time step. With only 37 steps after restart, it looks like you have not yet reached a load-balancing step.
Can you try to add a load balance directly after restart, e.g.,
algo.load_balance_intervals = 1:1:1,5464:5464,100
and see if that helps?
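For reference, an intervals string like the one above is a comma-separated list of `start:stop:period` slices, where a bare number N is shorthand for "every N steps". Here is a rough, illustrative Python sketch of how such a spec can be matched against a step number (`step_matches` is my own helper for illustration, not WarpX code, and edge cases like step 0 may differ from the actual parser):

```python
def step_matches(spec: str, step: int) -> bool:
    """Check whether `step` is selected by an intervals string like
    "1:1:1,5464:5464,100" (comma-separated start:stop:period slices;
    a bare number N means period N starting from 0)."""
    for part in spec.split(","):
        fields = part.split(":")
        if len(fields) == 1:                      # bare "N" -> every N steps
            start, stop, period = 0, None, int(fields[0])
        else:
            start = int(fields[0]) if fields[0] else 0
            stop = int(fields[1]) if fields[1] else None
            period = int(fields[2]) if len(fields) > 2 and fields[2] else 1
        if period <= 0:
            continue
        in_range = step >= start and (stop is None or step <= stop)
        if in_range and (step - start) % period == 0:
            return True
    return False

# With the suggested spec, load balancing fires at step 1, at step 5464
# (right after the restart), and every 100 steps:
spec = "1:1:1,5464:5464,100"
print(step_matches(spec, 5464))  # -> True
print(step_matches(spec, 5463))  # -> False
```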
Hi, thanks for the suggestion, I will try this out. But the same thing happened when I did not have load balancing enabled at all (unless it is enabled by default?). My understanding is that in that case the simulation would simply continue with the initial domain decomposition?
Unfortunately, this did not change anything.
So, after one of the restarts I managed to write another checkpoint together with the diagnostics. Here are some example fields at step 5499. This doesn't look very good...
To be honest, it looks a bit like the subdomains are being swapped or misaligned during restart.
This looks very broken. Which exact version of WarpX are you using?
Thanks for the input files in the original issue; is there anything more we need to reproduce this restart bug?
So it looks like I forgot to check out the latest release tag and was running from the development branch, specifically from the following commit: https://github.com/ECP-WarpX/WarpX/tree/a9d8126b500e1c7197eb0ed1e52fd50bb09cbdf4. Could this be the problem?
Here is the input file once again. The dumped warpx_used_inputs is missing quotes around the analytical expressions and didn't work for resubmitting, so I am attaching the file as generated from PICMI instead:
inputs_from_picmi.txt
And here is my environment:
perlmutter_gpu_warpx.profile.txt
And here is my dependencies install script
install_gpu_dependencies.sh.txt
I suppose the setup could be somewhat simplified and still reproduce the bug.
Did you checkpoint and restart with the exact same version of WarpX?
Could you please try again with the latest development version, using the same build both for writing the checkpoint and for restarting, and see if it still occurs?
Yes, it was all the same version. I was initially just testing automatic restart. OK, I will recompile and check it.
@ax3l This bug is still there when running on the newest development branch. To be exact, on:
commit 9a017a67e5495263223da42db47657693b25bbd2 (HEAD -> development, origin/development, origin/HEAD)
Author: Eya D <81635404+EyaDammak@users.noreply.github.com>
Date: Fri Feb 23 21:55:45 2024 -0800
before checkpoint:
after checkpoint:
Additionally, I have observed that some of my simulations crash just after writing a checkpoint (but only sometimes) with the following error message:
amrex::Abort::6::CUDA error 700 in file /global/homes/p/pordyna/src/warpx/build_pm_gpu/_deps/fetchedamrex-src/Src/Base/AMReX_GpuDevice.cpp line 598: an illegal memory access was encountered !!!
SIGABRT
See Backtrace.6 file for details
MPICH ERROR [Rank 6] [job id 22064439.0] [Fri Feb 23 05:39:47 2024] [nid002864] - Abort(6) (rank 6 in comm 496): application called MPI_Abort(comm=0x84000001, 6) - process 6
(Those simulations run with the previously mentioned https://github.com/ECP-WarpX/WarpX/tree/a9d8126b500e1c7197eb0ed1e52fd50bb09cbdf4)
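A side note for localizing the CUDA error 700: kernel launches are asynchronous, so the "illegal memory access" abort is often reported at a later synchronization point than the kernel that actually faulted. Setting the standard CUDA runtime variable `CUDA_LAUNCH_BLOCKING=1` serializes launches so the backtrace points at the offending kernel. This slows the run down and is for debugging only; the launch command below is a placeholder, not the actual job script from this issue:

```shell
# Serialize CUDA kernel launches so the abort is reported at the kernel
# that actually faulted (debugging only, noticeably slower).
export CUDA_LAUNCH_BLOCKING=1
# Placeholder launch line -- substitute your real srun/WarpX invocation:
#   srun ./warpx inputs_from_picmi.txt
echo "CUDA_LAUNCH_BLOCKING=$CUDA_LAUNCH_BLOCKING"
```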
The stdouts end like this:
STEP 487415 starts ...
STEP 487415 ends. TIME = 9.151162288e-12 DT = 1.877488852e-17
Evolve time = 59370.8985 s; This step = 0.115401544 s; Avg. per step = 0.1218076967 s
STEP 487416 starts ...
--- INFO : re-sorting particles
--- INFO : Writing openPMD file diags/particles00487416
--- INFO : Writing checkpoint diags/checkpoint00487416
STEP 487416 ends. TIME = 9.151181063e-12 DT = 1.877488852e-17
Evolve time = 59376.77744 s; This step = 5.878937033 s; Avg. per step = 0.1218195083 s
STEP 487417 starts ...
It is always crashing in the next step. It is a bit confusing that it is not the same step (do you write the checkpoint asynchronously with the execution?).
I would say this suggests that the simulation is already writing corrupted checkpoints, probably accessing a wrong part of memory, and that this sometimes results in an illegal memory access.
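To make the per-step slowdown easier to see across a long run with restarts, the `This step = ... s` lines in the stdout excerpt above can be scraped into a time series. A minimal sketch, assuming the exact line format shown in the log above:

```python
import re

# Pull per-step wall-clock times out of WarpX stdout; the pattern matches
# lines like:
#   "Evolve time = 59370.8985 s; This step = 0.115401544 s; Avg. per step = ..."
STEP_RE = re.compile(r"This step = ([0-9.eE+-]+) s")

def step_times(stdout_text: str) -> list[float]:
    return [float(m.group(1)) for m in STEP_RE.finditer(stdout_text)]

log = """\
STEP 487415 ends. TIME = 9.151162288e-12 DT = 1.877488852e-17
Evolve time = 59370.8985 s; This step = 0.115401544 s; Avg. per step = 0.1218076967 s
Evolve time = 59376.77744 s; This step = 5.878937033 s; Avg. per step = 0.1218195083 s
"""
print(step_times(log))  # -> [0.115401544, 5.878937033]
```

Plotting that series (or just grepping for outliers) makes the checkpoint-step spikes and any post-restart jump stand out immediately.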
I think this problem was introduced this month. X-ref #4735
Please use WarpX 24.02 or earlier for now until we fix it. We plan to ship a fix in 24.03 with this PR: https://github.com/AMReX-Codes/amrex/pull/3783
@ax3l Did you mean 24.01, or really 23.01?
I actually meant 24.02
:D
Can you confirm that version was still ok?
Haha no problem, I will check it today and let you know.
@ax3l 24.02 works fine, thanks for solving the issue!
Thanks for confirming!
@pordyna the WarpX 24.03 release as of #4759 should also fix this. Thanks for reporting this before the release! :pray:
Please let us know if 24.03 shows any issues for you and we will reopen this.
Problem with restarting
Hi! I was setting up WarpX simulations on Perlmutter and ran into a somewhat weird problem. My simulations run without any problem, but when I restart them from a checkpoint they become extremely slow (the time per step jumps from 200 ms to 40 s!). I don't know what is happening here. Restarting again after a few more (slow) steps doesn't change anything; it stays around 40 s per step. From the verbose output / profiler it looks like most of the time is spent in collisions. I tried switching on load balancing, as well as disabling the sorting for deposition and switching to sorting into cells (probably better for collisions anyway), but this didn't help.
The change in computation time always happens on the first restart, regardless of the time step, so it can't be due to some rapid change in the physics.
last step before checkpoint
first step after restart
Here are my inputs and the full stdout and stderr: input_output.zip