error when using openPMD + BTD?

MaxThevenet commented 3 years ago

I encountered an error with openPMD BTD with this 2D input file when executing e.g.

# compilation
cmake .. -DWarpX_DIMS=2 -DWarpX_OPENPMD=ON -DWarpX_QED=OFF -DWarpX_COMPUTE=NOACC
# Same result without -DWarpX_QED=OFF
# execution
mpirun -np 4 ~/warpx/build/bin/warpx inputs > output.txt

The simulation runs to the end and then crashes at the finalize step with the error message:

libc++abi.dylib: terminating with uncaught exception of type std::runtime_error: [Series] Detected illegal access to iteration that has been closed previously.
SIGABRT
See Backtrace.2.0 file for details
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 2 in communicator MPI COMMUNICATOR 3 DUP FROM 0
with errorcode 6.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------

Backtrace:

===== TinyProfilers ======
main()
WarpX::Evolve()
Diagnostics::FilterComputePackFlush()
FlushFormatOpenPMD::WriteToFile()
WarpXOpenPMDPlot::WriteOpenPMDFields()

I also ran in Debug mode. The reproducer then takes 30 min (instead of 1 min in production mode) but does not provide any additional information. Sometimes, when changing the number of BTD snapshots or the resolution, the problem disappears. However, with this input file the simulation runs long enough for each snapshot to be full (they all have the same size, and increasing the number of time steps from the automatically computed 3391 to 4000 doesn't remove the issue).

The output of the CMake command is here. This could also be a problem with my configuration. I deactivated OpenMP, just in case it could cause issues, but I have the same problem with OpenMP enabled. It would already be very useful if someone tried this reproducer.

ax3l commented 3 years ago

Updated inputs file for WarpX version 21.07-66-gbf7150fa8: inputs.txt

I can reproduce this locally on Ubuntu as well.

I wonder if "Detected illegal access to iteration that has been closed previously." truly is the only error? We recently changed openPMD-api so that this (final) message is no longer thrown when a previous error has already occurred: https://github.com/openPMD/openPMD-api/pull/1018

I quickly re-compiled with cmake -S . -B build -DWarpX_DIMS=2 -DWarpX_OPENPMD=ON -DWarpX_QED=OFF -DWarpX_COMPUTE=NOACC -DWarpX_openpmd_branch=dev, which uses the nearly released openPMD-api 0.14.0. The problem is not related to that issue.

Will dig a bit more, sorry for the tremendous delay.

ax3l commented 3 years ago

The first two lab-frame snapshots are not flushed until we do the final close. Somehow the lab-frame outputs are written in the order 2, 4, 5, 9, ... and then (in FilterComputePackFlushLastTimestep) 0, 1, 2. This "jumping" of lab-frame snapshots also happens with the plotfile output.

RevathiJambunathan commented 3 years ago

[Listing things to check based on offline conversation : Axel and Reva]

[x] Do not force_flush if the last BTD is already written out [not directly related to this issue; it is needed to prevent re-opening a closed file] -> #2148
[ ] In the 2D example, the first two snapshots (0 & 1) are not fully flushed (i.e., not fully filled), while from snapshot 2 onward everything is fully filled and flushed. Need to track down why this is the case. Possibly the domain extent for the first two snapshots does not fit the reconstructed buffer.
[ ] The i-buffers are not flushed contiguously: 2, 4, 7, 9, 12, 14. Possibly caused by rounding issues in the slice reconstruction? This is the same issue as the previous point.
[ ] Optimization: once the last BTD buffer for the ith snapshot (i_buffer) is flushed, we can clear the MultiFab to release memory.

ax3l commented 3 years ago

[ ] Found why some buffers / lab snapshots are not dumped:
For i_buffer=0, the m_buffer_counter[0] peaks out at 249 and is thus never dumped.
i_buffer=3 maxes out at 255 (size: 256)
...

I think that problem comes from BTDiagnostics::PrepareFieldDataForOutput().

The "underfull" buffers are then dumped after the evolve loop via FilterComputePackFlushLastTimestep.
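
To make the observation above concrete, here is a minimal Python sketch (not the WarpX C++ code). It assumes that a buffer is only flushed during the evolve loop once its counter reaches the buffer size, and that buffer 0 has the same size of 256 as buffer 3. The function name buffers_left_for_final_flush is hypothetical.

```python
def buffers_left_for_final_flush(peak_counts, buffer_size):
    """Return the buffer indices whose counter never reaches buffer_size.

    Assumption (not taken from the WarpX source): a buffer is flushed during
    the evolve loop only when its counter equals buffer_size, so any buffer
    that peaks below that value is left for the final flush.
    """
    return [i for i, peak in sorted(peak_counts.items()) if peak < buffer_size]


# Peak counter values reported in this thread; the size of 256 for buffer 0
# is assumed to match the size reported for buffer 3.
peaks = {0: 249, 3: 255}
print(buffers_left_for_final_flush(peaks, 256))  # prints [0, 3]
```

Under these assumptions both buffers stay "underfull" and are only written after the evolve loop.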

ax3l commented 3 years ago

[x] Step 2 is flushed again in FilterComputePackFlushLastTimestep because of the force_flush flag. Its size is m_buffer_counter[2]=0 at that point, and the openPMD backend catches this as a double-write attempt, so the reported error message is thrown.

We could skip m_buffer_counter[i_buffer] of size 0 in the last timestep dump, which would:

skip already written lab snapshots
skip not-even-started lab snapshots (already done automatically with our other counters)

Fix for this aspect in #2148
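
The proposed skip can be sketched as follows. This is an illustrative Python model, not the change made in #2148: final_flush and write_snapshot are hypothetical names, and the counter values for buffers 0 and 1 in the usage example are made up (only m_buffer_counter[2]=0 comes from the discussion above).

```python
def final_flush(buffer_counter, write_snapshot):
    """Flush partially filled buffers after the evolve loop.

    Buffers with a counter of 0 are skipped: they were either already
    written (and closed) or never started, so writing them again would
    touch a closed iteration.
    """
    flushed = []
    for i_buffer, count in enumerate(buffer_counter):
        if count == 0:
            continue  # nothing buffered: skip instead of forcing a flush
        write_snapshot(i_buffer, count)
        buffer_counter[i_buffer] = 0  # mark as written
        flushed.append(i_buffer)
    return flushed


# Hypothetical counters: buffers 0 and 1 are underfull, buffer 2 is empty.
written = []
counters = [249, 100, 0]
print(final_flush(counters, lambda i, n: written.append((i, n))))  # prints [0, 1]
```

With the zero-size check in place, buffer 2 is never handed to the writer a second time, which is the situation that triggered the "illegal access to iteration that has been closed previously" error.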

ECP-WarpX / WarpX

error when using openPMD + BTD? #1915