ECP-WarpX / WarpX

WarpX is an advanced electromagnetic & electrostatic Particle-In-Cell code.
https://ecp-warpx.github.io

cartesian3d: BTD_ReducedSliceDiag Test Unstable #2382

Closed ax3l closed 2 years ago

ax3l commented 2 years ago

In CI, we see that the cartesian3d test BTD_ReducedSliceDiag sporadically crashes. We probably need a local debug-mode build to find out where it crashes.

CC @RemiLehe @RevathiJambunathan

NeilZaim commented 2 years ago

I have tried in debug mode locally but couldn't see a crash. :'( I have also tried with valgrind but nothing popped up.

ax3l commented 2 years ago

Backtrace printing added in https://github.com/ECP-WarpX/regression_testing/pull/16, let's see if we catch it next time

EZoni commented 2 years ago

Here is another backtrace for this from a CI raw log in #2429: BackTrace.txt.

ax3l commented 2 years ago

Saw the same backtrace again in #2479: Backtrace.txt

RevathiJambunathan commented 2 years ago

Saw the same backtrace again in #2530: CI_Backtrace.txt

RevathiJambunathan commented 2 years ago

It does not crash locally for me

RevathiJambunathan commented 2 years ago

A potential fix: #2543

RevathiJambunathan commented 2 years ago

Closed PR #2543 since BTD was not called at initialization anyway. Looking into Source/Diagnostics/BTD_Plotfile_Header_Impl.cpp:40

ax3l commented 2 years ago

Seen again in https://github.com/ECP-WarpX/WarpX/pull/2574#issuecomment-974506256 with the following backtrace:

 5: /lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e911) [0x7fd579967911]
    std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_data(char*) at /usr/include/c++/9/bits/basic_string.h:179
 (inlined by) void std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_construct<char const*>(char const*, char const*, std::forward_iterator_tag) at /usr/include/c++/9/bits/basic_string.tcc:219
 (inlined by) void std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_construct_aux<char const*>(char const*, char const*, std::__false_type) at /usr/include/c++/9/bits/basic_string.h:247

 6: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa38c) [0x7fd57997338c]
    amrex::Box::coarsen(amrex::IntVect const&) at /tmp/ci-oFCLwlOx0A/amrex//Src/Base/AMReX_Box.H:801
 (inlined by) PEC::ApplyPECtoBfield(std::array<amrex::MultiFab*, 3ul>, int, PatchType) at /tmp/ci-oFCLwlOx0A/warpx/./Source/BoundaryConditions/WarpX_PEC.cpp:131

 7: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3f7) [0x7fd5799733f7]
    amrex::coarsen(int, int) at /tmp/ci-oFCLwlOx0A/amrex//Src/Base/AMReX_IntVect.H:28
 (inlined by) amrex::IntVect::coarsen(amrex::IntVect const&) at /tmp/ci-oFCLwlOx0A/amrex//Src/Base/AMReX_IntVect.H:591
 (inlined by) amrex::Box::coarsen(amrex::IntVect const&) at /tmp/ci-oFCLwlOx0A/amrex//Src/Base/AMReX_Box.H:804
 (inlined by) PEC::ApplyPECtoBfield(std::array<amrex::MultiFab*, 3ul>, int, PatchType) at /tmp/ci-oFCLwlOx0A/warpx/./Source/BoundaryConditions/WarpX_PEC.cpp:131

 8: /lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa6a9) [0x7fd5799736a9]
    amrex::coarsen(int, int) at /tmp/ci-oFCLwlOx0A/amrex//Src/Base/AMReX_IntVect.H:32
 (inlined by) amrex::IntVect::coarsen(amrex::IntVect const&) at /tmp/ci-oFCLwlOx0A/amrex//Src/Base/AMReX_IntVect.H:591
 (inlined by) amrex::Box::coarsen(amrex::IntVect const&) at /tmp/ci-oFCLwlOx0A/amrex//Src/Base/AMReX_Box.H:804
 (inlined by) PEC::ApplyPECtoBfield(std::array<amrex::MultiFab*, 3ul>, int, PatchType) at /tmp/ci-oFCLwlOx0A/warpx/./Source/BoundaryConditions/WarpX_PEC.cpp:131

 9: /lib/x86_64-linux-gnu/libstdc++.so.6(_ZSt19__throw_ios_failurePKc+0x91) [0x7fd57996ac23]
    ?? ??:0

10: /lib/x86_64-linux-gnu/libstdc++.so.6(+0x114b72) [0x7fd5799ddb72]
    amrex::PODVector<int, std::allocator<int> >::GetNewCapacity(unsigned long) const at /tmp/ci-oFCLwlOx0A/amrex//Src/Base/AMReX_PODVector.H:504
 (inlined by) amrex::PODVector<int, std::allocator<int> >::resize(unsigned long) at /tmp/ci-oFCLwlOx0A/amrex//Src/Base/AMReX_PODVector.H:445
 (inlined by) amrex::ParticleContainer<0, 0, 4, 0, amrex::PinnedArenaAllocator>::SetParticleSize() at /tmp/ci-oFCLwlOx0A/amrex//Src/Particle/AMReX_ParticleContainerI.H:9

11: ./main3d.gnu.TEST.TPROF.MTMPI.OMP.QED.OPMD.PSATD.GPUCLOCK.ex(+0xfb4a8) [0x55f88d9534a8]
    std::basic_ios<char, std::char_traits<char> >::setstate(std::_Ios_Iostate) at /usr/include/c++/9/bits/basic_ios.h:158
 (inlined by) std::basic_ifstream<char, std::char_traits<char> >::open(char const*, std::_Ios_Openmode) at /usr/include/c++/9/fstream:661
 (inlined by) std::basic_ifstream<char, std::char_traits<char> >::open(char const*, std::_Ios_Openmode) at /usr/include/c++/9/fstream:658
 (inlined by) BTDPlotfileHeaderImpl::ReadHeaderData() at /tmp/ci-oFCLwlOx0A/warpx/./Source/Diagnostics/BTD_Plotfile_Header_Impl.cpp:32

12: ./main3d.gnu.TEST.TPROF.MTMPI.OMP.QED.OPMD.PSATD.GPUCLOCK.ex(+0xef930) [0x55f88d947930]
    BTDPlotfileHeaderImpl::set_timestep(int) at /tmp/ci-oFCLwlOx0A/warpx/./Source/Diagnostics/BTD_Plotfile_Header_Impl.H:83
 (inlined by) BTDiagnostics::InterleaveBufferAndSnapshotHeader(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) at /tmp/ci-oFCLwlOx0A/warpx/./Source/Diagnostics/BTDiagnostics.cpp:739

13: ./main3d.gnu.TEST.TPROF.MTMPI.OMP.QED.OPMD.PSATD.GPUCLOCK.ex(+0xf21b2) [0x55f88d94a1b2]
    std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_is_local() const at /usr/include/c++/9/bits/basic_string.h:222
 (inlined by) std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_dispose() at /usr/include/c++/9/bits/basic_string.h:231
 (inlined by) std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string() at /usr/include/c++/9/bits/basic_string.h:658
 (inlined by) BTDiagnostics::MergeBuffersForPlotfile(int) at /tmp/ci-oFCLwlOx0A/warpx/./Source/Diagnostics/BTDiagnostics.cpp:712

14: ./main3d.gnu.TEST.TPROF.MTMPI.OMP.QED.OPMD.PSATD.GPUCLOCK.ex(+0xf2c26) [0x55f88d94ac26]
    BTDiagnostics::Flush(int) at /tmp/ci-oFCLwlOx0A/warpx/./Source/Diagnostics/BTDiagnostics.cpp:649

15: ./main3d.gnu.TEST.TPROF.MTMPI.OMP.QED.OPMD.PSATD.GPUCLOCK.ex(+0xae888) [0x55f88d906888]
    Diagnostics::FilterComputePackFlush(int, bool) at /tmp/ci-oFCLwlOx0A/warpx/./Source/Diagnostics/Diagnostics.cpp:341

16: ./main3d.gnu.TEST.TPROF.MTMPI.OMP.QED.OPMD.PSATD.GPUCLOCK.ex(+0xad201) [0x55f88d905201]
    MultiDiagnostics::FilterComputePackFlush(int, bool) at /tmp/ci-oFCLwlOx0A/warpx/./Source/Diagnostics/MultiDiagnostics.cpp:74 (discriminator 2)

17: ./main3d.gnu.TEST.TPROF.MTMPI.OMP.QED.OPMD.PSATD.GPUCLOCK.ex(+0x437584) [0x55f88dc8f584]
    WarpX::Evolve(int) at /tmp/ci-oFCLwlOx0A/warpx/./Source/Evolve/WarpXEvolve.cpp:341

18: ./main3d.gnu.TEST.TPROF.MTMPI.OMP.QED.OPMD.PSATD.GPUCLOCK.ex(+0x5e854) [0x55f88d8b6854]
    std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_Alloc_hider::_Alloc_hider(char*, std::allocator<char> const&) at /usr/include/c++/9/bits/basic_string.h:157
 (inlined by) std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(char const*, std::allocator<char> const&) at /usr/include/c++/9/bits/basic_string.h:526
 (inlined by) main at /tmp/ci-oFCLwlOx0A/warpx/./Source/main.cpp:69

19: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7fd57952d0b3]

20: ./main3d.gnu.TEST.TPROF.MTMPI.OMP.QED.OPMD.PSATD.GPUCLOCK.ex(+0x6eb3e) [0x55f88d8c6b3e]
    ?? ??:0

===== TinyProfilers ======
main()
WarpX::Evolve()
WarpX::Evolve::step
Diagnostics::FilterComputePackFlush()

   WARNING: +++ End of backtrace: BTD_ReducedSliceDiag.Backtrace.0.0 +++

   BTD_ReducedSliceDiag CRASHED (backtraces produced)

ax3l commented 2 years ago

I think this is a race condition for BTDiagnostics::MergeBuffersForPlotfile @RevathiJambunathan @atmyers.

Last seen in: https://github.com/ECP-WarpX/WarpX/pull/2300

Stack:

Is it possible that the file does not exist yet? I think before we start merging via BTDiagnostics::MergeBuffersForPlotfile, we must make sure that:

Only the first two steps would be ok for now but are not 100% ideal, because FS-sync != MPI context sync.
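To illustrate the "does the file exist yet?" check, here is a minimal C++17 sketch of a reader that polls until a header file is present and can be opened before reading it. The helper names `wait_until_readable` and `read_first_header_line` are hypothetical and are not part of WarpX or AMReX. The sketch only guards against a missing file. It does not prove that the writing rank has finished, so in an MPI run it would complement, not replace, a synchronization between writers and readers.

```cpp
#include <chrono>
#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>
#include <thread>

// Hypothetical helper (not a WarpX/AMReX function): poll until `path` is a
// regular file that can be opened for reading, or until `timeout` elapses.
// Returns true if the file became readable in time, false otherwise.
bool wait_until_readable (const std::filesystem::path& path,
                          std::chrono::milliseconds timeout,
                          std::chrono::milliseconds poll = std::chrono::milliseconds(10))
{
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (true) {
        std::error_code ec; // non-throwing overload: a missing file is not an error here
        if (std::filesystem::is_regular_file(path, ec)) {
            std::ifstream probe(path);
            if (probe.is_open()) { return true; }
        }
        if (std::chrono::steady_clock::now() >= deadline) { return false; }
        std::this_thread::sleep_for(poll);
    }
}

// Hypothetical reader: returns the first line of the header file, or an empty
// string if the file did not become readable within `timeout`.
std::string read_first_header_line (const std::filesystem::path& header,
                                    std::chrono::milliseconds timeout)
{
    if (!wait_until_readable(header, timeout)) { return {}; }
    std::ifstream is(header);
    std::string line;
    std::getline(is, line);
    return line;
}
```

A caller would invoke `read_first_header_line` on the buffer header path before merging and treat an empty result as "not ready yet" instead of letting the stream throw. As noted above, a file that is visible on the file system is not necessarily complete, which is why this check alone is not 100% ideal.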

ax3l commented 2 years ago

Fix for CI flakiness (race condition between writing MPI ranks and readers) in #2608.

#2608 also contains a to-do to extend this fix to a similar issue that could potentially occur at scale with respect to out-of-sync parallel FS operations (future, when needed).