ECP-WarpX / WarpX

WarpX is an advanced electromagnetic & electrostatic Particle-In-Cell code.
https://ecp-warpx.github.io

SEGFAULT in amrex::BoxArray::coarsenable from Average::CoarsenAndInterpolate #940

Closed: MaxThevenet closed this issue 4 years ago

MaxThevenet commented 4 years ago

The automated test reduced_diags_loadbalancecosts_timers on PR #933 fails with a segfault in Average::CoarsenAndInterpolate, at the coarsenable call in the assertion below:

    BoxArray ba_tmp = amrex::convert( mf_src.boxArray(), mf_dst.ixType().toIntVect() );
    AMREX_ALWAYS_ASSERT_WITH_MESSAGE( ba_tmp.coarsenable( crse_ratio ),
        "source MultiFab converted to staggering of destination MultiFab is not coarsenable" );

When adding

    Print()<<ba_tmp<<'\n';
    Print()<<crse_ratio<<'\n';

right before the assertion, the code prints

(BoxArray maxbox(16)
       m_ref->m_hash_sig(0)
       ((0,0,0) (31,31,31) (0,0,0)) ((96,0,0) (127,31,31) (0,0,0)) ((64,0,0) (95,31,31) (0,0,0)) ((32,0,0) (63,31,31) (0,0,0)) ((0,0,96) (31,31,127) (0,0,0)) ((0,0,64) (31,31,95) (0,0,0)) ((0,0,32) (31,31,63) (0,0,0)) ((96,0,96) (127,31,127) (0,0,0)) ((96,0,64) (127,31,95) (0,0,0)) ((96,0,32) (127,31,63) (0,0,0)) ((64,0,96) (95,31,127) (0,0,0)) ((64,0,64) (95,31,95) (0,0,0)) ((64,0,32) (95,31,63) (0,0,0)) ((32,0,96) (63,31,127) (0,0,0)) ((32,0,64) (63,31,95) (0,0,0)) ((32,0,32) (63,31,63) (0,0,0)) )

(1,1,1)

I do not see anything wrong with this BoxArray or with the coarsening ratio, so I do not understand why coarsenable fails here.

To reproduce the issue, I executed the code with

mpirun -np 2 ~/warpx/Bin/main3d.gnu.DEBUG.TPROF.MPI.OMP.ex inputs_loadbalancecosts warpx.do_dynamic_scheduling=0 warpx.serialize_ics=1 algo.load_balance_costs_update=Timers diag1.file_prefix=reduced_diags_loadbalancecosts_timers_plt

The error is not deterministic: so far it is either a segfault or (perhaps more helpfully) the following assertion failure:

STEP 3 starts ...
STEP 3 ends. TIME = 2.177455318e-10 DT = 7.258184393e-11
Walltime = 12.398333 s; This step = 4.063404 s; Avg. per step = 4.132777667 s
1::Assertion `p.allGT(IntVect::TheZeroVector())' failed, file "/Users/mthevenet/amrex//Src/Base/AMReX_IntVect.H", line 590 !!!
SIGABRT

Here is the backtrace that was generated.

EZoni commented 4 years ago

I observed that the segfault happens only when running with more than one MPI rank. Can you confirm that?

MaxThevenet commented 4 years ago

I agree, good point! Also, it doesn't seem to occur in 2D, but I'm not sure about that.

EZoni commented 4 years ago

It seems to me that the segfault is not caused by the call to the function coarsenable, but rather by the initialization

    Array4<Real const> const& arr_src = mf_src.const_array( mfi );

right before the ParallelFor inside Average::CoarsenAndInterpolateLoop. In other words, when that initialization and the subsequent ParallelFor (which depends on arr_src) are commented out, the code seems to run without a segfault.
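
For reference, the structure being described is roughly the standard AMReX pattern sketched below (a schematic only, not the actual Average::CoarsenAndInterpolateLoop; the ParallelFor body is a placeholder, and the sketch assumes mf_dst and mf_src share the same DistributionMapping so that an MFIter built on one can index into the other):

    // Schematic of the loop pattern under discussion (placeholder body).
    for (amrex::MFIter mfi(mf_dst); mfi.isValid(); ++mfi)
    {
        const amrex::Box& bx = mfi.validbox();
        amrex::Array4<amrex::Real>       const& arr_dst = mf_dst.array(mfi);
        amrex::Array4<amrex::Real const> const& arr_src = mf_src.const_array(mfi); // reported culprit
        amrex::ParallelFor(bx, [=] AMREX_GPU_DEVICE (int i, int j, int k) noexcept
        {
            // placeholder: the real loop averages fine arr_src values into arr_dst
            arr_dst(i,j,k) = arr_src(i*crse_ratio[0], j*crse_ratio[1], k*crse_ratio[2]);
        });
    }

If the two layouts no longer correspond index for index (for example after a load-balance redistribution, as discussed below), const_array(mfi) can end up referring to data this rank does not actually own, which would be consistent with a crash at that initialization or inside the ParallelFor.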

EZoni commented 4 years ago

@RevathiJambunathan Did you mention that by changing the input parameter warpx.load_balance_int in the input file Examples/Tests/reduced_diags/inputs_loadbalancecosts from warpx.load_balance_int=2 to warpx.load_balance_int=1 you get a segfault after 1 iteration instead of 3? I do not observe the same behavior at the moment: with warpx.load_balance_int=1, I still get a segfault after the third step. Did I misunderstand what you are observing?

MaxThevenet commented 4 years ago

@EZoni the segfault probably occurs at the first dump iteration after a load-balance iteration. I think @RevathiJambunathan and @WeiqunZhang have identified the bug, and @RevathiJambunathan is currently working on a fix with @mrowan137's help.

RevathiJambunathan commented 4 years ago

@EZoni sorry -- I was not fully clear when I talked about it this morning. I also changed diag.period = 1 and played with load_balance_int=1,2,3,4 to confirm whether that was the cause of the error.

EZoni commented 4 years ago

This issue was fixed in #943. The test reduced_diags_loadbalancecosts_timers, which was crashing in #933 before the fix in #943 was merged, now runs successfully.