ECP-WarpX / WarpX

WarpX is an advanced electromagnetic & electrostatic Particle-In-Cell code.

SEGFAULT in amrex::BoxArray::coarsenable from Average::CoarsenAndInterpolate #940

MaxThevenet closed 4 years ago

MaxThevenet commented 4 years ago

Automated test reduced_diags_loadbalancecosts_timers on PR #933 fails due to a segfault in Average::CoarsenAndInterpolate, at the coarsenable call in the ASSERT line below:

    BoxArray ba_tmp = amrex::convert( mf_src.boxArray(), mf_dst.ixType().toIntVect() );
    AMREX_ALWAYS_ASSERT_WITH_MESSAGE( ba_tmp.coarsenable( crse_ratio ),
        "source MultiFab converted to staggering of destination MultiFab is not coarsenable" );

When adding a print of the BoxArray right before the ASSERT line, the code returns

(BoxArray maxbox(16)
       ((0,0,0) (31,31,31) (0,0,0)) ((96,0,0) (127,31,31) (0,0,0)) ((64,0,0) (95,31,31) (0,0,0)) ((32,0,0) (63,31,31) (0,0,0)) ((0,0,96) (31,31,127) (0,0,0)) ((0,0,64) (31,31,95) (0,0,0)) ((0,0,32) (31,31,63) (0,0,0)) ((96,0,96) (127,31,127) (0,0,0)) ((96,0,64) (127,31,95) (0,0,0)) ((96,0,32) (127,31,63) (0,0,0)) ((64,0,96) (95,31,127) (0,0,0)) ((64,0,64) (95,31,95) (0,0,0)) ((64,0,32) (95,31,63) (0,0,0)) ((32,0,96) (63,31,127) (0,0,0)) ((32,0,64) (63,31,95) (0,0,0)) ((32,0,32) (63,31,63) (0,0,0)) )


I do not see anything wrong with this BoxArray or with the coarsening ratio, so at this point I don't understand why coarsenable fails.
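For readers unfamiliar with the check, here is a minimal standalone sketch of what "coarsenable" means for cell-centered boxes like the ones printed above. It does not use AMReX: `Int3`, `CellBox`, `all_positive`, `is_coarsenable`, and `all_coarsenable` are hypothetical names invented for illustration, and the rule shown (lower corner and length divisible by the ratio in every direction) is a simplified assumption, not the library's actual implementation.

```cpp
#include <array>
#include <vector>

// Hypothetical stand-ins for illustration only; these are NOT the AMReX types.
using Int3 = std::array<int, 3>;

struct CellBox {
    Int3 lo;  // inclusive lower corner, e.g. (96,0,64)
    Int3 hi;  // inclusive upper corner, e.g. (127,31,95)
};

// True when every component of p is strictly positive. This mirrors the
// condition named in the IntVect assertion quoted later in this report.
bool all_positive(const Int3& p) {
    return p[0] > 0 && p[1] > 0 && p[2] > 0;
}

// Simplified coarsenability test for one cell-centered box (an assumption,
// not AMReX's code): in every direction the lower corner and the number of
// cells must both be divisible by the coarsening ratio.
bool is_coarsenable(const CellBox& b, const Int3& ratio) {
    if (!all_positive(ratio)) {
        return false;  // a non-positive ratio is never a valid coarsening ratio
    }
    for (int d = 0; d < 3; ++d) {
        const int length = b.hi[d] - b.lo[d] + 1;
        if (b.lo[d] % ratio[d] != 0 || length % ratio[d] != 0) {
            return false;
        }
    }
    return true;
}

// A collection of boxes is coarsenable only if every box is.
bool all_coarsenable(const std::vector<CellBox>& boxes, const Int3& ratio) {
    for (const CellBox& b : boxes) {
        if (!is_coarsenable(b, ratio)) {
            return false;
        }
    }
    return true;
}
```

Under this simplified rule, boxes such as ((0,0,0) (31,31,31)) and ((96,0,64) (127,31,95)) from the print above pass for an assumed ratio of (2,2,2), since every lower corner is even and every side has 32 cells. The actual ratio is not stated in this report. A ratio with a zero or negative component is rejected before any box is inspected.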

To reproduce the issue, I executed the code with

mpirun -np 2 ~/warpx/Bin/main3d.gnu.DEBUG.TPROF.MPI.OMP.ex inputs_loadbalancecosts warpx.do_dynamic_scheduling=0 warpx.serialize_ics=1 algo.load_balance_costs_update=Timers diag1.file_prefix=reduced_diags_loadbalancecosts_timers_plt

and the error is random: so far, either a segfault or (maybe more helpful)

STEP 3 starts ...
STEP 3 ends. TIME = 2.177455318e-10 DT = 7.258184393e-11
Walltime = 12.398333 s; This step = 4.063404 s; Avg. per step = 4.132777667 s
1::Assertion `p.allGT(IntVect::TheZeroVector())' failed, file "/Users/mthevenet/amrex//Src/Base/AMReX_IntVect.H", line 590 !!!

Here's the Backtrace generated.

EZoni commented 4 years ago

I observed that the segfault happens only when running with more than 1 MPI rank. Can you confirm that?

MaxThevenet commented 4 years ago

I agree, good point! Also, it doesn't seem to occur in 2d, but I'm not sure about it.

EZoni commented 4 years ago

It seems to me that the segfault is not caused by the call to the function coarsenable, but rather by the initialization

Array4<Real const> const& arr_src = mf_src.const_array( mfi );

right before the ParallelFor inside Average::CoarsenAndInterpolateLoop. In other words, if I comment out the initialization above and the subsequent ParallelFor (which depends on the initialization of arr_src), the code seems to run without a segfault.

EZoni commented 4 years ago

@RevathiJambunathan Did you mention that by changing the input parameter warpx.load_balance_int in the input file Examples/Tests/reduced_diags/inputs_loadbalancecosts from warpx.load_balance_int=2 to warpx.load_balance_int=1 you get a segfault after 1 iteration instead of 3? I don't observe the same behavior at the moment: if warpx.load_balance_int=1, I still get a segfault after the third step. Did I maybe misunderstand what you are observing?

MaxThevenet commented 4 years ago

@EZoni the segfault probably occurs at the first dump iteration after a load-balance (LB) iteration. I think @RevathiJambunathan and @WeiqunZhang identified the bug; with @mrowan137's help, @RevathiJambunathan is currently working on a fix.
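To make the "first dump after a load-balance step" reasoning concrete, here is a small sketch of the arithmetic. Everything in it is an assumption for illustration: the helper name `first_dump_after_load_balance` is hypothetical, steps are taken to be counted from 1, load balancing and dumps are assumed to happen on multiples of their intervals, and a dump on the same step as a load balance is assumed to count.

```cpp
// Hypothetical helper, not part of WarpX. Assumes load balancing happens on
// steps that are multiples of load_balance_int and dumps happen on steps that
// are multiples of dump_period, with steps counted from 1.
// Returns the first dump step that is at or after the first load-balance
// step, or -1 if either interval is not strictly positive.
int first_dump_after_load_balance(int load_balance_int, int dump_period) {
    if (load_balance_int <= 0 || dump_period <= 0) {
        return -1;
    }
    // Smallest multiple of dump_period that is >= load_balance_int.
    return ((load_balance_int + dump_period - 1) / dump_period) * dump_period;
}
```

Under these assumptions, an illustrative dump period of 3 gives step 3 for both load_balance_int=2 and load_balance_int=1, while a dump period of 1 gives a step equal to load_balance_int. The dump period of 3 is purely illustrative; the thread does not state the value used by the test.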

RevathiJambunathan commented 4 years ago

@EZoni sorry, I was not fully clear when I talked about it this morning. I also changed diag.period = 1 and tried load_balance_int=1,2,3,4 to check whether that was the cause of the error.

EZoni commented 4 years ago

This issue was fixed in #943. The test reduced_diags_loadbalancecosts_timers, which was crashing in #933 before the fix in #943 was merged, now runs successfully.