AMReX-Astro / Castro

Castro (Compressible Astrophysics): An adaptive mesh, astrophysical compressible (radiation-, magneto-) hydrodynamics simulation code for massively parallel CPU and GPU architectures.
http://amrex-astro.github.io/Castro

wdmerger is non-deterministic when using CUDA #679

Closed · maxpkatz closed this 4 years ago

maxpkatz commented 4 years ago

Running wdmerger when compiled with CUDA on Summit shows non-determinism. If executed with

jsrun -n 1 -a 1 -c 1 -g 1 ./Castro3d.pgi.MPI.CUDA.ex inputs_collision amr.plot_files_output=1 amr.plot_int=1 castro.limit_fluxes_on_small_dens=0

several times in a row (where inputs_collision comes from #646), the results are occasionally substantially different (e.g. O(0.001) differences reported by fcompare). The effect seems to disappear (or at least become much rarer) when gravity.max_multipole_order = 0.

maxpkatz commented 4 years ago

The effect is also not present (or at least mitigated) if castro.source_term_predictor=0.

maxpkatz commented 4 years ago

cuda-memcheck --tool racecheck reports a race condition in the AMReX reduction code:

========= ERROR: Race reported between Read access at 0x00000590 in amrex_fort_module_amrex_reduce_add_device_
=========     and Write access at 0x00000520 in amrex_fort_module_amrex_reduce_add_device_ [16384 hazards]
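
For context, this is the generic shape of the read/write hazard that racecheck reports for a block-level reduction. The sketch below is illustrative only and is not the AMReX Fortran routine named above; it assumes 256 threads per block and deliberately omits the __syncthreads() that belongs inside the loop.

__global__ void racy_block_sum (const double* in, double* out, int n)
{
    // Illustrative only (not AMReX code): shared-memory tree reduction.
    __shared__ double s[256];
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;
    s[tid] = (i < n) ? in[i] : 0.0;
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) {
            s[tid] += s[tid + stride];
        }
        // BUG: a __syncthreads() is needed here. Without it, the read of
        // s[tid + stride] in the next iteration can race with the write of
        // that slot by a warp still finishing this iteration -- the same
        // read/write hazard pattern racecheck reports above.
    }

    if (tid == 0) { out[blockIdx.x] = s[0]; }
}
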
maxpkatz commented 4 years ago

The AMReX reduction race condition has been resolved, but this effect is still present. It definitely goes away if Poisson gravity is not used. If we're doing Poisson gravity, then it is present even if we're fully periodic, which suggests an MLMG issue.

WeiqunZhang commented 4 years ago

Where is inputs_collision?

maxpkatz commented 4 years ago

https://github.com/AMReX-Astro/Castro/files/3449821/inputs_collision.txt https://github.com/AMReX-Astro/Castro/files/3449822/probin_collision.txt

This can also be demonstrated with evrard_collapse, using inputs.test amr.max_level=0 max_step=100.

WeiqunZhang commented 4 years ago

I can reproduce what you saw with evrard_collapse on my desktop without MPI. But if I make the following change in Castro:

--- a/Source/gravity/Gravity.cpp
+++ b/Source/gravity/Gravity.cpp
@@ -1474,6 +1474,8 @@ Gravity::init_multipole_grav()
 void
 Gravity::fill_multipole_BCs(int crse_level, int fine_level, const Vector<MultiFab*>& Rhs, MultiFab& phi)
 {
+    amrex::Gpu::LaunchSafeGuard lsg(false);
+
     // Multipole BCs only make sense to construct if we are starting from the coarse level.

     BL_ASSERT(crse_level == 0);

then it's deterministic (at least after 100 steps). I think the problem is ReduceSum in that function.
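
For what it's worth, a device ReduceSum being non-deterministic is expected behavior rather than a logic bug: floating-point addition is not associative, and the order in which partial sums combine on the GPU can vary from run to run, whereas the LaunchSafeGuard(false) above presumably forces that region back onto the host, where the sum is accumulated in a fixed order. A minimal host-side sketch (not AMReX code) of the order dependence:

#include <cstddef>
#include <cstdio>
#include <random>
#include <vector>

int main ()
{
    // Fill an array with values whose exact sum depends on summation order.
    std::vector<double> x(1000000);
    std::mt19937_64 rng(42);
    std::uniform_real_distribution<double> dist(-1.0, 1.0);
    for (auto& v : x) { v = dist(rng); }

    // Sum the same values in two different orders.
    double forward = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) { forward += x[i]; }

    double reversed = 0.0;
    for (std::size_t i = x.size(); i-- > 0; ) { reversed += x[i]; }

    // The two results typically differ in the last few bits; fed into a
    // sensitive boundary-condition calculation, such differences can grow.
    std::printf("forward  = %.17g\nreversed = %.17g\ndiff     = %g\n",
                forward, reversed, forward - reversed);
}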

maxpkatz commented 4 years ago

Currently the issue with the multipole BCs seems to be substantial numerical sensitivity: accumulated floating-point roundoff error can add up to very divergent outcomes. For example, replacing

r**(-l-1)

with

r**(-1)

(when l == 0) in ca_put_multipole_phi results in an O(1e-3) difference in the density after 100 steps of evrard_collapse, even for pure CPU code compiled with PGI.
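
As a small sketch of that sensitivity (hypothetical code, not the Castro source): r**(-l-1) with l == 0 and r**(-1) are mathematically identical, but depending on how the compiler and math library evaluate the power they can differ in the last bit, and the O(1e-3) change after 100 steps shows how strongly such roundoff-level differences get amplified.

#include <algorithm>
#include <cmath>
#include <cstdio>

int main ()
{
    const int l = 0;
    double max_rel_diff = 0.0;
    for (double r = 0.1; r < 10.0; r += 0.0137) {
        double a = std::pow(r, static_cast<double>(-l - 1)); // r**(-l-1)
        double b = 1.0 / r;                                  // r**(-1)
        max_rel_diff = std::max(max_rel_diff, std::fabs(a - b) / std::fabs(b));
    }
    // Any nonzero value here is at roundoff level, yet the issue above shows
    // differences of this size changing the density at O(1e-3) over 100 steps.
    std::printf("max relative difference: %g\n", max_rel_diff);
}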

maxpkatz commented 4 years ago

Closing since this is not actually a code bug. We probably need to revisit this issue of numerical sensitivity in the boundary conditions, though.