To add a data point: I tried a test with today's master with default NCPL (96) on cori-knl_gnu
and it ran for 10hr 30min (42 timesteps) before running out of time on the debug queue.
I tried this test again using master of June 6th and I see the same issue. And again, it works with ATM_NCPL=288.
I also tried an ne120 case on chrysalis, first with ATM_NCPL=288 to compare directly with PM, but then I tried with the default timestepping and it completed 1 day (with GNU). So there must be something acting differently on PM.
It is also still the case that ne120 runs ok (for one day) using pm-cpu, so this issue seems only on GPU.
Ugh. I wonder if this is the return of the "hot planet" problems. Were the failing runs with fully optimized builds? Have you tried a debug build on gnugpu?
If I turn on the Kokkos debug macros, it will complete one day on the GPUs with default NCPL=96. So it could still be that we have a race condition. https://github.com/E3SM-Project/scream/issues/1557
Noting that others have been able to run ne120 with default NCPL=96 on summit GPU for 5 days.
I still see the same issue with the June 13th repo (and using nospa): the optimized build fails after 1 or 2 steps, but setting NCPL=288 will run 1 day. I can also use the default NCPL=96 and run 1 day if I set the Kokkos debug macros.
If you're willing to entertain a random shot in the dark, try adding team.team_barrier();
at line 70 of src/physics/shoc/shoc_linear_interp_impl.hpp, i.e., right after the kernel that has this one line:
y2(k2).set(y2(k2) < minthresh, minthresh);
Edit: If by chance this works, it won't actually be the correct solution because it's too big a hammer; rather, there will be two places in SHOC to insert team.team_barrier() at a finer-grained level.
I still see the same behavior: after the first step, it hits the same error.
login33% git diff components/scream/src/physics/shoc/shoc_linear_interp_impl.hpp
diff --git a/components/scream/src/physics/shoc/shoc_linear_interp_impl.hpp b/components/scream/src/physics/shoc/shoc_linear_interp_impl.hpp
index a94637c377..cf80b9b1e0 100644
--- a/components/scream/src/physics/shoc/shoc_linear_interp_impl.hpp
+++ b/components/scream/src/physics/shoc/shoc_linear_interp_impl.hpp
@@ -67,6 +67,7 @@ void Functions<S,D>::linear_interp(
Kokkos::parallel_for(Kokkos::TeamThreadRange(team, km2_pack), [&] (const Int& k2) {
y2(k2).set(y2(k2) < minthresh, minthresh);
});
+ team.team_barrier(); //ndk via AB
}
} // namespace shoc
Do you know of a reproducer of bad results of any sort on the PM GPUs that occurs in a problem size smaller than ne120? If not, any idea of the smallest number of nodes on which one can run ne120?
Previously (a few PRs ago), we had a similar (likely the same) issue https://github.com/E3SM-Project/scream/issues/1557 with ne30 as well. It was the same situation, where we had trouble with OPT but not DEBUG, and tracked the differences down to simply the Kokkos macros. However, ne30 now seems to work fine. For this ne120 case, I'm afraid we need 32 nodes of pm-gpu.
How long have you run ne30? I ran ne30 out of the box last night on PM and it failed at Atmosphere step = 980 model time = 0001-01-21 10:00:00
with no useful error output. I'm rerunning it (probably will have to do it a few times before I figure out how to get the output I need) to see if I can get more info.
I've not run ne30 very long. I'm not sure if the default writes snapshots? We know that each snapshot leaks memory: https://github.com/E3SM-Project/scream/issues/1565.
I shouldn't have written "out of the box". I meant that the model config (time steps, subcycling, etc.) is out of the box, but I definitely am suppressing all output right now. Also, I now realize I had switched to SL transport (so definitely not out of the box; sorry), so for reproducibility in the future, I'll switch back to Eul.
ne30 update. Switched to Eul, so now I'm running the default model config with 0 output. The failure at time step 1052 is this:
2: Error! Failed post-condition check (cannot be repaired).
2: - Atm process name: Dynamics
2: - Property check name: ps within interval [40000, 110000]
2: - Atmosphere process MPI Rank: 2
2: - Error message: Check failed.
2: - check name: ps within interval [40000, 110000]
2: - field id: ps[Physics GLL] <double:COL>(3042) [m^-1 s^-2 kg]
2: - minimum:
2: - value: 39994.5
2: - entry: (15239)
2: - lat/lon: (30.7265, 84.8292)
2: - maximum:
2: - value: 69549.7
2: - entry: (7656)
2: - lat/lon: (35.1057, 20.1708)
I'm going to rerun with the ps limit increased.
Another run fails earlier (step 934), strongly suggesting nondeterminism (since the ps check leads to immediate failure, so relaxed bounds shouldn't change the answer). In this run, I again get no useful output.
Edit: And another at step 978. Edit: And two more at 967 and 740.
Relevant to determinism, from Ben: atomicAdd's at line 295 in components/eam/src/physics/rrtmgp/external/cpp/rrtmgp/kernels/mo_gas_optics_kernels.cpp and line 30 in components/eam/src/physics/rrtmgp/external/cpp/rte/kernels/mo_fluxes_broadband_kernels.cpp. So I might make those deterministic and then test again.
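(For reference, a minimal sketch of the kind of change I have in mind -- illustrative names only, not the actual RRTMGP kernels: have one thread accumulate each output entry in a fixed loop order instead of letting many threads atomicAdd into it, so the summation order, and hence the roundoff, is identical from run to run.)
#include <Kokkos_Core.hpp>

// Hypothetical illustration (not the RRTMGP code): sum per-layer contributions
// into a per-column flux. With atomicAdd the addition order varies run to run;
// here each column is summed by one thread in a fixed order, so the result is
// bitwise reproducible.
void column_sum_deterministic (const Kokkos::View<const double**>& contrib,
                               const Kokkos::View<double*>& flux) {
  const int ncol = contrib.extent(0), nlay = contrib.extent(1);
  Kokkos::parallel_for("col_sums", ncol, KOKKOS_LAMBDA (const int icol) {
    double sum = 0;
    for (int ilay = 0; ilay < nlay; ++ilay)
      sum += contrib(icol, ilay); // fixed order, no atomics
    flux(icol) = sum;
  });
}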
I typically run ne30 for only 5 days. When I tried to run for a month, indeed it stops as Andrew sees on pm-gpu. I ran it again and it stopped again at a different location (step=1011 and step=1039). I verified that it does run for a complete month using pm-cpu. I then built with the kokkos debug macros and ran again on pm-gpu. It completes the month. Note that, with the kokkos debug macros, it is about 1.7x slower.
Suggestion: If it completes 1 month with the debug macros, try 3. If it fails, then check if it's deterministic or not. You can also check determinism with ~10-day runs by setting statefreq=480 and leaving disable_diagnostics at its default False value. Then diff, e.g., the qv( 1) output lines in homme_atm.log.
I submitted a job to run for a year with the kokkos debug macros. The job completed 1 year (17.7 sypd on 8 nodes).
Looks like qv(1) output is a good proxy for this issue. See #1750 for details.
@ndkeen's Kokkos debug flags make the qv output reasonable, so indeed ne30 with qv output is a good reproducer for this problem. Edit: I use this to insert the flags (edited from having =; the python parser wants just spaces between key and value):
./xmlchange --append SCREAM_CMAKE_OPTIONS="Kokkos_ENABLE_DEBUG TRUE Kokkos_ENABLE_DEBUG_BOUNDS_CHECK TRUE"
Status: The mystery remains. Here I summarize what I've tried so far.
From my notes:
no rrtmgp: grep "qv( 1)=" homme_atm.log.2412793.220619-130412 bad
also H mode: grep "qv( 1)=" homme_atm.log.2415318.220619-143851 still bad
also no p3: grep "qv( 1)=" homme_atm.log.2416904.220619-160319 still bad
fences in scpl (?), team_barrier in shoc lininterp (seems not needed): still bad
same but H mode: still bad
same but with tons of fences in AD files: still bad
switch to setting defines for Kokkos in various subcomponents: P3, SHOC, RRTMGP, surface coupling, dynamics AD driver, AD remapper
#define KOKKOS_ENABLE_DEBUG <--- leads to seg fault. so can't do this for just some files. Kokkos must be built debug.
#define KOKKOS_ENABLE_DEBUG_BOUNDS_CHECK <--- permitted, but still bad
By setting defines I mean like this:
diff --git a/components/scream/src/physics/shoc/shoc_main_impl.hpp b/components/scream/src/physics/shoc/shoc_main_impl.hpp
index 3f7a6bcea7..d7fb789b50 100644
--- a/components/scream/src/physics/shoc/shoc_main_impl.hpp
+++ b/components/scream/src/physics/shoc/shoc_main_impl.hpp
@@ -1,6 +1,8 @@
#ifndef SHOC_MAIN_IMPL_HPP
#define SHOC_MAIN_IMPL_HPP
+#define KOKKOS_ENABLE_DEBUG_BOUNDS_CHECK
+
#include "shoc_functions.hpp" // for ETI only but harmless for GPU
#include "ekat/kokkos/ekat_subview_utils.hpp"
Thanks for the tip using xmlchange -- indeed that's easier.
./xmlchange --append SCREAM_CMAKE_OPTIONS="Kokkos_ENABLE_DEBUG=TRUE Kokkos_ENABLE_DEBUG_BOUNDS_CHECK=TRUE"
I did spend some effort trying to reduce what is meant by these DEBUG flags -- i.e., hacking Kokkos so that fewer things are different when I have these set. I thought I was making progress, but in the end it got too confusing.
Do you agree it seems odd that these 2 macros have such an effect?
With a true scream DEBUG attempt, I noticed the INFORM line below:
0: Create Pool
0: NVIDIA A100-SXM4-40GB
0: INFORM: Automatically inserting fence() after every parallel_for
0: [EAMXX] initialize_atm_procs ... done!
Is it easy or worthwhile to have an OPT build also insert fences after every parallel_for (as a test only)? And maybe after any kokkos parallel section?
It might be. I've tried with no success so far. FYI, the message about fence is from YAKL, so it applies only to YAKL parallel_for's.
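(If someone wants to try approximating that in our Kokkos code, here is a test-only helper along the lines I'd imagine -- a hypothetical wrapper, not an existing EKAT/Kokkos API: drop it in place of Kokkos::parallel_for in a suspect file to force a device fence after each launch, the way YAKL's debug mode does.)
#include <Kokkos_Core.hpp>
#include <string>

// Hypothetical test-only helper: launch a kernel, then block until it completes.
// Using this in place of Kokkos::parallel_for in suspect files approximates
// YAKL's "fence after every parallel_for" behavior in an otherwise OPT build.
template <typename Policy, typename Functor>
void fenced_parallel_for (const std::string& name, const Policy& policy,
                          const Functor& f) {
  Kokkos::parallel_for(name, policy, f);
  Kokkos::fence();
}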
A clue: In atmosphere_macrophysics.cpp, I send in a copy of qv so that SHOC doesn't update the field-managed qv. First copy the current field-managed qv to a local deep copy.
auto qv_copy = decltype(qv)("qv_copy", qv.extent(0), qv.extent(1));
Kokkos::deep_copy(qv_copy, qv);
Then use qv_copy in the pre/post_process.set_variables calls. This gives:
qv( 1)= 0.1490436289941499E-06 0.2293270640075207E-01 0.3130485544141958E+04
qv( 1)= 0.2143987508901469E-06 0.2337311497819286E-01 0.3130485544141958E+04
qv( 1)= 0.2447607835196377E-06 0.2386589242512729E-01 0.3130485544141958E+04
qv( 1)= 0.4202401184276581E-06 0.2446347536817830E-01 0.3130485544141958E+04
qv( 1)= 0.5981288283715515E-06 0.1355537078452900E-01 0.3149072451903349E+04
qv( 1)= 0.6212727534235165E-06 0.1239548660204261E-01 0.3155109606859552E+04
qv( 1)= 0.6304034566324581E-06 0.1084175897403992E-01 0.3165723663929458E+04
qv( 1)= 0.6401187497050900E-06 0.1038225286286067E-01 0.3183673900807513E+04
qv( 1)= 0.6554598074739484E-06 0.9218549438863261E-02 0.3207522963873035E+04
qv( 1)= 0.6627259705037200E-06 0.8261634658853599E-02 0.3233565829408431E+04
qv( 1)= 0.6679467303655037E-06 0.8276237328978172E-02 0.3263389412549949E+04
qv( 1)= 0.6753990393311130E-06 0.8266310222328018E-02 0.3295978223359408E+04
qv( 1)= 0.6855216023587357E-06 0.8096898159690894E-02 0.3328738944800448E+04
Thus, next I'm going to see if I can isolate this to one of the various uses of qv in SHOC.
A more fine-grained clue: Commenting out this line:
qw_s(nlev-1) += cmnfac*wqw_sfc;
in shoc_update_prognostics_implicit_impl.hpp prevents the qv global mass from increasing. So the issue might be in surface coupling rather than in SHOC.
From a different direction, I tried to isolate the uses of KOKKOS_ENABLE_DEBUG/_BOUNDS_CHECK in Kokkos and remove them. After removing every one that grepping revealed, I got the same result as if they were still there. Then I modified the one compiler arg difference, the presence of -lineinfo. Still the same. So there's some other important side effect of
./xmlchange --append SCREAM_CMAKE_OPTIONS="Kokkos_ENABLE_DEBUG=TRUE Kokkos_ENABLE_DEBUG_BOUNDS_CHECK=TRUE"
that I'm not seeing yet.
Ok, interesting, writing that made me check one other place. Turns out the use of KOKKOS_ENABLE_DEBUG in components/homme/src/share/cxx/ExecSpaceDefs.cpp may be a crucial difference. Following that lead now.
Confirmed: If in an opt build you set max_num_warps to 8 in components/homme/src/share/cxx/ExecSpaceDefs.cpp, then the qv(1) proxy looks good.
Making good progress. I'm homing in on something amiss in Homme's RemapFunctor.hpp or lower.
Found it. HybridVCoord::compute_ps_ref_from_dp is missing a team barrier. I'll include this fix in the upstream PR to fix the team size in the debug build. @ndkeen if you'd like to try the fix, here's the diff:
diff --git a/components/homme/src/share/cxx/HybridVCoord.hpp b/components/homme/src/share/cxx/HybridVCoord.hpp
index 1348246c00..b96237acc0 100644
--- a/components/homme/src/share/cxx/HybridVCoord.hpp
+++ b/components/homme/src/share/cxx/HybridVCoord.hpp
@@ -108,6 +108,7 @@ public:
auto dp_ij = Homme::subview(dp,igp,jgp);
ColumnOps::column_reduction<NUM_PHYSICAL_LEV>(kv,dp_ij,ps(igp,jgp));
+ kv.team_barrier();
Kokkos::single(Kokkos::PerThread(kv.team),[&](){
ps(igp,jgp) += hybrid_ai0*ps0;
Edit: I was debugging with ne4. But ne30 looks good, too.
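(For anyone reading along later, a standalone sketch of the general pattern involved -- illustrative only, not the Homme code: one team-level loop writes data that a later part of the same kernel reads, and Kokkos only guarantees those writes are visible across the team if a team_barrier sits in between. On hardware with less lockstepping, or with larger team sizes, the missing barrier actually bites.)
#include <Kokkos_Core.hpp>

// Illustrative only (not the Homme code): without the team_barrier, threads in
// the second TeamThreadRange loop may read entries of 'a' that other threads
// have not yet written, which shows up as nondeterministic results.
int main (int argc, char** argv) {
  Kokkos::initialize(argc, argv);
  {
    const int n = 128;
    Kokkos::View<double*> a("a", n), b("b", n);
    using Policy = Kokkos::TeamPolicy<>;
    Kokkos::parallel_for(Policy(1, Kokkos::AUTO),
                         KOKKOS_LAMBDA (const Policy::member_type& team) {
      Kokkos::parallel_for(Kokkos::TeamThreadRange(team, n),
                           [&] (const int i) { a(i) = i; });
      team.team_barrier(); // required before other threads read a(i)
      Kokkos::parallel_for(Kokkos::TeamThreadRange(team, n),
                           [&] (const int i) { b(i) = 2*a(i); });
    });
  }
  Kokkos::finalize();
  return 0;
}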
Excellent!
Yay!!!! This is a big deal, huge thanks for tracking it down, Andrew!
ne30, statefreq=480, SL transport, 16 GPUs, 1 hour limit ran 4041 steps before running out of time. The qv(1) Homme diagnostic output looks solid. However, the performance is terrible, the cause of which I've documented in the new issue #1757.
Edit: The performance issue was fixed in the upstream PR 5046.
See discussion starting at this comment, where it appears that the team_barrier that fixes this issue also fixes preqxx on Crusher. This suggests that the MI250X, in addition to the A100, has less h/w lockstepping than the V100.
I can confirm that with this bugfix (in a July 5th repo), I can run ne120 F cases on pm-gpu using the default NCPL=96 for 1 day and get performance results that seem reasonable for ATM.
I think PR #5046 fixes the issue.
Great. Should we close this issue?
With master of May 27th, I'm seeing a familiar error with ne120, but it happens pretty soon -- after the first step on pm-gpu. When I change to ATM_NCPL=288 (from 96), it runs for 1 day. With the same repo, when I try on pm-cpu, it completes 1 day (using the default ATM_NCPL=96). Note there are about 340 warnings on rank 0, presumably 512x that in total (with 512 MPI's) -- can I turn those off? Can be reproduced with
SMS_P128_Ld1.ne120_r0125_oRRS18to6v3.F2010-SCREAMv1.pm-gpu_gnugpu