To add a data point: I tried a test with today's master with default NCPL (96) on cori-knl_gnu
and it ran for 10hr 30min (42 timesteps) before running out of time on the debug queue.
I tried this test again using master of June 6th and I see the same issue. And again, it works with ATM_NCPL=288.
I also tried an ne120 case on chrysalis, first with ATM_NCPL=288 to compare directly with PM, but then I tried with the default timestepping and it completed 1 day (with GNU). So there must be something acting differently on PM.
It is also still the case that ne120 runs ok (for one day) using pm-cpu, so this issue seems only on GPU.
Ugh. I wonder if this is the return of the "hot planet" problems. Were the failing runs with fully optimized builds? Have you tried a debug build on gnugpu?
If I turn on the Kokkos debug macros, it will complete one day on the GPUs with default NCPL=96. So it could still be that we have a race condition. https://github.com/E3SM-Project/scream/issues/1557
Noting that others have been able to run ne120 with default NCPL=96 on summit GPU for 5 days.
I still see the same issue with the June 13th repo (and using nospa): the optimized build fails after 1 or 2 steps, but setting NCPL=288 will run 1 day. I can also use the default NCPL=96 and run 1 day if I set the Kokkos debug macros.
If you're willing to entertain a random shot in the dark, try adding team.team_barrier();
at line 70 of src/physics/shoc/shoc_linear_interp_impl.hpp, i.e., right after the kernel that has this one line:
y2(k2).set(y2(k2) < minthresh, minthresh);
Edit: If by chance this works, it won't actually be the correct solution because it's too big a hammer; rather, there will be two places in SHOC to insert team.team_barrier() at a finer-grained level.
I still see the same behavior: after the first step, it hits the same error.
login33% git diff components/scream/src/physics/shoc/shoc_linear_interp_impl.hpp
diff --git a/components/scream/src/physics/shoc/shoc_linear_interp_impl.hpp b/components/scream/src/physics/shoc/shoc_linear_interp_impl.hpp
index a94637c377..cf80b9b1e0 100644
--- a/components/scream/src/physics/shoc/shoc_linear_interp_impl.hpp
+++ b/components/scream/src/physics/shoc/shoc_linear_interp_impl.hpp
@@ -67,6 +67,7 @@ void Functions<S,D>::linear_interp(
Kokkos::parallel_for(Kokkos::TeamThreadRange(team, km2_pack), [&] (const Int& k2) {
y2(k2).set(y2(k2) < minthresh, minthresh);
});
+ team.team_barrier(); //ndk via AB
}
} // namespace shoc
Do you know of a reproducer of bad results of any sort on the PM GPUs that occurs in a problem size smaller than ne120? If not, any idea of the smallest number of nodes on which one can run ne120?
Previously (a few PRs ago), we had a similar (likely the same) issue https://github.com/E3SM-Project/scream/issues/1557 with ne30 as well. It was the same situation, where we had trouble with OPT but not DEBUG, and tracked the differences down to simply the Kokkos macros. However, ne30 now seems to work fine. For this ne120 case, I'm afraid we need 32 nodes of pm-gpu.
How long have you run ne30? I ran ne30 out of the box last night on PM and it failed at Atmosphere step = 980 model time = 0001-01-21 10:00:00
with no useful error output. I'm rerunning it (probably will have to do it a few times before I figure out how to get the output I need) to see if I can get more info.
I've not run ne30 very long. I'm not sure if the default writes snapshots? We know that each snapshot leaks memory: https://github.com/E3SM-Project/scream/issues/1565.
I shouldn't have written "out of the box". I meant that the model config (time steps, subcycling, etc.) is out of the box, but I definitely am suppressing all output right now. Also, I now realize I had switched to SL transport (so definitely not out of the box; sorry), so for reproducibility in the future, I'll switch back to Eul.
ne30 update. Switched to Eul, so now I'm running the default model config with 0 output. The failure at time step 1052 is this:
2: Error! Failed post-condition check (cannot be repaired).
2: - Atm process name: Dynamics
2: - Property check name: ps within interval [40000, 110000]
2: - Atmosphere process MPI Rank: 2
2: - Error message: Check failed.
2: - check name: ps within interval [40000, 110000]
2: - field id: ps[Physics GLL] <double:COL>(3042) [m^-1 s^-2 kg]
2: - minimum:
2: - value: 39994.5
2: - entry: (15239)
2: - lat/lon: (30.7265, 84.8292)
2: - maximum:
2: - value: 69549.7
2: - entry: (7656)
2: - lat/lon: (35.1057, 20.1708)
I'm going to rerun with the ps limit increased.
Another run fails earlier (step 934), strongly suggesting nondeterminism (since the ps check leads to immediate failure, so relaxed bounds shouldn't change the answer). In this run, I again get no useful output.
Edit: And another at step 978. Edit: And two more at 967 and 740.
Relevant to determinism, from Ben: atomicAdd's at line 295 in components/eam/src/physics/rrtmgp/external/cpp/rrtmgp/kernels/mo_gas_optics_kernels.cpp and line 30 in components/eam/src/physics/rrtmgp/external/cpp/rte/kernels/mo_fluxes_broadband_kernels.cpp. So I might make those deterministic and then test again.
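(For reference, a minimal sketch of the kind of change I have in mind -- illustrative names only, not the actual RRTMGP kernels: have one thread accumulate each output entry in a fixed loop order instead of letting many threads atomicAdd into it, so the summation order, and hence the roundoff, is identical from run to run.)
#include <Kokkos_Core.hpp>

// Hypothetical illustration (not the RRTMGP code): sum per-layer contributions
// into a per-column flux. With atomicAdd the addition order varies run to run;
// here each column is summed by one thread in a fixed order, so the result is
// bitwise reproducible.
void column_sum_deterministic (const Kokkos::View<const double**>& contrib,
                               const Kokkos::View<double*>& flux) {
  const int ncol = contrib.extent(0), nlay = contrib.extent(1);
  Kokkos::parallel_for("col_sums", ncol, KOKKOS_LAMBDA (const int icol) {
    double sum = 0;
    for (int ilay = 0; ilay < nlay; ++ilay)
      sum += contrib(icol, ilay); // fixed order, no atomics
    flux(icol) = sum;
  });
}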
I typically run ne30 for only 5 days. When I tried to run for a month, indeed it stops as Andrew sees on pm-gpu. I ran it again and it stopped again at a different location (step=1011 and step=1039). I verified that it does run for a complete month using pm-cpu. I then built with the kokkos debug macros and ran again on pm-gpu. It completes the month. Note that, with the kokkos debug macros, it is about 1.7x slower.
Suggestion: If it completes 1 month with the debug macros, try 3. If it fails, then check if it's deterministic or not. You can also check determinism with ~10-day runs by setting statefreq=480 and leaving disable_diagnostics at its default False value. Then diff, e.g., the qv( 1) output lines in homme_atm.log.
I submitted a job to run for a year with the kokkos debug macros. The job completed 1 year (17.7 sypd on 8 nodes).
Looks like qv(1) output is a good proxy for this issue. See #1750 for details.
@ndkeen's Kokkos debug flags make the qv output reasonable, so indeed ne30 with qv output is a good reproducer for this problem. Edit: I use this to insert the flags (edited from having =; the python parser wants just spaces between key and value):
./xmlchange --append SCREAM_CMAKE_OPTIONS="Kokkos_ENABLE_DEBUG TRUE Kokkos_ENABLE_DEBUG_BOUNDS_CHECK TRUE"
Status: The mystery remains. Here I summarize what I've tried so far.
From my notes:
no rrtmgp: grep "qv( 1)=" homme_atm.log.2412793.220619-130412 bad
also H mode: grep "qv( 1)=" homme_atm.log.2415318.220619-143851 still bad
also no p3: grep "qv( 1)=" homme_atm.log.2416904.220619-160319 still bad
fences in scpl (?), team_barrier in shoc lininterp (seems not needed): still bad
same but H mode: still bad
same but with tons of fences in AD files: still bad
switch to setting defines for Kokkos in various subcomponents: P3, SHOC, RRTMGP, surface coupling, dynamics AD driver, AD remapper
#define KOKKOS_ENABLE_DEBUG <--- leads to seg fault. so can't do this for just some files. Kokkos must be built debug.
#define KOKKOS_ENABLE_DEBUG_BOUNDS_CHECK <--- permitted, but still bad
By setting defines I mean like this:
diff --git a/components/scream/src/physics/shoc/shoc_main_impl.hpp b/components/scream/src/physics/shoc/shoc_main_impl.hpp
index 3f7a6bcea7..d7fb789b50 100644
--- a/components/scream/src/physics/shoc/shoc_main_impl.hpp
+++ b/components/scream/src/physics/shoc/shoc_main_impl.hpp
@@ -1,6 +1,8 @@
#ifndef SHOC_MAIN_IMPL_HPP
#define SHOC_MAIN_IMPL_HPP
+#define KOKKOS_ENABLE_DEBUG_BOUNDS_CHECK
+
#include "shoc_functions.hpp" // for ETI only but harmless for GPU
#include "ekat/kokkos/ekat_subview_utils.hpp"
Thanks for the tip using xmlchange -- indeed that's easier.
./xmlchange --append SCREAM_CMAKE_OPTIONS="Kokkos_ENABLE_DEBUG=TRUE Kokkos_ENABLE_DEBUG_BOUNDS_CHECK=TRUE"
I did spend some effort trying to reduce what is meant by these DEBUG flags -- i.e., hacking Kokkos so that fewer things are different when I have these set. I thought I was making progress, but in the end it got too confusing.
Do you agree it seems odd that these 2 macros have such an effect?
With a true scream DEBUG attempt, I noticed the INFORM line below:
0: Create Pool
0: NVIDIA A100-SXM4-40GB
0: INFORM: Automatically inserting fence() after every parallel_for
0: [EAMXX] initialize_atm_procs ... done!
Is it easy or worthwhile to have an OPT build also insert fences after every parallel_for (as a test only)? And maybe after any kokkos parallel section?
It might be. I've tried with no success so far. FYI, the message about fence is from YAKL, so it applies only to YAKL parallel_for's.
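(If someone wants to try approximating that in our Kokkos code, here is a test-only helper along the lines I'd imagine -- a hypothetical wrapper, not an existing EKAT/Kokkos API: drop it in place of Kokkos::parallel_for in a suspect file to force a device fence after each launch, the way YAKL's debug mode does.)
#include <Kokkos_Core.hpp>
#include <string>

// Hypothetical test-only helper: launch a kernel, then block until it completes.
// Using this in place of Kokkos::parallel_for in suspect files approximates
// YAKL's "fence after every parallel_for" behavior in an otherwise OPT build.
template <typename Policy, typename Functor>
void fenced_parallel_for (const std::string& name, const Policy& policy,
                          const Functor& f) {
  Kokkos::parallel_for(name, policy, f);
  Kokkos::fence();
}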
A clue: In atmosphere_macrophysics.cpp, I send in a copy of qv so that SHOC doesn't update the field-managed qv. First copy the current field-managed qv to a local deep copy.
auto qv_copy = decltype(qv)("qv_copy", qv.extent(0), qv.extent(1));
Kokkos::deep_copy(qv_copy, qv);
Then use qv_copy in the pre/post_process.set_variables calls. This gives:
qv( 1)= 0.1490436289941499E-06 0.2293270640075207E-01 0.3130485544141958E+04
qv( 1)= 0.2143987508901469E-06 0.2337311497819286E-01 0.3130485544141958E+04
qv( 1)= 0.2447607835196377E-06 0.2386589242512729E-01 0.3130485544141958E+04
qv( 1)= 0.4202401184276581E-06 0.2446347536817830E-01 0.3130485544141958E+04
qv( 1)= 0.5981288283715515E-06 0.1355537078452900E-01 0.3149072451903349E+04
qv( 1)= 0.6212727534235165E-06 0.1239548660204261E-01 0.3155109606859552E+04
qv( 1)= 0.6304034566324581E-06 0.1084175897403992E-01 0.3165723663929458E+04
qv( 1)= 0.6401187497050900E-06 0.1038225286286067E-01 0.3183673900807513E+04
qv( 1)= 0.6554598074739484E-06 0.9218549438863261E-02 0.3207522963873035E+04
qv( 1)= 0.6627259705037200E-06 0.8261634658853599E-02 0.3233565829408431E+04
qv( 1)= 0.6679467303655037E-06 0.8276237328978172E-02 0.3263389412549949E+04
qv( 1)= 0.6753990393311130E-06 0.8266310222328018E-02 0.3295978223359408E+04
qv( 1)= 0.6855216023587357E-06 0.8096898159690894E-02 0.3328738944800448E+04
Thus, next I'm going to see if I can isolate this to one of the various uses of qv in SHOC.
A more fine-grained clue: Commenting out this line:
qw_s(nlev-1) += cmnfac*wqw_sfc;
in shoc_update_prognostics_implicit_impl.hpp prevents the qv global mass from increasing. So the issue might be in surface coupling rather than in SHOC.
From a different direction, I tried to isolate the uses of KOKKOS_ENABLE_DEBUG/_BOUNDS_CHECK in Kokkos and remove them. After removing every one that grepping revealed, I got the same result as if they were still there. Then I modified the one compiler arg difference, the presence of -lineinfo. Still the same. So there's some other important side effect of
./xmlchange --append SCREAM_CMAKE_OPTIONS="Kokkos_ENABLE_DEBUG=TRUE Kokkos_ENABLE_DEBUG_BOUNDS_CHECK=TRUE"
that I'm not seeing yet.
Ok, interesting, writing that made me check one other place. Turns out the use of KOKKOS_ENABLE_DEBUG in components/homme/src/share/cxx/ExecSpaceDefs.cpp may be a crucial difference. Following that lead now.
Confirmed: If in an opt build you set max_num_warps to 8 in components/homme/src/share/cxx/ExecSpaceDefs.cpp, then the qv(1) proxy looks good.
Making good progress. I'm homing in on something amiss in Homme's RemapFunctor.hpp or lower.
Found it. HybridVCoord::compute_ps_ref_from_dp is missing a team barrier. I'll include this fix in the upstream PR to fix the team size in the debug build. @ndkeen if you'd like to try the fix, here's the diff:
diff --git a/components/homme/src/share/cxx/HybridVCoord.hpp b/components/homme/src/share/cxx/HybridVCoord.hpp
index 1348246c00..b96237acc0 100644
--- a/components/homme/src/share/cxx/HybridVCoord.hpp
+++ b/components/homme/src/share/cxx/HybridVCoord.hpp
@@ -108,6 +108,7 @@ public:
auto dp_ij = Homme::subview(dp,igp,jgp);
ColumnOps::column_reduction<NUM_PHYSICAL_LEV>(kv,dp_ij,ps(igp,jgp));
+ kv.team_barrier();
Kokkos::single(Kokkos::PerThread(kv.team),[&](){
ps(igp,jgp) += hybrid_ai0*ps0;
Edit: I was debugging with ne4. But ne30 looks good, too.
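(For anyone reading along later, a standalone sketch of the general pattern involved -- illustrative only, not the Homme code: one team-level loop writes data that a later part of the same kernel reads, and Kokkos only guarantees those writes are visible across the team if a team_barrier sits in between. On hardware with less lockstepping, or with larger team sizes, the missing barrier actually bites.)
#include <Kokkos_Core.hpp>

// Illustrative only (not the Homme code): without the team_barrier, threads in
// the second TeamThreadRange loop may read entries of 'a' that other threads
// have not yet written, which shows up as nondeterministic results.
int main (int argc, char** argv) {
  Kokkos::initialize(argc, argv);
  {
    const int n = 128;
    Kokkos::View<double*> a("a", n), b("b", n);
    using Policy = Kokkos::TeamPolicy<>;
    Kokkos::parallel_for(Policy(1, Kokkos::AUTO),
                         KOKKOS_LAMBDA (const Policy::member_type& team) {
      Kokkos::parallel_for(Kokkos::TeamThreadRange(team, n),
                           [&] (const int i) { a(i) = i; });
      team.team_barrier(); // required before other threads read a(i)
      Kokkos::parallel_for(Kokkos::TeamThreadRange(team, n),
                           [&] (const int i) { b(i) = 2*a(i); });
    });
  }
  Kokkos::finalize();
  return 0;
}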
Excellent!
Yay!!!! This is a big deal, huge thanks for tracking it down, Andrew!
ne30, statefreq=480, SL transport, 16 GPUs, 1 hour limit ran 4041 steps before running out of time. The qv(1) Homme diagnostic output looks solid. However, the performance is terrible, the cause of which I've documented in the new issue #1757.
Edit: The performance issue was fixed in the upstream PR 5046.
See discussion starting at this comment, where it appears that the team_barrier that fixes this issue also fixes preqxx on Crusher. This suggests that the MI250X, in addition to the A100, has less h/w lockstepping than the V100.
I can confirm that with this bugfix (in a July 5th repo), I can run ne120 F cases on pm-gpu using the default NCPL=96 for 1 day and get performance results that seem reasonable for ATM.
I think PR #5046 fixes the issue.
Great. Should we close this issue?
With master of May 27th, I'm seeing a familiar error with ne120, but it happens pretty soon -- after the first step on pm-gpu. When I change to ATM_NCPL=288 (from 96), it runs for 1 day. With the same repo, when I try on pm-cpu, it completes 1 day (using the default ATM_NCPL=96). Note there are about 340 warnings on rank 0, presumably 512x that in total (with 512 MPI's) -- can I turn those off? Can be reproduced with
SMS_P128_Ld1.ne120_r0125_oRRS18to6v3.F2010-SCREAMv1.pm-gpu_gnugpu