Closed PeterCaldwell closed 2 years ago
As a sanity check, below are plots of ne120 before and after the omega fix. Compare in particular the colorbars for the bottom left panels (near-surf T at the end of the run)... before the omega fix, max is ~304K and afterwards it is 419K. Note as well the top left panel. This shows the timeseries (along the x axis) of the vertical profiles of global-ave T. First, note that both runs are approximately equal in length (x axis units is fraction of a day). Note as well that the post-omega fix has huge lowest-level T basically immediately and that huge value persists for the rest of the run. ne120_T_mid_pre-omegafix.pdf ne120_T_mid_post-omegafix.pdf
Of course, we don't know that it was the omega PR that caused this change because we had been using an older tag for eval, then jumped to the post-omega version. We need to bisect to confirm that omega was the source of trouble.
Yeah, I think we should try the version right before the omega PR merge. Then we'll know if and how the change contributed to the bad T profile.
Update: If I understand correctly, both of Noel's ne120 runs I analyze above were from the same repo and only differed in whether they had the omega bugfix tacked on. Which makes it sound like our bisection is done - it must be the omega bugfix. This is inconsistent, however, with a run Chris T just did using master from Apr 11th (way before the omega bugfix) which also seemed to get hot.
Another weird thing is that all of these hot runs were compiled in optimized mode. I figured this wouldn't matter, but when I analyzed a run Noel did with the bugfix but compiled in debug mode, the simulation ran longer and did not get really hot! ne120_T_mid_post-omegafix-debug-compile.pdf
I'm assuming Noel runs are on GPU. Are Christ runs on CPU? In other words, does the arch seem to impact things?
I found something similar with the clubb code with the intel compiler on some architectures. A newer version fixed that issue. I tracked it down (using perturbation growth test) and it was indeed a compiler bug (More details here). Perhaps, we can use a newer version of the compiler to see if the problem goes away.
Interesting comments, you two. I think both @crterai and @ndkeen are both using Perlmutter GPUs for their runs. I think Chris was going to try running on Cori - he would need to turn off SPA to do that to avoid OOM - but I'm not sure what the results of that are...
My simulations that get hot have been on GPUs as well. I do find it odd that the simulations with DEBUG compile seem to run longer. Mine have all not been debug compile.
We should fast-track the SPA pr perhaps, and try on CPU as well. That might tell us if the issue is related to the backend.
Though we should figure out the discrepancy between Noel and Chris regarding when the code started to go bananas.
I just ran ~6 hrs of ne30 with optimized build on quartz. Max T is 350K (over Australia) by the end of the run. This seems a bit high, but is nothing like what we're seeing on perlmutter and the run continued 4x longer than Chris has managed so far. So I think we can conclude there's probably something wrong with Perlmutter. It would be good to verify on Cori.
I should mention that this run does crash eventually due to qv [not] within interval [1e-13, 0.2]
in microphysics.
I can confirm that on cori, the simulation doesn't get ridiculously hot as it did on PM. The simulation still crashes for a different reason around the same time the others have crashed (1hr30min).
474: WARNING: Post-condition property check failed and repaired.
474: - Property check name: CheckAndRepairWrapper
474: - Check : tracers lower bound check: -1e-17
474: - Repair: Positivity check for field tracers
474:
474: - Atmosphere process name: Dynamics
474: - Atmosphere process MPI Rank: 474
474: forrtl: error (65): floating invalid
474: Image PC Routine Line Source
474: e3sm.exe 0000000008493324 Unknown Unknown Unknown
474: e3sm.exe 0000000006CAF1D0 Unknown Unknown Unknown
474: e3sm.exe 00000000055CCC2A shr_const_mod_mp_ 94 shr_const_mod.F90
474: e3sm.exe 000000000542E8E8 mct_mod_mp_mct_av 1025 mct_mod.F90
474: e3sm.exe 000000000047C5E5 component_mod_mp_ 760 component_mod.F90
474: e3sm.exe 000000000043B48C cime_comp_mod_mp_ 3085 cime_comp_mod.F90
474: e3sm.exe 000000000046354C MAIN__ 153 cime_driver.F90
474: e3sm.exe 00000000004020F2 Unknown Unknown Unknown
474: e3sm.exe 0000000008A3E34F Unknown Unknown Unknown
474: e3sm.exe 0000000000401FDA Unknown Unknown Unknown
srun: error: nid02620: task 474: Aborted
I verified that with a master of April 27th, I also see this issue. With ne30, fully OPT builds will run for 14 steps and stop with
1: terminate called after throwing an instance of 'std::logic_error'
1: what(): /global/cfs/cdirs/e3sm/ndk/se02-apr27/components/scream/src/share/atm_process/atmosphere_process.cpp:282: FAIL:
1: false
1: Error! Failed post-condition check (cannot be repaired).
1: - Atm process name: Dynamics
1: - Property check name: T_mid within interval [140, 500]
1: - Atmosphere process MPI Rank: 1
1: - Error message: FieldWithinIntervalCheck failed; min = 188.589752; max = 523.822598
Chris T verified it looks like the same too hot issue: https://portal.nersc.gov/cfs/e3sm/terai/SCREAM/v1_analysis/ne30_v1_PM_quicktest_220428.pdf
Implementing #1572 (UVM/run select kernels on host) would be useful for debugging this problem, which raises it in my priority...
@PeterCaldwell I am working on it in EKAT. So far it seems like it's going well. I hope to finish it tomorrow.
When we build DEBUG, the results seem OK. When we build OPT, it will still run (mileage may vary), but the results quickly look wild. I have spent a lot of time trying to figure out what's happening and after a lot of trial-n-error, I have found that I can get the same behavior as the DEBUG runs by turning on certain KOKKOS debug macros.
set(Kokkos_ENABLE_DEBUG TRUE CACHE BOOL "")
set(Kokkos_ENABLE_DEBUG_BOUNDS_CHECK TRUE CACHE BOOL "")
If I add those 2 lines for CUDA/OPT builds, it seems to run OK on PM. And at about half the speed we expect (compared to quite a bit slower with full DEBUG). Note this is otherwise optimized build -- ie this is the ONLY thing needed (other than I also need the same work-around for the team size too large issue https://github.com/E3SM-Project/scream/issues/1485).
Note that the BOUNDS_CHECK is really what matters here, but I have to also set the first macro to get around various team size too large issues.
Chatting with C Trott, he suggested this could be a race condition as it is essentially slowing down the GPU's.
I spent some time looking closer at how those flags are used within KOKKOS and learned a little bit more, but ultimately had to give up trying to find a change that has lower impact.
If I could find a way to alter each parallel section to be "safer" in some way, I might be able to rule out race condition or narrow down location.
@ndkeen
Could be missing team_barrier or fence calls. Given that the unit tests for the low-level components pass on weaver in the release build, the likeliest spots missing such barriers are the interfaces.
Is there something I can add after various kokkos sections to see if it helps?
Well, outside of top-level parallelfor/reduce/scans you can add Kokkos::fence()
. For completeness, put one call before and one call after the parallel*.
Inside of a kernel -- that is, inside the top-level parallel_* -- you can write team.team_barrier()
before and after each TeamThreadRange loop.
I've been adding fences before/after any parallel_
regions. Within P3 I did that and did not see any changes. And within shoc, I've been getting various runtime fails. I'm doing:
Kokkos::fence();
Kokkos::parallel_for(Kokkos::TeamThreadRange(team, nlev_pack), [&] (const Int& k) {
...
});
Kokkos::fence();
why would adding these cause runtime fails in certain locations?
What are the runtime fail messages? I agree with you that it seems that putting fences right after each kernel should not cause a failure. But maybe it's changing the solution, which then causes a failure.
While some sets of files including fences will run (though T still too large), others will fail with either:
0: :0: : block: [284,0,0], thread: [0,32,0] Assertion `Number of arguments passed to Kokkos::View() constructor must match the dynamic rank of the view.` failed.
or
0: terminate called after throwing an instance of 'std::runtime_error'
0: what(): cudaDeviceSynchronize() error( cudaErrorIllegalAddress): an illegal memory access was encountered /global/cfs/cdirs/e3sm/ndk/se04-may24/externals/ekat/extern/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:150
0: Traceback functionality not available
...
0: at /global/cfs/cdirs/e3sm/ndk/se04-may24/externals/ekat/extern/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:964
0: #15 0x19b96be in _ZN6Kokkos4Impl16ExecSpaceManager12static_fenceERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
0: at /global/cfs/cdirs/e3sm/ndk/se04-may24/externals/ekat/extern/kokkos/core/src/impl/Kokkos_Core.cpp:135
0: #16 0x19b96be in _ZN6Kokkos4Impl48_GLOBAL__N__38a2ddd6_15_Kokkos_Core_cpp_889c95a614fence_internalERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
0: at /global/cfs/cdirs/e3sm/ndk/se04-may24/externals/ekat/extern/kokkos/core/src/impl/Kokkos_Core.cpp:568
0: #17 0x19b96be in _ZN6Kokkos5fenceEv
0: at /global/cfs/cdirs/e3sm/ndk/se04-may24/externals/ekat/extern/kokkos/core/src/impl/Kokkos_Core.cpp:1097
0: #18 0x17e82e8 in _ZN6scream4shoc9FunctionsIdN6Kokkos6DeviceINS2_4CudaENS2_9CudaSpaceEEEE9shoc_mainERKiS9_S9_S9_S9_S9_RKdRKN4ekat16WorkspaceManagerINSC_4PackIdLi1EEES6_EERKNS7_9SHOCInputERKNS7_15SHOCInputOutputERKNS7_10SHOCOutputERKNS7_17SHOCHistoryOutputE
0: at /global/cfs/cdirs/e3sm/ndk/se04-may24/components/scream/src/physics/shoc/shoc_main_impl.hpp:380
0: #19 0x17d517a in _ZN6scream16SHOCMacrophysics8run_implEi
0: at /global/cfs/cdirs/e3sm/ndk/se04-may24/components/scream/src/physics/shoc/atmosphere_macrophysics.cpp:411
0: #20 0x18ebcfc in _ZN6scream17AtmosphereProcess3runEi
0: at /global/cfs/cdirs/e3sm/ndk/se04-may24/components/scream/src/share/atm_process/atmosphere_process.cpp:70
...
That's a Kokkos View usage error. Interesting. I wonder why we don't see this in our CPU runs. You can bisect for the incorrect usage by going into shoc_main_internal and placing a return statement at various locations until you home in on the function that is causing this error.
It's also possible that cuda's memcheck will help pinpoint the location of the error.
So it would make sense that adding fences (in locations like above) would cause errors like this? Note that I'm not trying these changes on CPU, just GPU.
Yes, a fence will halt everything and wait for the GPU to catch up. The fact that we don't see the error in a standard CPU run is puzzling, however.
I'll note that usage of this sort in shoc_functions.cpp
view_2d
z_d(temp_2d_d[0]),
cldn_d(temp_2d_d[7]);
is incorrect. Each of these need two additional args to specify the size of the view. But shoc_functions.cpp is for unit tests only.
OK using todays master, I added fences to one file and see the same Assert error as above.
cori08% git diff components/scream/src/physics/shoc/shoc_tridiag_solver_impl.hpp
diff --git a/components/scream/src/physics/shoc/shoc_tridiag_solver_impl.hpp b/components/scream/src/physics/shoc/shoc_tridiag_solver_impl.hpp
index 16aaf5c659..e78cf9e2ad 100644
--- a/components/scream/src/physics/shoc/shoc_tridiag_solver_impl.hpp
+++ b/components/scream/src/physics/shoc/shoc_tridiag_solver_impl.hpp
@@ -28,6 +28,7 @@ void Functions<S,D>::vd_shoc_decomp(
const Int nlev_pack = ekat::npack<Spack>(nlev);
+ Kokkos::fence();//ndkf
// Compute entries of the tridiagonal system
Kokkos::parallel_for(Kokkos::TeamThreadRange(team, nlev_pack), [&] (const Int& k) {
@@ -72,6 +73,7 @@ void Functions<S,D>::vd_shoc_decomp(
d (range_pack[p]) = d_k [p];
}
});
+ Kokkos::fence();//ndkf
}
template<typename S, typename D>
Those fences are invalid; they can be placed only outside of a top-level parallel_for; i.e., before/after the GPU kernel dispatch. The ones you used here are inside of a kernel. (Inside of a kernel, you should instead use team.team_barrier().)
It looks to me like the fences are like so:
Kokkos::fence();
Kokkos::parallel_for(Kokkos::TeamThreadRange(team, nlev_pack), [&] (const Int& k) {
...
});
Kokkos::fence();
No, TeamThreadRange means you're inside a kernel. Look for KOKKOS_LAMBDA or a team policy as the first arg to parallel_for.
Thanks. After some initial excitement that we might have found something, I appear to no longer be seeing large temperatures in the output for ne30. In fact, the May 26th repo (that I was using for perf benchmarks) is not showing the large temperatures. However, I do get a runtime error with ne120 which I had assumed was because of this issue, but perhaps it is something new. With ne120, I can set NCPL=288 and run one day. It may be we can close this issue as I seem unable to repeat. Unfortunate that we don't know what change allowed this -- but it could also be something that is still outstanding and is just not being tripped.
This was corrected a long time ago by Andrew -- indeed there was a race condition in kokkos block.
After #1555, all our runs are crashing. This git issue is a place to document our efforts to identify/fix this.