E3SM-Project / scream

Fork of E3SM used to develop exascale global atmosphere model written in C++
https://e3sm-project.github.io/scream/
Other
77 stars 55 forks source link

In Perlmutter opt builds, the whole planet warms really rapidly #1557

Closed PeterCaldwell closed 2 years ago

PeterCaldwell commented 2 years ago

After #1555, all our runs are crashing. This git issue is a place to document our efforts to identify/fix this.

PeterCaldwell commented 2 years ago

As a sanity check, below are plots of ne120 before and after the omega fix. Compare in particular the colorbars for the bottom left panels (near-surf T at the end of the run)... before the omega fix, max is ~304K and afterwards it is 419K. Note as well the top left panel. This shows the timeseries (along the x axis) of the vertical profiles of global-ave T. First, note that both runs are approximately equal in length (x axis units is fraction of a day). Note as well that the post-omega fix has huge lowest-level T basically immediately and that huge value persists for the rest of the run. ne120_T_mid_pre-omegafix.pdf ne120_T_mid_post-omegafix.pdf

Of course, we don't know that it was the omega PR that caused this change because we had been using an older tag for eval, then jumped to the post-omega version. We need to bisect to confirm that omega was the source of trouble.

bartgol commented 2 years ago

Yeah, I think we should try the version right before the omega PR merge. Then we'll know if and how the change contributed to the bad T profile.

PeterCaldwell commented 2 years ago

Update: If I understand correctly, both of Noel's ne120 runs I analyze above were from the same repo and only differed in whether they had the omega bugfix tacked on. Which makes it sound like our bisection is done - it must be the omega bugfix. This is inconsistent, however, with a run Chris T just did using master from Apr 11th (way before the omega bugfix) which also seemed to get hot.

Another weird thing is that all of these hot runs were compiled in optimized mode. I figured this wouldn't matter, but when I analyzed a run Noel did with the bugfix but compiled in debug mode, the simulation ran longer and did not get really hot! ne120_T_mid_post-omegafix-debug-compile.pdf

bartgol commented 2 years ago

I'm assuming Noel runs are on GPU. Are Christ runs on CPU? In other words, does the arch seem to impact things?

singhbalwinder commented 2 years ago

I found something similar with the clubb code with the intel compiler on some architectures. A newer version fixed that issue. I tracked it down (using perturbation growth test) and it was indeed a compiler bug (More details here). Perhaps, we can use a newer version of the compiler to see if the problem goes away.

PeterCaldwell commented 2 years ago

Interesting comments, you two. I think both @crterai and @ndkeen are both using Perlmutter GPUs for their runs. I think Chris was going to try running on Cori - he would need to turn off SPA to do that to avoid OOM - but I'm not sure what the results of that are...

crterai commented 2 years ago

My simulations that get hot have been on GPUs as well. I do find it odd that the simulations with DEBUG compile seem to run longer. Mine have all not been debug compile.

bartgol commented 2 years ago

We should fast-track the SPA pr perhaps, and try on CPU as well. That might tell us if the issue is related to the backend.

Though we should figure out the discrepancy between Noel and Chris regarding when the code started to go bananas.

PeterCaldwell commented 2 years ago

I just ran ~6 hrs of ne30 with optimized build on quartz. Max T is 350K (over Australia) by the end of the run. This seems a bit high, but is nothing like what we're seeing on perlmutter and the run continued 4x longer than Chris has managed so far. So I think we can conclude there's probably something wrong with Perlmutter. It would be good to verify on Cori.

I should mention that this run does crash eventually due to qv [not] within interval [1e-13, 0.2] in microphysics.

crterai commented 2 years ago

I can confirm that on cori, the simulation doesn't get ridiculously hot as it did on PM. The simulation still crashes for a different reason around the same time the others have crashed (1hr30min).

474: WARNING: Post-condition property check failed and repaired.
474:   - Property check name: CheckAndRepairWrapper
474:   - Check : tracers lower bound check: -1e-17
474:   - Repair: Positivity check for field tracers
474: 
474:   - Atmosphere process name: Dynamics
474:   - Atmosphere process MPI Rank: 474
474: forrtl: error (65): floating invalid
474: Image              PC                Routine            Line        Source             
474: e3sm.exe           0000000008493324  Unknown               Unknown  Unknown
474: e3sm.exe           0000000006CAF1D0  Unknown               Unknown  Unknown
474: e3sm.exe           00000000055CCC2A  shr_const_mod_mp_          94  shr_const_mod.F90
474: e3sm.exe           000000000542E8E8  mct_mod_mp_mct_av        1025  mct_mod.F90
474: e3sm.exe           000000000047C5E5  component_mod_mp_         760  component_mod.F90
474: e3sm.exe           000000000043B48C  cime_comp_mod_mp_        3085  cime_comp_mod.F90
474: e3sm.exe           000000000046354C  MAIN__                    153  cime_driver.F90
474: e3sm.exe           00000000004020F2  Unknown               Unknown  Unknown
474: e3sm.exe           0000000008A3E34F  Unknown               Unknown  Unknown
474: e3sm.exe           0000000000401FDA  Unknown               Unknown  Unknown
srun: error: nid02620: task 474: Aborted
ndkeen commented 2 years ago

I verified that with a master of April 27th, I also see this issue. With ne30, fully OPT builds will run for 14 steps and stop with

1: terminate called after throwing an instance of 'std::logic_error'
1:   what():  /global/cfs/cdirs/e3sm/ndk/se02-apr27/components/scream/src/share/atm_process/atmosphere_process.cpp:282: FAIL:
1: false
1: Error! Failed post-condition check (cannot be repaired).
1:   - Atm process name: Dynamics
1:   - Property check name: T_mid within interval [140, 500]
1:   - Atmosphere process MPI Rank: 1
1:   - Error message: FieldWithinIntervalCheck failed; min = 188.589752; max = 523.822598

Chris T verified it looks like the same too hot issue: https://portal.nersc.gov/cfs/e3sm/terai/SCREAM/v1_analysis/ne30_v1_PM_quicktest_220428.pdf

PeterCaldwell commented 2 years ago

Implementing #1572 (UVM/run select kernels on host) would be useful for debugging this problem, which raises it in my priority...

bartgol commented 2 years ago

@PeterCaldwell I am working on it in EKAT. So far it seems like it's going well. I hope to finish it tomorrow.

ndkeen commented 2 years ago

When we build DEBUG, the results seem OK. When we build OPT, it will still run (mileage may vary), but the results quickly look wild. I have spent a lot of time trying to figure out what's happening and after a lot of trial-n-error, I have found that I can get the same behavior as the DEBUG runs by turning on certain KOKKOS debug macros.

      set(Kokkos_ENABLE_DEBUG TRUE CACHE BOOL "")
      set(Kokkos_ENABLE_DEBUG_BOUNDS_CHECK TRUE CACHE BOOL "")

If I add those 2 lines for CUDA/OPT builds, it seems to run OK on PM. And at about half the speed we expect (compared to quite a bit slower with full DEBUG). Note this is otherwise optimized build -- ie this is the ONLY thing needed (other than I also need the same work-around for the team size too large issue https://github.com/E3SM-Project/scream/issues/1485).

Note that the BOUNDS_CHECK is really what matters here, but I have to also set the first macro to get around various team size too large issues.

Chatting with C Trott, he suggested this could be a race condition as it is essentially slowing down the GPU's.

I spent some time looking closer at how those flags are used within KOKKOS and learned a little bit more, but ultimately had to give up trying to find a change that has lower impact.

If I could find a way to alter each parallel section to be "safer" in some way, I might be able to rule out race condition or narrow down location.

@ndkeen

ambrad commented 2 years ago

Could be missing team_barrier or fence calls. Given that the unit tests for the low-level components pass on weaver in the release build, the likeliest spots missing such barriers are the interfaces.

ndkeen commented 2 years ago

Is there something I can add after various kokkos sections to see if it helps?

ambrad commented 2 years ago

Well, outside of top-level parallelfor/reduce/scans you can add Kokkos::fence(). For completeness, put one call before and one call after the parallel*.

Inside of a kernel -- that is, inside the top-level parallel_* -- you can write team.team_barrier() before and after each TeamThreadRange loop.

ndkeen commented 2 years ago

I've been adding fences before/after any parallel_ regions. Within P3 I did that and did not see any changes. And within shoc, I've been getting various runtime fails. I'm doing:

  Kokkos::fence();
  Kokkos::parallel_for(Kokkos::TeamThreadRange(team, nlev_pack), [&] (const Int& k) {
     ...
  });
  Kokkos::fence();

why would adding these cause runtime fails in certain locations?

ambrad commented 2 years ago

What are the runtime fail messages? I agree with you that it seems that putting fences right after each kernel should not cause a failure. But maybe it's changing the solution, which then causes a failure.

ndkeen commented 2 years ago

While some sets of files including fences will run (though T still too large), others will fail with either:

0: :0: : block: [284,0,0], thread: [0,32,0] Assertion `Number of arguments passed to Kokkos::View() constructor must match the dynamic rank of the view.` failed.

or

0: terminate called after throwing an instance of 'std::runtime_error'
0:   what():  cudaDeviceSynchronize() error( cudaErrorIllegalAddress): an illegal memory access was encountered /global/cfs/cdirs/e3sm/ndk/se04-may24/externals/ekat/extern/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:150
0: Traceback functionality not available
...
0:      at /global/cfs/cdirs/e3sm/ndk/se04-may24/externals/ekat/extern/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:964
0: #15  0x19b96be in _ZN6Kokkos4Impl16ExecSpaceManager12static_fenceERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
0:      at /global/cfs/cdirs/e3sm/ndk/se04-may24/externals/ekat/extern/kokkos/core/src/impl/Kokkos_Core.cpp:135
0: #16  0x19b96be in _ZN6Kokkos4Impl48_GLOBAL__N__38a2ddd6_15_Kokkos_Core_cpp_889c95a614fence_internalERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
0:      at /global/cfs/cdirs/e3sm/ndk/se04-may24/externals/ekat/extern/kokkos/core/src/impl/Kokkos_Core.cpp:568
0: #17  0x19b96be in _ZN6Kokkos5fenceEv
0:      at /global/cfs/cdirs/e3sm/ndk/se04-may24/externals/ekat/extern/kokkos/core/src/impl/Kokkos_Core.cpp:1097
0: #18  0x17e82e8 in _ZN6scream4shoc9FunctionsIdN6Kokkos6DeviceINS2_4CudaENS2_9CudaSpaceEEEE9shoc_mainERKiS9_S9_S9_S9_S9_RKdRKN4ekat16WorkspaceManagerINSC_4PackIdLi1EEES6_EERKNS7_9SHOCInputERKNS7_15SHOCInputOutputERKNS7_10SHOCOutputERKNS7_17SHOCHistoryOutputE
0:      at /global/cfs/cdirs/e3sm/ndk/se04-may24/components/scream/src/physics/shoc/shoc_main_impl.hpp:380
0: #19  0x17d517a in _ZN6scream16SHOCMacrophysics8run_implEi
0:      at /global/cfs/cdirs/e3sm/ndk/se04-may24/components/scream/src/physics/shoc/atmosphere_macrophysics.cpp:411
0: #20  0x18ebcfc in _ZN6scream17AtmosphereProcess3runEi
0:      at /global/cfs/cdirs/e3sm/ndk/se04-may24/components/scream/src/share/atm_process/atmosphere_process.cpp:70
...
ambrad commented 2 years ago

That's a Kokkos View usage error. Interesting. I wonder why we don't see this in our CPU runs. You can bisect for the incorrect usage by going into shoc_main_internal and placing a return statement at various locations until you home in on the function that is causing this error.

ambrad commented 2 years ago

It's also possible that cuda's memcheck will help pinpoint the location of the error.

ndkeen commented 2 years ago

So it would make sense that adding fences (in locations like above) would cause errors like this? Note that I'm not trying these changes on CPU, just GPU.

ambrad commented 2 years ago

Yes, a fence will halt everything and wait for the GPU to catch up. The fact that we don't see the error in a standard CPU run is puzzling, however.

ambrad commented 2 years ago

I'll note that usage of this sort in shoc_functions.cpp

    view_2d
      z_d(temp_2d_d[0]),
      cldn_d(temp_2d_d[7]);

is incorrect. Each of these need two additional args to specify the size of the view. But shoc_functions.cpp is for unit tests only.

ndkeen commented 2 years ago

OK using todays master, I added fences to one file and see the same Assert error as above.

cori08% git diff components/scream/src/physics/shoc/shoc_tridiag_solver_impl.hpp
diff --git a/components/scream/src/physics/shoc/shoc_tridiag_solver_impl.hpp b/components/scream/src/physics/shoc/shoc_tridiag_solver_impl.hpp
index 16aaf5c659..e78cf9e2ad 100644
--- a/components/scream/src/physics/shoc/shoc_tridiag_solver_impl.hpp
+++ b/components/scream/src/physics/shoc/shoc_tridiag_solver_impl.hpp
@@ -28,6 +28,7 @@ void Functions<S,D>::vd_shoc_decomp(

   const Int nlev_pack = ekat::npack<Spack>(nlev);

+  Kokkos::fence();//ndkf
   // Compute entries of the tridiagonal system
   Kokkos::parallel_for(Kokkos::TeamThreadRange(team, nlev_pack), [&] (const Int& k) {

@@ -72,6 +73,7 @@ void Functions<S,D>::vd_shoc_decomp(
       d (range_pack[p]) = d_k [p];
     }
   });
+  Kokkos::fence();//ndkf
 }

 template<typename S, typename D>
ambrad commented 2 years ago

Those fences are invalid; they can be placed only outside of a top-level parallel_for; i.e., before/after the GPU kernel dispatch. The ones you used here are inside of a kernel. (Inside of a kernel, you should instead use team.team_barrier().)

ndkeen commented 2 years ago

It looks to me like the fences are like so:

  Kokkos::fence();
  Kokkos::parallel_for(Kokkos::TeamThreadRange(team, nlev_pack), [&] (const Int& k) {
     ...
  });
  Kokkos::fence();
ambrad commented 2 years ago

No, TeamThreadRange means you're inside a kernel. Look for KOKKOS_LAMBDA or a team policy as the first arg to parallel_for.

ndkeen commented 2 years ago

Thanks. After some initial excitement that we might have found something, I appear to no longer be seeing large temperatures in the output for ne30. In fact, the May 26th repo (that I was using for perf benchmarks) is not showing the large temperatures. However, I do get a runtime error with ne120 which I had assumed was because of this issue, but perhaps it is something new. With ne120, I can set NCPL=288 and run one day. It may be we can close this issue as I seem unable to repeat. Unfortunate that we don't know what change allowed this -- but it could also be something that is still outstanding and is just not being tripped.

ndkeen commented 2 years ago

This was corrected a long time ago by Andrew -- indeed there was a race condition in kokkos block.