E3SM-Project / scream

Fork of E3SM used to develop exascale global atmosphere model written in C++
https://e3sm-project.github.io/scream/

3 Diagnostic fields appear to not work on weaver when output #2015

Closed AaronDonahue closed 2 years ago

AaronDonahue commented 2 years ago

This issue was discovered when developing PR #1989. That PR also added an output stream to the homme_shoc_cld_spa_p3_rrtmgp unit test to output all of the available diagnostics. The test failed on weaver for three of the diagnostics:

We should investigate this error and fix it. Note, there were changes to the output stream in this PR, so it's possible the PR itself broke these diagnostics, although it is not clear why these three would be uniquely affected.

AaronDonahue commented 2 years ago

I've confirmed we get the same behavior w/ master.

AaronDonahue commented 2 years ago

Here is a sample of the error message:

KERNEL CHECK FAILED:
   !ekat::OnGpu<typename device_type::execution_space>::value || is_pow_of_2(team.team_size())
   Error! Team-level parallel_scan on CUDA only works for team sizes that are power of 2.
       You could try to reduce the team size to the previous pow of 2.
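
For reference, the failed check appears to be EKAT's guard on team-level scans: on CUDA the team size must be a power of 2. A minimal sketch of the workaround the message suggests, clamping a requested team size down to the previous power of 2 before building the policy (the function and variable names here are illustrative, not scream's actual API):

  #include <Kokkos_Core.hpp>

  // Largest power of 2 that does not exceed n (assumes n >= 1).
  inline int prev_pow_of_2 (int n) {
    int p = 1;
    while (2*p <= n) p *= 2;
    return p;
  }

  // Build a team policy whose team size satisfies the CUDA parallel_scan requirement.
  Kokkos::TeamPolicy<> make_pow2_policy (int league_size, int requested_team_size) {
    return Kokkos::TeamPolicy<>(league_size, prev_pow_of_2(requested_team_size));
  }
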
AaronDonahue commented 2 years ago

I've made some progress on tracking down the issue here. @bartgol, @tcclevenger, @ambrad or @jgfouca, you may be able to offer some further insight, because I think it has to do w/ the team policy. The three diagnostics that fail on weaver w/ the above error all use calculate_dz from the physics functions: https://github.com/E3SM-Project/scream/blob/master/components/scream/src/share/util/scream_common_physics_functions_impl.hpp#L299

There are two other diagnostics that use the same function and do work: atm_density and vertical_layer_thickness. Looking at the usage in both the working and non-working cases, there is a pattern.

WORKING: https://github.com/E3SM-Project/scream/blob/master/components/scream/src/diagnostics/atm_density.cpp#L55

  Kokkos::parallel_for("VerticalLayerThicknessDiagnostic",
                       Kokkos::RangePolicy<>(0,m_num_cols*npacks),
                       KOKKOS_LAMBDA(const int& idx) {
      const int icol  = idx / npacks;
      const int jpack = idx % npacks;
      dz(icol,jpack) = PF::calculate_dz(pseudo_density_mid(icol,jpack), p_mid(icol,jpack), T_mid(icol,jpack), qv_mid(icol,jpack));
  });

NOT-WORKING: https://github.com/E3SM-Project/scream/blob/master/components/scream/src/diagnostics/vertical_layer_interface.cpp#L59

  view_1d dz("",npacks);
  Kokkos::parallel_for("VerticalLayerInterfaceDiagnostic",
                       default_policy,
                       KOKKOS_LAMBDA(const MemberType& team) {
    const int icol = team.league_rank();
    const auto TTR = Kokkos::TeamThreadRange(team, npacks);
    Kokkos::parallel_for(Kokkos::TeamThreadRange(team, npacks), [&] (const Int& jpack) {
      dz(jpack) = PF::calculate_dz(pseudo_density_mid(icol,jpack), p_mid(icol,jpack), T_mid(icol,jpack), qv_mid(icol,jpack));
    });
    team.team_barrier();
    const auto& z_int_s = ekat::subview(z_int, icol);
    PF::calculate_z_int(team,num_levs,dz,surf_geopotential,z_int_s);
  });
AaronDonahue commented 2 years ago

I don't see anything inside the calculate_dz function that might trigger the error, but this function does call column_scan from column_ops: https://github.com/E3SM-Project/scream/blob/master/components/scream/src/share/util/scream_column_ops.hpp#L234
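
For context, here is a minimal sketch of how such a column scan maps onto Kokkos' team-level parallel_scan, which is where the power-of-2 requirement bites on CUDA. The names (column_scan_sketch, dz, z_int, z_surf) are illustrative, not the actual column_ops interface:

  #include <Kokkos_Core.hpp>

  // Cumulative sum of layer thicknesses into interface heights, done as a
  // team-level scan. On CUDA this requires team.team_size() to be a power of 2.
  template <typename TeamMember>
  KOKKOS_INLINE_FUNCTION
  void column_scan_sketch (const TeamMember& team, const int nlevs,
                           const Kokkos::View<double*>& dz,
                           const double z_surf,
                           const Kokkos::View<double*>& z_int) {
    Kokkos::parallel_scan(Kokkos::TeamThreadRange(team, nlevs),
                          [&] (const int k, double& accum, const bool final_pass) {
      accum += dz(k);
      if (final_pass) z_int(k) = z_surf + accum;  // inclusive sum up to level k
    });
  }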

AaronDonahue commented 2 years ago

Note, I was able to narrow it down to calculate_dz by simply commenting out the line, after which the tests run to completion. Also note, this only happens when the diagnostics are used as output; the unit tests for the three diagnostics pass on weaver.

bartgol commented 2 years ago

Oh, of course: you need dz to have size (ncols,npacks). Otherwise different teams will try to write on the same pack.
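
A sketch of what that fix might look like for the snippet above, giving dz a column dimension so each team writes only into its own slice (view_2d and the other names mirror the earlier snippets; the exact scream types are assumptions here):

  view_2d dz("dz",m_num_cols,npacks);            // one scratch column per team
  Kokkos::parallel_for("VerticalLayerInterfaceDiagnostic",
                       default_policy,
                       KOKKOS_LAMBDA(const MemberType& team) {
    const int icol = team.league_rank();
    const auto dz_s = ekat::subview(dz, icol);   // this team's private slice
    Kokkos::parallel_for(Kokkos::TeamThreadRange(team, npacks), [&] (const Int& jpack) {
      dz_s(jpack) = PF::calculate_dz(pseudo_density_mid(icol,jpack), p_mid(icol,jpack), T_mid(icol,jpack), qv_mid(icol,jpack));
    });
    team.team_barrier();
    const auto& z_int_s = ekat::subview(z_int, icol);
    PF::calculate_z_int(team,num_levs,dz_s,surf_geopotential,z_int_s);
  });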

AaronDonahue commented 2 years ago

ahhhh simple solution. Great, I'll give it a try

bartgol commented 2 years ago

Also, you should prob pre-allocate that view, so you don't call malloc every time the diagnostic is computed.

AaronDonahue commented 2 years ago

> Also, you should prob pre-allocate that view, so you don't call malloc every time the diagnostic is computed.

  view_1d dz("",npacks);

isn't that what this line does?

bartgol commented 2 years ago

To be fair, you don't need (ncols,npacks). You could use a WSM (WorkspaceManager) to use less memory. But (ncols,npacks) should def work.

bartgol commented 2 years ago

> > Also, you should prob pre-allocate that view, so you don't call malloc every time the diagnostic is computed.
>
>   view_1d dz("",npacks);
>
> isn't that what this line does?

That allocates the view at every call. I meant to allocate it at init time, and store it in the class.
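
A minimal sketch of that pattern, allocating the scratch view once at initialization and reusing it on every call (the class and member names below are illustrative, not the actual diagnostic class):

  struct VerticalLayerInterfaceDiagnosticSketch {
    view_2d m_dz;                     // scratch space, allocated once
    int     m_num_cols = 0;
    int     m_npacks   = 0;

    void initialize_impl (const int ncols, const int npacks) {
      m_num_cols = ncols;
      m_npacks   = npacks;
      m_dz = view_2d("dz", m_num_cols, m_npacks);  // single allocation at init time
    }

    void compute_diagnostic_impl () {
      const auto& dz = m_dz;          // reuse; no per-call allocation
      // ... launch the parallel_for from the sketch above, writing into dz ...
    }
  };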

AaronDonahue commented 2 years ago

ahhh okay

tcclevenger commented 2 years ago

Interesting that this only failed on weaver.. was it run on the cpu machines?

bartgol commented 2 years ago

> Interesting that this only failed on weaver.. was it run on the cpu machines?

I think on CPU we never exercise threads, except for p3/shoc functions. So on CPU, this is equiv to a single slot WSM.

AaronDonahue commented 2 years ago

Solved w/ #2023