E3SM-Project / scream

Fork of E3SM used to develop exascale global atmosphere model written in C++
https://e3sm-project.github.io/scream/

v1 initial run is not BFB-invariant to PE layout #1676

Closed AaronDonahue closed 1 year ago

AaronDonahue commented 2 years ago

The homme_shoc_cld_spa_p3_rrtmgp test in the SCREAM unit test suite fails when comparing solutions from running with a different number of ranks, when the grid is switched to ne4.

Note that this test does not fail when the grid is the default ne2.

AaronDonahue commented 2 years ago

Testing master with hash: 11787720eb1936d5ee2963097885164bca30e087 (May 24th) and switching the homme_shoc_cld_spa_p3_rrtmgp test to ne4 yields

The following tests FAILED:
    202 - homme_shoc_cld_spa_p3_rrtmgp_np1_vs_np3_bfb (Failed)
    203 - homme_shoc_cld_spa_p3_rrtmgp_np1_vs_np4_bfb (Failed)

Note, this means that np1 vs. np2 passed. I confirmed this manually.

AaronDonahue commented 2 years ago

Putting the diffs to master here for reproducibility:

diff --git a/components/scream/tests/coupled/dynamics_physics/homme_shoc_cld_spa_p3_rrtmgp/CMakeLists.txt b/components/scream/tests/coupled/dynamics_physics/homme_shoc_cld_spa_p3_rrtmgp/CMakeLists.txt
index dd045cc..c21e536 100644
--- a/components/scream/tests/coupled/dynamics_physics/homme_shoc_cld_spa_p3_rrtmgp/CMakeLists.txt
+++ b/components/scream/tests/coupled/dynamics_physics/homme_shoc_cld_spa_p3_rrtmgp/CMakeLists.txt
@@ -29,7 +29,7 @@ configure_file(${CMAKE_CURRENT_SOURCE_DIR}/homme_shoc_cld_spa_p3_rrtmgp_output.y

 # Set homme's test options, so that we can configure the namelist correctly
 # Discretization/algorithm settings
-set (HOMME_TEST_NE 2)
+set (HOMME_TEST_NE 4)
 set (HOMME_TEST_LIM 9)
 set (HOMME_TEST_REMAP_FACTOR 3)
 set (HOMME_TEST_TRACERS_FACTOR 1)
@@ -64,7 +64,7 @@ set (TEST_INPUT_FILES
   init/spa_init_ne2np4.nc
   init/map_ne4np4_to_ne2np4_mono.nc
   init/spa_file_unified_and_complete_ne4_20220428.nc
-  init/homme_shoc_cld_spa_p3_rrtmgp_init_ne2np4.nc
+  init/init_ne4np4.nc
 )
 foreach (file IN ITEMS ${TEST_INPUT_FILES})
   GetInputFile(${file})
diff --git a/components/scream/tests/coupled/dynamics_physics/homme_shoc_cld_spa_p3_rrtmgp/input.yaml b/components/scream/tests/coupled/dynamics_physics/homme_shoc_cld_spa_p3_rrtmgp/input.yaml
index 8ad376b..ed2b7df 100644
--- a/components/scream/tests/coupled/dynamics_physics/homme_shoc_cld_spa_p3_rrtmgp/input.yaml
+++ b/components/scream/tests/coupled/dynamics_physics/homme_shoc_cld_spa_p3_rrtmgp/input.yaml
@@ -10,7 +10,7 @@ Time Stepping:
   Number of Steps: ${NUM_STEPS}

 Initial Conditions:
-  Filename: ${SCREAM_DATA_DIR}/init/homme_shoc_cld_spa_p3_rrtmgp_init_ne2np4.nc
+  Filename: ${SCREAM_DATA_DIR}/init/init_ne4np4.nc #homme_shoc_cld_spa_p3_rrtmgp_init_ne2np4.nc
   Physics GLL:
     surf_latent_flux: 0.0
     surf_sens_flux: 0.0
@@ -21,7 +21,7 @@ atmosphere_processes:
   homme:
     Process Name: Homme
     Enable Precondition Checks: false
-    Vertical Coordinate Filename: ${SCREAM_DATA_DIR}/init/homme_shoc_cld_spa_p3_rrtmgp_init_ne2np4.nc
+    Vertical Coordinate Filename: ${SCREAM_DATA_DIR}/init/init_ne4np4.nc #homme_shoc_cld_spa_p3_rrtmgp_init_ne2np4.nc
     Moisture: moist
   physics:
     atm_procs_list: (mac_aero_mic,rrtmgp)
@@ -38,7 +38,7 @@ atmosphere_processes:
         Grid: Physics GLL
       spa:
         Grid: Physics GLL
-        SPA Remap File: ${SCREAM_DATA_DIR}/init/map_ne4np4_to_ne2np4_mono.nc
+        SPA Remap File: none # ${SCREAM_DATA_DIR}/init/map_ne4np4_to_ne2np4_mono.nc
         SPA Data File: ${SCREAM_DATA_DIR}/init/spa_file_unified_and_complete_ne4_20220428.nc
       p3:
         Grid: Physics GLL
diff --git a/components/scream/tests/coupled/physics_only/shoc_cld_spa_p3_rrtmgp/CMakeLists.txt b/components/scream/tests/coupled/physics_only/shoc_cld_spa_p3_rrtmgp/CMakeLists.txt
index 037bc29..428a663 100644
--- a/components/scream/tests/coupled/physics_only/shoc_cld_spa_p3_rrtmgp/CMakeLists.txt
+++ b/components/scream/tests/coupled/physics_only/shoc_cld_spa_p3_rrtmgp/CMakeLists.txt
@@ -21,7 +21,7 @@ set (TEST_INPUT_FILES
   init/shoc_cld_spa_p3_rrtmgp_init_ne2np4.nc
   init/spa_init_ne2np4.nc
   init/map_ne4np4_to_ne2np4_mono.nc
-  init/spa_file_unified_and_complete_ne4_20220428.nc
+  init/init_ne4np4.nc 
 )
 foreach (file IN ITEMS ${TEST_INPUT_FILES})
   GetInputFile(${file})
diff --git a/components/scream/tests/coupled/physics_only/shoc_cld_spa_p3_rrtmgp/input.yaml b/components/scream/tests/coupled/physics_only/shoc_cld_spa_p3_rrtmgp/input.yaml
index 7ce4e20..237f297 100644
--- a/components/scream/tests/coupled/physics_only/shoc_cld_spa_p3_rrtmgp/input.yaml
+++ b/components/scream/tests/coupled/physics_only/shoc_cld_spa_p3_rrtmgp/input.yaml
@@ -25,7 +25,7 @@ atmosphere_processes:
       Grid: Point Grid
     spa:
       Grid: Point Grid
-      SPA Remap File: ${SCREAM_DATA_DIR}/init/map_ne4np4_to_ne2np4_mono.nc
+      SPA Remap File: none # ${SCREAM_DATA_DIR}/init/map_ne4np4_to_ne2np4_mono.nc
       SPA Data File: ${SCREAM_DATA_DIR}/init/spa_file_unified_and_complete_ne4_20220428.nc
   rrtmgp:
     Grid: Point Grid
@@ -34,12 +34,12 @@ atmosphere_processes:

 Grids Manager:
   Type: Mesh Free
-  Number of Global Columns:   218
+  Number of Global Columns:   866
   Number of Vertical Levels:  72  # Will want to change to 128 when a valid unit test is available.

 Initial Conditions:
   # The name of the file containing the initial conditions for this test.
-  Filename: ${SCREAM_DATA_DIR}/init/shoc_cld_spa_p3_rrtmgp_init_ne2np4.nc
+  Filename: ${SCREAM_DATA_DIR}/init/init_ne4np4.nc #shoc_cld_spa_p3_rrtmgp_init_ne2np4.nc
   Point Grid:
     Load Latitude:  true
     Load Longitude: true

bartgol commented 2 years ago

Does the test fail even with 1 timestep?

If so, does this depend on what parametrizations are used in physics? E.g., if you run 1 step of homme+p3 only, is it BFB?

Note: if 1 timestep passes, then it gets harder, cause we need the reduced process list to not crash in order to run it for 2+ timesteps.

AaronDonahue commented 2 years ago

@bartgol , I'll check this morning and respond.

In the meantime, just to note. Running just shoc_cld_spa_p3_rrtmgp on an ne4 grid passes. So there is some dependency on homme for the failure, or potentially just horizontal communication. Given the last red herring, it's possible there is some wrap-around effect happening here, but I'm pretty sure I'm setting homme to se_ne=4. I'll check one more time though.

AaronDonahue commented 2 years ago

namelist.nl for the homme+physics that fails:

cubed_sphere_map        = 0
disable_diagnostics     = .false.
dt_remap_factor         = 3
dt_tracer_factor        = 1
hypervis_order          = 2
hypervis_scaling        = 0
hypervis_subcycle       = 1
hypervis_subcycle_tom   = 0
nu                      = 7e15
nu_top                  = 2.5e5
se_ftype                = 0
se_geometry             = "sphere"
se_limiter_option       = 9
se_ne                   = 4
se_nsplit               = -1
se_partmethod           = 4
se_topology             = "cube"
se_tstep                = 300
statefreq               = 9999
theta_advect_form       = 1
theta_hydrostatic_mode  = true
tstep_type              = 5
vert_remap_q_alg        = 1

AaronDonahue commented 2 years ago

and just looking at homme output for the test and the actual simulation output confirms that it is running an ne4 case.

AaronDonahue commented 2 years ago

@bartgol , to answer your first question, it looks like the fail doesn't show up until step = 2. So all tests pass if we only take 1 step. I'll test homme+p3 next. I think we can get 2 steps without crashing.

AaronDonahue commented 2 years ago

List of tests and results: (np24 vs. np48)

  1. homme_shoc_cld_spa_p3_rrtmgp - FAIL
  2. shoc_cld_spa_p3_rrtmgp - PASS
  3. homme_shoc_cld_spa_p3 - FAIL
  4. homme_shoc_cld_spa - PASS
  5. homme_p3 - PASS
  6. homme_cld_spa_p3 - PASS
  7. homme_shoc_cld_p3 - PASS

Based on this, unfortunately, it looks like a FAIL is triggered only when we use homme plus all of the mac_mic processes together. At least we can rule out rrtmgp, but we can't isolate just one physics process.

ambrad commented 2 years ago

How many time steps are run? Perhaps you'll get more FAILs if the test runs longer.

AaronDonahue commented 2 years ago

@ambrad , I thought of that. So far I'm taking 4 steps. I can try a run with something longer, like 40 steps. I'll note that all of the fails listed above occur at tstep=2, so in those cases it's rather quick.

AaronDonahue commented 2 years ago

Okay, finished a 40 step run with homme_shoc_cld_spa and that also PASSES.

ambrad commented 2 years ago

I ran ne30 SCREAMv1 on 256 and 320 ranks on Chrysalis and get diffs plotted here for qv at the surface (columns are run1, run2, log10(abs(run2 - run1))): https://web.lcrc.anl.gov/public/e3sm/diagnostic_output/ac.ambradl/scream/ne30v1.diff/ne30v1-qv.html. Output is every time step for one day.

AaronDonahue commented 2 years ago

Digging down deeper... running homme_shoc_cld_p3 but commenting out sections of p3

  1. Don't call p3_main - PASS
  2. Call only p3_main_init in p3_main - PASS
  3. No sedimentation, so just p3_init and p3_main_part1,2,3 - PASS
  4. only cloud sedimentation - FAIL
  5. only rain sedimentation - FAIL
  6. cloud sedimentation with generalized_sedimentation commented out - PASS

So it seems to be related to the upwind scheme.
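
For reference on what the upwind scheme in item 6 does, here is a minimal standalone sketch of a first-order upwind sedimentation step (illustrative only, not the P3 kernel; names are invented). The real code subcycles with dt_sub = dt_left/(Co_max + 1) so that each sub-step satisfies the Courant limit this scheme needs.

#include <vector>

// Minimal 1-D first-order upwind sedimentation step. k = 0 is the top cell,
// the fall speed w >= 0 points downward, and dz is a uniform layer thickness.
// Stability requires dt * max(w) / dz <= 1.
void upwind_sedimentation_step(std::vector<double>& q, const std::vector<double>& w,
                               double dz, double dt) {
  const int n = static_cast<int>(q.size());
  std::vector<double> flux(n);
  for (int k = 0; k < n; ++k) flux[k] = w[k] * q[k];      // flux leaving cell k downward
  std::vector<double> qnew(q);
  for (int k = 0; k < n; ++k) {
    const double flux_in = (k == 0) ? 0.0 : flux[k - 1];  // flux entering from the cell above
    qnew[k] = q[k] + dt / dz * (flux_in - flux[k]);
  }
  // Mass leaving through the bottom face (flux[n-1]) would be accumulated as
  // precipitation in the real code.
  q = qnew;
}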

PeterCaldwell commented 2 years ago

Thanks for the plots, Andrew. I'm not sure what we can conclude from them other than differences between runs start small in a lot of different places and grow over time (as one would expect for non-bfb weather simulations). Am I missing something?

ambrad commented 2 years ago

@PeterCaldwell I was interested in the nature of the diffs: how they emerge spatially and temporally, and how much they influence the solution. I think there are interesting points in the first three rows.

ambrad commented 2 years ago

One possibility is this is a matter of uninitialized variables. Different PE layouts can reveal these.
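
To illustrate the kind of bug being suggested (a sketch with invented names, not SCREAM code): a view allocated without initialization and read before it is written returns whatever the allocator hands back, and with a different PE layout that memory can hold different leftover values.

#include <Kokkos_Core.hpp>
#include <cstdio>

double column_sum_with_scratch(const int nlev) {
  Kokkos::View<double*> q("q", nlev);  // zero-initialized by default
  // Skipping the default fill mimics a reused, never-reset workspace slot.
  Kokkos::View<double*> scratch(Kokkos::ViewAllocateWithoutInitializing("scratch"), nlev);
  double sum = 0.0;
  Kokkos::parallel_reduce("sum", nlev, KOKKOS_LAMBDA(const int k, double& s) {
    s += q(k) + scratch(k);  // BUG: scratch(k) is read before it is ever written
  }, sum);
  return sum;
}

int main(int argc, char** argv) {
  Kokkos::ScopeGuard guard(argc, argv);
  std::printf("%g\n", column_sum_with_scratch(72));  // result is not reproducible
  return 0;
}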

ambrad commented 2 years ago

@AaronDonahue what exactly do you mean by "cloud sedimentation with generalized_sedimentation commented out"? To break the while loop on dt_left, we need generalized_sedimentation to compute dt_sub and then update dt_left. Thus, I tried #if 0-ing out everything but that dt calc, like this:

template <typename S, typename D>
template <int nfield>
KOKKOS_FUNCTION
void Functions<S,D>
::generalized_sedimentation (
  const uview_1d<const Spack>& rho,
  const uview_1d<const Spack>& inv_rho,
  const uview_1d<const Spack>& inv_dz,
  const MemberType& team,
  const Int& nk, const Int& k_qxtop, Int& k_qxbot, const Int& kbot, const Int& kdir, const Scalar& Co_max, Scalar& dt_left, Scalar& prt_accum,
  const view_1d_ptr_array<Spack, nfield>& fluxes,
  const view_1d_ptr_array<Spack, nfield>& Vs, // (behaviorally const)                                                                                                                                                              
  const view_1d_ptr_array<Spack, nfield>& rs)
{
  // compute dt_sub                                                                                                                                                                                                                
  EKAT_KERNEL_ASSERT(Co_max >= 0);
  const Int tmpint1 = static_cast<int>(Co_max + 1);
  const Scalar dt_sub = dt_left/tmpint1;
#if 0
  // Move bottom cell down by 1 if not at ground already                                                                                                                                                                           
  const Int k_temp = (k_qxbot == kbot) ? k_qxbot : k_qxbot - kdir;

  calc_first_order_upwind_step<nfield>(rho, inv_rho, inv_dz, team, nk, k_temp, k_qxtop, kdir, dt_sub, fluxes, Vs, rs);
  team.team_barrier();

  // accumulated precip during time step                                                                                                                                                                                           
  if (k_qxbot == kbot) {
    const auto sflux0 = scalarize(*fluxes[0]);
    prt_accum += sflux0(kbot) * dt_sub;
  }
  else {
    k_qxbot -= kdir;
  }
#endif
  // update time remaining for sedimentation                                                                                                                                                                                       
  dt_left -= dt_sub;
}

When I do this, I still see the same diff pattern.

AaronDonahue commented 2 years ago

yeah, you're right @ambrad , thanks for checking my work. There was a typo in how I commented out the generalized_sedimentation routine that effectively nullified all of cloud_sedimentation. Which points to just the interface of the sedimentation. I'll keep digging today.

AaronDonahue commented 2 years ago

Okay, I think I've narrowed it down to SPA. It's the application of prescribed_CCN in P3 that appears to make the difference. The test passes when I either set nc_activated (soon to be nccn) to 0 in SPA, or when I simply comment out

      if (do_prescribed_CCN) {
//         nc(k).set(not_drymass, max(nc(k), nccn_prescribed(k)));
      } else if (predictNc) {
         nc(k).set(not_drymass, max(nc(k) + nc_nuceat_tend(k) * dt, 0.0));
      } else {
         // nccnst is in units of #/m3 so needs to be converted.
         nc(k).set(not_drymass, nccnst*inv_rho(k));
      }

in p3_main_impl_part1.hpp.

ambrad commented 2 years ago

@AaronDonahue I tried commenting that out, and the ne30 1-day runs still diff. But if I return from P3 main immediately, it doesn't. So it is something in P3, but it seems not to be just that one line.

AaronDonahue commented 2 years ago

@ambrad , what happens if you hijack SPA and force nc_activated to be uniformly 0.0? Something like:

[asdonah@blake spa]$ git diff atmosphere_prescribed_aerosol.cpp 
diff --git a/components/scream/src/physics/spa/atmosphere_prescribed_aerosol.cpp b/components/scream/src/physics/spa/atmosphere_prescribed_aerosol.cpp
index 0538e63..bbc25d1 100644
--- a/components/scream/src/physics/spa/atmosphere_prescribed_aerosol.cpp
+++ b/components/scream/src/physics/spa/atmosphere_prescribed_aerosol.cpp
@@ -218,6 +218,8 @@ void SPA::run_impl (const int dt)
   const auto& pmid_tgt = get_field_in("p_mid").get_view<const Pack**>();
   SPAFunc::spa_main(SPATimeState, pmid_tgt, m_buffer.p_mid_src,
                     SPAData_start,SPAData_end,m_buffer.spa_temp,SPAData_out);
+
+  Kokkos::deep_copy(SPAData_out.CCN3,0.0);
 }

 // =========================================================================================

ambrad commented 2 years ago

I ran a simulation with the SPA 0-copy and still see the diff pattern.

AaronDonahue commented 2 years ago

okay, then the homme+physics unit test is not sufficient for debugging this problem... Good to know. I've verified that without that change it fails, and with the change it passes. So we may be dealing with a case where the unit test is missing something that the CIME cases catch.

Just to humor me, what happens if we zero out all of the forcing from SPA? That is, adding Kokkos::deep_copy(SPAData_out.X,0.0) for X = AER_G_SW, AER_SSA_SW, AER_TAU_SW, and AER_TAU_LW?

AaronDonahue commented 2 years ago

or simpler would be to just turn SPA off in the scream input

AaronDonahue commented 2 years ago

There appears to be an issue with SPA not reading its data properly, which I am trying to track down.

ambrad commented 2 years ago

Sorry for the delay. Based on your suggestion, I did this:

--- a/components/scream/src/physics/spa/atmosphere_prescribed_aerosol.cpp                                                                                                                                                          
+++ b/components/scream/src/physics/spa/atmosphere_prescribed_aerosol.cpp                                                                                                                                                          
@@ -164,7 +164,7 @@ void SPA::initialize_impl (const RunType /* run_type */)                                                                                                                                                       
@@ -197,6 +197,12 @@ void SPA::initialize_impl (const RunType /* run_type */)                                                                                                                                                      
   // upper bound set to 1.01 as max(g_sw)=1.00757 in current ne4 data assumingly due to remapping                                                                                                                                 
   // add an epslon to max possible upper bound of aero_ssa_sw                                                                                                                                                                     

+  Kokkos::deep_copy(SPAData_out.CCN3, 0);                                                                                                                                                                                         
+  Kokkos::deep_copy(SPAData_out.AER_G_SW, 0);                                                                                                                                                                                     
+  Kokkos::deep_copy(SPAData_out.AER_SSA_SW, 0);                                                                                                                                                                                   
+  Kokkos::deep_copy(SPAData_out.AER_TAU_SW, 0);                                                                                                                                                                                   
+  Kokkos::deep_copy(SPAData_out.AER_TAU_LW, 0);                                                                                                                                                                                   
+                                                                                                                                                                                                                                  
   add_postcondition_check<FieldWithinIntervalCheck>(get_field_out("aero_g_sw"),m_grid,0.0,1.0,true);                                                                                                                              
   add_postcondition_check<FieldWithinIntervalCheck>(get_field_out("aero_ssa_sw"),m_grid,0.0,1.0,true);                                                                                                                            
   add_postcondition_check<FieldWithinIntervalCheck>(get_field_out("aero_tau_sw"),m_grid,0.0,1.0,true);                                                                                                                            
@@ -206,6 +212,7 @@ void SPA::initialize_impl (const RunType /* run_type */)                                                                                                                                                       
 // =========================================================================================                                                                                                                                      
 void SPA::run_impl (const int dt)                                                                                                                                                                                                 
 {                                                                                                                                                                                                                                 
+  return;                                                                                                                                                                                                                         
   /* Gather time and state information for interpolation */                                                                                                                                                                       
   auto ts = timestamp();                                                                                                                                                                                                          
   /* Update the SPATimeState to reflect the current time, note the addition of dt */                                                                                                                                              
@@ -217,6 +224,8 @@ void SPA::run_impl (const int dt)                                                                                                                                                                              
   const auto& pmid_tgt = get_field_in("p_mid").get_view<const Pack**>();                                                                                                                                                          
   SPAFunc::spa_main(SPATimeState, pmid_tgt, m_buffer.p_mid_src,                                                                                                                                                                   
                     SPAData_start,SPAData_end,m_buffer.spa_temp,SPAData_out);                                                                                                                                                     
 }

But it still diffs.

To be sure I wasn't somehow checking the wrong output files or something, I retested with an immediate return from p3_main (which previously I had noted resolves the diffs), and see no diff. So my test results are very likely correct.

I'll do some bisecting inside P3 to see if I can find anything.

ambrad commented 2 years ago

Right now it's not clear to me that P3 is actually the problem. I disabled code to the point that I could cause the diff or prevent it by toggling a single line of code (the nc.set line in cloud_dsd2), but I think that's a red herring; that line of code is not really the problem. I need to think about this issue some more.

ambrad commented 2 years ago

Status: I updated to today's master branch (f1c36273f0). I still see the diffs (as expected). My comparison runs are configured to do 6 time steps and use ATM_NCPL=288, dt_remap_factor=1, no process subcycling, GNU debug, pack size 1, ne30 F2010-SCREAMv1, on Chrysalis.

AaronDonahue commented 2 years ago

@ambrad , thanks for the update. I was able to re-create your fails on quartz using master from yesterday (8ba92eb350c18dea27febd5371170634c8c0f47b) running F2010-SCREAMv1 on an ne4 mesh for one day, out-of-the-box, meaning with the subcycling, etc. on 48 and 96 ranks respectively. I'm testing both intel-DEBUG and intel-OPT.

As you pointed out, I can get the PASS by commenting out a small section in P3, in my case all of p3_main_part1, but I agree this is likely a red herring. Especially because your line is in a different place than mine.

Could the location of the line that works be compiler/machine related? On blake I could get it to pass by commenting out the nc_activated line inside p3_main_part1, but on quartz I need to skip p3_main_part1 altogether, and for you it's the dsd2 line in cloud sedimentation.

Note, I don't think we should start chasing phantoms in the code, but maybe the compiler/machine dependency is a helpful clue.

ambrad commented 2 years ago

I suspect there is something like uninitialized workspace somewhere, causing difficulties in isolating the exact cause. (But it's probably init'ed memory from valgrind's viewpoint; I'm speculating PE-layout-dependent workspace buffer memory that has values from other processes.)

Re: machine/compiler, I switched to GNU-debug-packsize=1 in order to avoid any possible issues with Intel optimizations and vectorization.

ambrad commented 2 years ago

@AaronDonahue I went back to your comment about SPA. I realized that in my original tests of making SPA irrelevant, I was allowing nc to have possibly uninitialized values (although I'm not fully convinced of that). In any case, I've switched now to this (single) modification:

#if 1
      nc(k).set(not_drymass, nccnst*inv_rho(k));
#else
      if (do_prescribed_CCN) {
         nc(k).set(not_drymass, max(nc(k), nccn_prescribed(k)/inv_cld_frac_l(k)));
      } else if (predictNc) {
         nc(k).set(not_drymass, max(nc(k) + nc_nuceat_tend(k) * dt, 0.0));
      } else {
         // nccnst is in units of #/m3 so needs to be converted.                                                                                                                              
         nc(k).set(not_drymass, nccnst*inv_rho(k));
      }
#endif

This fixes the diffs for my short runs. I'm setting up longer runs and will report back.

ambrad commented 2 years ago

Longer runs diff. So it wasn't that.

ambrad commented 2 years ago

I wanted to double-check another thing with a clean repo. I continue to run with NCPL=288, pack size 1, and the GNU compilers.

As reported in the last message, skipping the nc modification in the do_prescribed_CCN block does not fix the diff.

Next, I ran a simulation with a return from p3_main_part1 here:

  nucleationPossible = false;
  hydrometeorsPresent = false;
  team.team_barrier();
  return; // skip most of part1

It crashes just after 15 hours with TL1_1 < 0. But as already reported, the diff is gone up to that time. Many of the microphysics fields are 0, though, so (as we already know) this doesn't isolate the bug very well.

ambrad commented 2 years ago

More observations and thoughts:

  1. Assuming I did these correctly (turning off postcondition checking and returning immediately from the run_impl routines), we can rule out RRTMGP, SHOC, SPA, and surface coupling (immediate return from do_import) with confidence.
  2. The fact that various mods to P3 make the diff go away also rules out the dycore with confidence.
  3. We ruled out sedimentation in P3 but, as far as I can tell, nothing else.
  4. P3 is the only part of the code that produces nontrivial values for the advected tracers on the physics grid except SHOC's TKE, so I don't think we can assume the bug is in P3 or the P3 interface; it could also be in the AD infrastructure, such as the grid handling or the field manager.
  5. We could rule out or isolate the bug to the P3 parameterization proper (as opposed to the P3 interface or AD) by calling the F90 P3 code from the C++ interface.
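
As a rough illustration of point 5, a C++-to-Fortran bridge could look like the sketch below (hypothetical; the routine name, argument list, and layout handling are invented, and a C++ stub stands in for the Fortran object so the snippet links on its own).

#include <cstdio>
#include <vector>

// The Fortran side would expose something like
//   subroutine p3_main_f90(qc, nc, ncol, nlev, dt) bind(C, name="p3_main_f90")
extern "C" void p3_main_f90(double* qc, double* nc, int ncol, int nlev, double dt);

// Stub so this sketch links by itself; the real build would link the Fortran object.
extern "C" void p3_main_f90(double*, double*, int, int, double) {}

void run_p3_via_f90(std::vector<double>& qc, std::vector<double>& nc,
                    int ncol, int nlev, double dt) {
  // A real bridge would also reconcile the C++ (packed, row-major) and
  // Fortran (column-major) layouts; here we just forward contiguous buffers.
  p3_main_f90(qc.data(), nc.data(), ncol, nlev, dt);
}

int main() {
  std::vector<double> qc(4 * 72, 0.0), nc(4 * 72, 0.0);
  run_p3_via_f90(qc, nc, 4, 72, 300.0);
  std::printf("called the p3_main_f90 stub\n");
  return 0;
}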

AaronDonahue commented 2 years ago

Not sure if it adds useful information, but skipping HOMME by adding a return; to the run_impl in the homme interface also switched fail to pass for me. Even just commenting out the call to prim_run_f90 which ultimately calls prim_run_subcycle was sufficient to get my tests to pass.

bartgol commented 2 years ago

Do we know the magnitude of the non-bfb-ness? I'm wondering if this is due to an O(eps) difference between the same DOF replicated on bordering elems.

Basically, on the dyn grid, if dof N is shared across elems 1, 12, and 123, only one of the copies of that DOF will be saved to the out file. These copies can be non-bfb, since the MPI accumulation of the 3 contributions during DSS can happen in a different order on the 3 elements. This is why, even for the GLL physics grid, dynamics needs to write restart files on the full dyn grid (rather than just the unique GLL version of it).

With different npN, I suspect we can get slightly different d2p remap, in the sense that the element that is assigned the unique DOF might not be the same for np1 and np4, leading to a value saved to output nc file that is slightly different.

@ambrad Correct me if I'm wrong, but I suspect this is quite likely. Although, if this is true, then all ERP tests would be bound to fail, so maybe I'm wrong.

Note: inside each elem, the DSS is always performed in the same way. The issue is which elem is assigned the unique DOF (which might change with the number of ranks, due to a different space-filling curve split), which is what determines which DSS rearrangement is used in the IO.
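
For reference, the order dependence described above comes from floating-point addition not being associative; a minimal standalone example (not SCREAM code):

#include <cstdio>

int main() {
  // Summing the same three DSS contributions in a different order can change
  // the result in the last bits.
  const double a = 1.0e16, b = -1.0e16, c = 1.0;
  const double s1 = (a + b) + c;  // = 1.0
  const double s2 = a + (b + c);  // = 0.0, since -1.0e16 + 1.0 rounds to -1.0e16
  std::printf("%.17g %.17g\n", s1, s2);
  return 0;
}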

PeterCaldwell commented 2 years ago

Apologies if you've already checked this and I didn't catch it - do we know whether the non-bfbness is solely in the output, or is the solution different? I think Luca's hypothesis would just change the output. You should be able to check whether the solution is BFB by checking whether homme.atm.log has BFB values for the last timestep's output...

ambrad commented 2 years ago

The diffs are in the solution state, so they are real. One sees the diffs start locally at a small magnitude and then expand in space and grow in magnitude.

3-hourly snapshots of qv at level 62 of 72 are here: https://web.lcrc.anl.gov/public/e3sm/diagnostic_output/ac.ambradl/scream/ne30v1.diff/ne30v1-qv.html. Only the third column, the diff, is of interest, as the diffs are too small in magnitude in this short run to see in the fields themselves, in the case of qv (left two columns). Note how the diff starts small (by hour 3) and then quickly grows.

@PeterCaldwell re: your specific question of homme_atm.log output, if I diff the "qv( 1)" lines, I see diffs starting in the 4th time step (with ATM_NCPL=288 so that dynamic time steps are 1-1 with physics time steps). This output is based on model state.

Although the tests to date don't yet support isolating the problem just to the AD, I do speculate that something in the dyn<->phys grid handling is the issue, in particular how overlapping dyn GLL columns get mapped to unique physics columns for the parameterizations phase. I speculate there could be an issue where the GLL physics global ID is not PE-invariantly mapped to a GLL dynamics global ID.

bartgol commented 2 years ago

Actually, I'm not sure my comment would only apply to the output. All d2p remaps would suffer from np4 having a different elem owner for dof XYZ.

Re-reading my post, I'm thinking this is probably not the case, or else all ERP tests would make no sense...

bartgol commented 2 years ago

    I speculate there could be an issue where the GLL physics global ID is not PE-invariantly mapped to a GLL dynamics global ID.

Yes, that is a reasonable possibility. I will revisit that implementation.

bartgol commented 2 years ago

A first look at src/dynamics/homme/interface/phys_grid_mod.F90, which is where we compute physics dofs, seems to suggest that dof should be PE independent. In particular, the number of dofs on each elem and the actual dof gids seem to only depend on the elements gids, which, as far as I understand from homme's mesh/cube/dof mods, are PE independent. But since it's late here, I'll check again tomorrow.

ambrad commented 2 years ago

@AaronDonahue I confirmed in my setup what you wrote re: disabling the dycore.

ambrad commented 2 years ago

@bartgol I looked at the remap code out of curiosity. Just taking a random stab at this, I'm speculating that in create_p2d_map, the loop https://github.com/E3SM-Project/scream/blob/22add4597ba4a172e7cf30dce4636e89b4c17e64/components/scream/src/dynamics/homme/physics_dynamics_remapper.cpp#L980-L993 can assign a different dyn col to a physics col, depending on what elements are on an MPI process. Consider a physics grid point at the corner of an element, shared by four elements. Then consider what happens in these two cases: (1) a process has all four elements sharing that physics grid point; (2) a process has just one element total. To me it seems that in case 1, the chosen dynamics grid point could be in principle any of the four elements sharing that geographical point, whereas in case 2, there's only one dynamics grid point available in the search.

If that's correct, then I think a solution is to have the concept of an owning element for each physics point. One could establish the convention that the element having the lowest element GID among those sharing the physics point is the owner or, similarly, the lowest dynamics GID. Edit later: Actually, the element convention would lead to one element having just four physics columns, bad for load balancing. In EAM, I think Patrick Worley may have implemented a number of more sophisticated schemes for load balancing optimally. Once pgN is in, load balance will be trivial since only one element is associated with a pgN physics column.
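
A hypothetical sketch of the owning-element convention proposed above (types and names invented; it assumes the owning element is resident on the same rank as the physics column):

#include <cstdint>
#include <cstdio>
#include <limits>
#include <vector>

struct DynCol {
  std::int64_t elem_gid;  // global element id (PE-independent)
  int gll_idx;            // GLL point index within the element
  std::int64_t dof_gid;   // global id of the (possibly shared) dyn dof
};

// Among the local dyn columns sharing phys_gid, pick the copy living in the
// element with the smallest global id, so every rank makes the same choice.
int pick_dyn_col(const std::vector<DynCol>& local_dyn_cols, std::int64_t phys_gid) {
  int best = -1;
  std::int64_t best_elem = std::numeric_limits<std::int64_t>::max();
  for (int i = 0; i < static_cast<int>(local_dyn_cols.size()); ++i) {
    const auto& c = local_dyn_cols[i];
    if (c.dof_gid == phys_gid && c.elem_gid < best_elem) {
      best = i;
      best_elem = c.elem_gid;
    }
  }
  return best;  // -1 if no local copy matches
}

int main() {
  // Phys GID 7 is shared by copies in elements 12 and 3; the rule always picks
  // the copy in element 3, no matter which other elements happen to be local.
  std::vector<DynCol> cols = {{12, 0, 7}, {3, 15, 7}, {3, 2, 9}};
  std::printf("%d\n", pick_dyn_col(cols, 7));  // prints 1
  return 0;
}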

bartgol commented 2 years ago

I think the owning element is dictated by dynamics (the dof mod does the resolution of shared dofs, picking the elem with the smallest GID to own the dof). I'll double check this snippet, and if you're right, I suspect we can easily fix it.

bartgol commented 2 years ago

Looking at the code again, I think this is a leftover from the initial implementation of p2d and all the grid specs in scream. There was a bug we fixed quite a while ago, which had scream dofs not ordered correctly, causing a mismatch of the dof numbering against v0. I suspect this snippet in the remapper was relying on some assumption in the original implementation and didn't get fixed.

bartgol commented 2 years ago

Ok, I think I see the issue (though, if I'm seeing this right, I have no idea why all the tests pass). Will report shortly.

bartgol commented 2 years ago

@ambrad nailed it when he pointed to the p2d map. The fix is not that complicated, so I should have a PR up by the end of the day (in some timezone).

AaronDonahue commented 2 years ago

Nice sleuthing! Looking forward to the PR. Thanks @ambrad and @bartgol for getting to the bottom of this.

bartgol commented 2 years ago

Sorry, but I have to curb the enthusiasm. What Andrew said was not entirely true. Here are a few key points of dofs construction:

These points ensure that, regardless of number of ranks, each physics DOF is always matched to the same dynamics column.

I verified (by enriching the SE grid with some extra data) that the map

SE_elemGID ( SE_elemLID ( dynLID ( phys_gid )))

gave the same answer on 1 and 4 ranks (while I am getting diffs for the np1 vs np4 test).

Edit: there is still the chance things are not entirely correct, but it's a small chance: it is possible that while gathering dynamics dofs, I'm scanning the element list in the wrong order.
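
For concreteness, the check described above can be sketched as composing three lookups and dumping the resulting pairs, so the sorted output of the 1-rank and 4-rank runs can be diffed (containers and names are invented; the real maps would be filled from the grid objects):

#include <cstdint>
#include <cstdio>
#include <map>
#include <vector>

int main() {
  std::map<std::int64_t, int> dyn_lid_of_phys_gid;   // phys GID -> local dyn column
  std::vector<int> elem_lid_of_dyn_lid;              // local dyn column -> local element
  std::vector<std::int64_t> elem_gid_of_elem_lid;    // local element -> global element

  // ... fill the maps from the phys/dyn grid objects ...

  for (const auto& [phys_gid, dyn_lid] : dyn_lid_of_phys_gid) {
    const std::int64_t elem_gid = elem_gid_of_elem_lid[elem_lid_of_dyn_lid[dyn_lid]];
    std::printf("%lld %lld\n", static_cast<long long>(phys_gid),
                static_cast<long long>(elem_gid));
  }
  return 0;
}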

bartgol commented 2 years ago

Ok, I can confirm that the p2d map is indeed correct, despite the misleading comment at the beginning of the cpp file:

    // For each phys dofs, we find a corresponding dof in the dyn grid.
    // Notice that such dyn dof may not be unique (if phys dof is on an edge
    // of a SE element), but we don't care. We just need to find a match.
    // The BoundaryExchange already takes care of syncing all shared dyn dofs.

We do care which dynamics dof we grab, but it just so happens that we are grabbing the same one, regardless of the number of ranks.