E3SM-Project / scream

Exascale global atmosphere model written in C++ as part of the E3SM project
https://e3sm-project.github.io/scream/
Other
70 stars 48 forks source link

GPU nondeterminism #1752

Closed ambrad closed 2 years ago

ambrad commented 2 years ago

We should understand the sources of nondeterminism between runs with the same PE layout on the GPU. The dycore, SHOC, and P3 are all meant to be BFB deterministic. SHOC and P3 should be invariant to PE layout, and the dycore can be made so easily by keeping the team sizes fixed independently of number of elements on the GPU. rrtmgp is known to have some nondeterminism.

To start this characterization, I ran SCRAM at dd6e3983cf27fb5298f4418731c318a0aa1f72fe with the following rrtmgp diffs, the only known sources of nondeterminism, on Summit:

diff --git a/cpp/rrtmgp/kernels/mo_gas_optics_kernels.cpp b/cpp/rrtmgp/kernels/mo_gas_optics_kernels.cpp                                                                                                                              
index 20bd9f7..ca0cfce 100644                                                                                                                                                                                                         
--- a/cpp/rrtmgp/kernels/mo_gas_optics_kernels.cpp                                                                                                                                                                                    
+++ b/cpp/rrtmgp/kernels/mo_gas_optics_kernels.cpp                                                                                                                                                                                    
@@ -237,7 +237,8 @@ void gas_optical_depths_minor(int max_gpt_diff, int ncol, int nlay, int ngpt, in                                                                                                                                  
   // for (int ilay=1; ilay<=nlay; ilay++) {                                                                                                                                                                                          
   //   for (int icol=1; icol<=ncol; icol++) {                                                                                                                                                                                        
   //     for (int igpt0=0; igpt0<=max_gpt_diff; igpt0++) {                                                                                                                                                                           
-  parallel_for( Bounds<3>(nlay,ncol,{0,max_gpt_diff}) , YAKL_DEVICE_LAMBDA (int ilay, int icol, int igpt0) {                                                                                                                         
+  parallel_for( Bounds<2>(nlay,ncol) , YAKL_DEVICE_LAMBDA (int ilay, int icol) {                                                                                                                                                     
+      for (int igpt0=0; igpt0<=max_gpt_diff; igpt0++) {                                                                                                                                                                              
     // This check skips individual columns with no pressures in range                                                                                                                                                                
     //                                                                                                                                                                                                                               
     if ( layer_limits(icol,1) <= 0 || ilay < layer_limits(icol,1) || ilay > layer_limits(icol,2) ) {                                                                                                                                 
@@ -292,10 +293,12 @@ void gas_optical_depths_minor(int max_gpt_diff, int ncol, int nlay, int ngpt, in                                                                                                                                
                                           jeta.slice<1>(COLON,iflav,icol,ilay), myjtemp, nminork, neta, ntemp);                                                                                                                      
           tau_minor = kminor_loc * scaling;                                                                                                                                                                                          

-          yakl::atomicAdd( tau(igpt,ilay,icol) , tau_minor );                                                                                                                                                                        
+          //yakl::atomicAdd( tau(igpt,ilay,icol) , tau_minor );                                                                                                                                                                      
+          tau(igpt,ilay,icol) += tau_minor;                                                                                                                                                                                          
         }  // igpt <= gptE                                                                                                                                                                                                           
       }                                                                                                                                                                                                                              
     }                                                                                                                                                                                                                                
+      }                                                                                                                                                                                                                              
   });                                                                                                                                                                                                                                
 }                                                                                                                                                                                                                                    

diff --git a/cpp/rte/kernels/mo_fluxes_broadband_kernels.cpp b/cpp/rte/kernels/mo_fluxes_broadband_kernels.cpp                                                                                                                        
index 618dc4e..aa756c6 100644                                                                                                                                                                                                         
--- a/cpp/rte/kernels/mo_fluxes_broadband_kernels.cpp                                                                                                                                                                                 
+++ b/cpp/rte/kernels/mo_fluxes_broadband_kernels.cpp                                                                                                                                                                                 
@@ -25,9 +25,12 @@ void net_broadband(int ncol, int nlev, int ngpt, real3d const &spectral_flux_dn,                                                                                                                                   
   // do igpt = 2, ngpt                                                                                                                                                                                                               
   //   do ilev = 1, nlev                                                                                                                                                                                                             
   //     do icol = 1, ncol                                                                                                                                                                                                           
-  parallel_for( Bounds<3>({2,ngpt},nlev,ncol) , YAKL_DEVICE_LAMBDA (int igpt, int ilev, int icol) {                                                                                                                                  
+  parallel_for( Bounds<2>(nlev,ncol) , YAKL_DEVICE_LAMBDA (int ilev, int icol) {                                                                                                                                                     
+      for (int igpt=2; igpt <= ngpt; ++igpt) {                                                                                                                                                                                       
     real diff = spectral_flux_dn(icol, ilev, igpt) - spectral_flux_up(icol, ilev, igpt);                                                                                                                                             
-    yakl::atomicAdd( broadband_flux_net(icol,ilev) , diff );                                                                                                                                                                         
+    //yakl::atomicAdd( broadband_flux_net(icol,ilev) , diff );                                                                                                                                                                       
+    broadband_flux_net(icol,ilev) += diff;                                                                                                                                                                                           
+      }                                                                                                                                                                                                                              
   });                                                                                                                                                                                                                                
 #ifdef RRTMGP_DEBUG                                                                                                                                                                                                                  
   std::cout << "WARNING: THIS ISN'T TESTED!\n";            

Despite these modifications, the runs are not deterministic:

[ambradl@login3.summit run]$ zgrep "qv(  1)=\|nstep=" homme_atm.log.2173979.220617-000108.gz 
 nstep=           0  time=   0.0000000000000000       [s]
 qv(  1)=   0.1490436289941499E-06  0.2293270640075207E-01  0.3130485544141958E+04
 nstep=           6  time=   1800.0000000000000       [s]
 qv(  1)=   0.1838597635350173E-06  0.2255596100177682E-01  0.3130485544141958E+04
 nstep=          12  time=   3600.0000000000000       [s]
 qv(  1)=   0.4212306975874426E-06  0.2237589178886836E-01  0.3126929022576206E+04
 nstep=         480  time=   1.6666666666666667       [day]
 qv(  1)=   0.4163120738625255E-06  0.2209594759162743E-01  0.3004260710577097E+04
[ambradl@login3.summit run]$ zgrep "qv(  1)=\|nstep=" homme_atm.log.2174103.220617-020643.gz
 nstep=           0  time=   0.0000000000000000       [s]
 qv(  1)=   0.1490436289941499E-06  0.2293270640075207E-01  0.3130485544141958E+04
 nstep=           6  time=   1800.0000000000000       [s]
 qv(  1)=   0.1838597635350173E-06  0.2255596100177682E-01  0.3130485544141958E+04
 nstep=          12  time=   3600.0000000000000       [s]
 qv(  1)=   0.4212306975874426E-06  0.2237589178886836E-01  0.3126929022576206E+04
 nstep=         480  time=   1.6666666666666667       [day]
 qv(  1)=   0.4341162203363551E-06  0.2213253773697735E-01  0.3003957428214704E+04

This diff will work for others once https://github.com/E3SM-Project/E3SM/pull/5031 goes into the upstream repo and then is merged to SCREAM.

ambrad commented 2 years ago

The source of nondeterminism seems to be rrtmgp, even though it's not the atmoicAdds above. If I use the following diff that turns off rrtmgp:

diff --git a/components/scream/src/physics/rrtmgp/atmosphere_radiation.cpp b/components/scream/src/physics/rrtmgp/atmosphere_radiation.cpp                                                                                            
index c35cf94e2e..35c0998275 100644                                                                                                                                                                                                   
--- a/components/scream/src/physics/rrtmgp/atmosphere_radiation.cpp                                                                                                                                                                   
+++ b/components/scream/src/physics/rrtmgp/atmosphere_radiation.cpp                                                                                                                                                                   
@@ -301,6 +301,7 @@ void RRTMGPRadiation::initialize_impl(const RunType /* run_type */) {                                                                                                                                             
 // =========================================================================================                                                                                                                                         

 void RRTMGPRadiation::run_impl (const int dt) {                                                                                                                                                                                      
+  return;                                                                                                                                                                                                                            
   using PF = scream::PhysicsFunctions<DefaultDevice>;                                                                                                                                                                                
   using PC = scream::physics::Constants<Real>;                                                                                                                                                                                       
   using CO = scream::ColumnOps<DefaultDevice,Real>;

I then get BFB runs:

ambradl@login3.summit run]$ zgrep "qv(  1)=" homme_atm.log.2178405.220619-160325.gz
 qv(  1)=   0.1490436289941499E-06  0.2293270640075207E-01  0.3130485544141958E+04
 qv(  1)=   0.2113742820193549E-06  0.2262474993502463E-01  0.3130485544145857E+04
 qv(  1)=   0.2432119407492651E-06  0.2236962195304372E-01  0.3130485544150322E+04
 qv(  1)=   0.4206572532992378E-06  0.2220408944867455E-01  0.3130485544155032E+04
 qv(  1)=   0.4283653630334010E-06  0.2123625399428080E-01  0.3106513559797811E+04
 qv(  1)=   0.4497757249465867E-06  0.2153580860254133E-01  0.3085763340436063E+04
 qv(  1)=   0.4738402559871806E-06  0.2187106294372843E-01  0.3071734455223180E+04
 qv(  1)=   0.6171933688445528E-06  0.2232847960930057E-01  0.3064740058628764E+04
 qv(  1)=   0.6190056492908593E-06  0.2238082453929494E-01  0.3059894092129151E+04
 qv(  1)=   0.6206682027816043E-06  0.2250325424226907E-01  0.3054581920565226E+04
 qv(  1)=   0.6032523849546201E-06  0.2290730840012553E-01  0.3053110248009231E+04
 qv(  1)=   0.6245767689662035E-06  0.2357709032664381E-01  0.3053225955153207E+04
 qv(  1)=   0.6274433955481660E-06  0.2364467303280582E-01  0.3054660628148934E+04
 qv(  1)=   0.6065759292104155E-06  0.2421251672077098E-01  0.3052568069941964E+04
 qv(  1)=   0.5373375583983446E-06  0.2440863011897972E-01  0.3049473040753835E+04
 qv(  1)=   0.4162637041925488E-06  0.2466026805815246E-01  0.3051294380822904E+04
[ambradl@login3.summit run]$ zgrep "qv(  1)=" homme_atm.log.2178413.220619-162124.gz
 qv(  1)=   0.1490436289941499E-06  0.2293270640075207E-01  0.3130485544141958E+04
 qv(  1)=   0.2113742820193549E-06  0.2262474993502463E-01  0.3130485544145857E+04
 qv(  1)=   0.2432119407492651E-06  0.2236962195304372E-01  0.3130485544150322E+04
 qv(  1)=   0.4206572532992378E-06  0.2220408944867455E-01  0.3130485544155032E+04
 qv(  1)=   0.4283653630334010E-06  0.2123625399428080E-01  0.3106513559797811E+04
 qv(  1)=   0.4497757249465867E-06  0.2153580860254133E-01  0.3085763340436063E+04
 qv(  1)=   0.4738402559871806E-06  0.2187106294372843E-01  0.3071734455223180E+04
 qv(  1)=   0.6171933688445528E-06  0.2232847960930057E-01  0.3064740058628764E+04
 qv(  1)=   0.6190056492908593E-06  0.2238082453929494E-01  0.3059894092129151E+04
 qv(  1)=   0.6206682027816043E-06  0.2250325424226907E-01  0.3054581920565226E+04
 qv(  1)=   0.6032523849546201E-06  0.2290730840012553E-01  0.3053110248009231E+04
 qv(  1)=   0.6245767689662035E-06  0.2357709032664381E-01  0.3053225955153207E+04
 qv(  1)=   0.6274433955481660E-06  0.2364467303280582E-01  0.3054660628148934E+04
 qv(  1)=   0.6065759292104155E-06  0.2421251672077098E-01  0.3052568069941964E+04
 qv(  1)=   0.5373375583983446E-06  0.2440863011897972E-01  0.3049473040753835E+04
 qv(  1)=   0.4162637041925488E-06  0.2466026805815246E-01  0.3051294380822904E+04
ambrad commented 2 years ago

Ah, there are some atomicAdds in the scream interface to rrtmgp. I'll see if I can replace these and will report back.

ambrad commented 2 years ago

Ok, if I use this diff in addition to the one to the external rrtmgp posted in the first comment:

diff --git a/components/scream/src/physics/rrtmgp/scream_rrtmgp_interface.cpp b/components/scream/src/physics/rrtmgp/scream_rrtmgp_interface.cpp                                                                                      
index 78464e97c8..4145ff3df4 100644                                                                                                                                                                                                   
--- a/components/scream/src/physics/rrtmgp/scream_rrtmgp_interface.cpp                                                                                                                                                                
+++ b/components/scream/src/physics/rrtmgp/scream_rrtmgp_interface.cpp                                                                                                                                                                
@@ -161,8 +161,8 @@ namespace scream {                                                                                                                                                                                                
             // Threshold between visible and infrared is 0.7 micron, or 14286 cm^-1.                                                                                                                                                 
             const real visible_wavenumber_threshold = 14286;                                                                                                                                                                         
             auto wavenumber_limits = k_dist_sw.get_band_lims_wavenumber();                                                                                                                                                           
-            parallel_for(Bounds<2>(nswbands, ncol), YAKL_DEVICE_LAMBDA(const int ibnd, const int icol) {                                                                                                                             
-                                                                                                                                                                                                                                     
+            parallel_for(Bounds<1>(ncol), YAKL_DEVICE_LAMBDA(const int icol) {                                                                                                                                                       
+                for (int ibnd=1; ibnd<=nswbands; ++ibnd) {                                                                                                                                                                           
                 // Wavenumber is in the visible if it is above the visible wavenumber                                                                                                                                                
                 // threshold, and in the infrared if it is below the threshold                                                                                                                                                       
                 const bool is_visible_wave1 = (wavenumber_limits(1, ibnd) > visible_wavenumber_threshold ? true : false);                                                                                                            
@@ -171,23 +171,24 @@ namespace scream {                                                                                                                                                                                              
                 if (is_visible_wave1 && is_visible_wave2) {                                                                                                                                                                          

                     // Entire band is in the visible                                                                                                                                                                                 
-                    yakl::atomicAdd(sfc_flux_dir_vis(icol), sw_bnd_flux_dir(icol,ktop,ibnd));                                                                                                                                        
-                    yakl::atomicAdd(sfc_flux_dif_vis(icol), sw_bnd_flux_dif(icol,ktop,ibnd));                                                                                                                                        
+                    /*yakl::atomicAdd(*/sfc_flux_dir_vis(icol)/*, */+=sw_bnd_flux_dir(icol,ktop,ibnd);//);                                                                                                                           
+                    /*yakl::atomicAdd(*/sfc_flux_dif_vis(icol)/*, */+=sw_bnd_flux_dif(icol,ktop,ibnd);//);                                                                                                                           

                 } else if (!is_visible_wave1 && !is_visible_wave2) {                                                                                                                                                                 

                     // Entire band is in the longwave (near-infrared)                                                                                                                                                                
-                    yakl::atomicAdd(sfc_flux_dir_nir(icol), sw_bnd_flux_dir(icol,ktop,ibnd));                                                                                                                                        
-                    yakl::atomicAdd(sfc_flux_dif_nir(icol), sw_bnd_flux_dif(icol,ktop,ibnd));                                                                                                                                        
+                    /*yakl::atomicAdd(*/sfc_flux_dir_nir(icol)/*, */+=sw_bnd_flux_dir(icol,ktop,ibnd);//);                                                                                                                           
+                    /*yakl::atomicAdd(*/sfc_flux_dif_nir(icol)/*, */+=sw_bnd_flux_dif(icol,ktop,ibnd);//);                                                                                                                           

                 } else {                                                                                                                                                                                                             

                     // Band straddles the visible to near-infrared transition, so put half                                                                                                                                           
                     // the flux in visible and half in near-infrared fluxes                                                                                                                                                          
-                    yakl::atomicAdd(sfc_flux_dir_vis(icol), 0.5 * sw_bnd_flux_dir(icol,ktop,ibnd));                                                                                                                                  
-                    yakl::atomicAdd(sfc_flux_dif_vis(icol), 0.5 * sw_bnd_flux_dif(icol,ktop,ibnd));                                                                                                                                  
-                    yakl::atomicAdd(sfc_flux_dir_nir(icol), 0.5 * sw_bnd_flux_dir(icol,ktop,ibnd));                                                                                                                                  
-                    yakl::atomicAdd(sfc_flux_dif_nir(icol), 0.5 * sw_bnd_flux_dif(icol,ktop,ibnd));                                                                                                                                  
+                    /*yakl::atomicAdd(*/sfc_flux_dir_vis(icol)/*, */+=0.5 * sw_bnd_flux_dir(icol,ktop,ibnd);//);                                                                                                                     
+                    /*yakl::atomicAdd(*/sfc_flux_dif_vis(icol)/*, */+=0.5 * sw_bnd_flux_dif(icol,ktop,ibnd);//);                                                                                                                     
+                    /*yakl::atomicAdd(*/sfc_flux_dir_nir(icol)/*, */+=0.5 * sw_bnd_flux_dir(icol,ktop,ibnd);//);                                                                                                                     
+                    /*yakl::atomicAdd(*/sfc_flux_dif_nir(icol)/*, */+=0.5 * sw_bnd_flux_dif(icol,ktop,ibnd);//);                                                                                                                     
+                }                                                                                                                                                                                                                    
                 }                                                                                                                                                                                                                    
             });                                                                                                                                                                                                                      
         }

then the runs are deterministic on Summit with the optimized build.

IMO, we should provide a configuration flag that mandates determinism. In fact, I suggest two flags, as follows:

Here I've emphasized optimized build. Of course the debug build will provide the same. A subcomponent is free to implement only the second, since it implies the first. If a subcomponent refuses to following one or both flags, then it should provide a compile-time error saying so. That way, in the future, we can at least support builds with a subset of subcomponents that follow these flags. IMO, this will be important for finding and debugging subtle bugs, just like in mainstream E3SM.

Thoughts, @bartgol, @AaronDonahue, @brhillman, @PeterCaldwell?

PeterCaldwell commented 2 years ago

Thanks for this sleuthing, @ambrad . Yes, I agree that maintaining BFB reproducibility is important and since fully-optimized GPU runs can't be BFB we will need a flag to enable this mode. I had the vague notion that we would tie BFB with the debug-compile flag, but I see now that would be a weak form of protection - BFB optimized builds would leave less room for our opt builds to have problems. I'm unclear why we would want RUN_TO_RUN BFB instead of always just testing PE_LAYOUT BFBness. Can you provide a use case of this type?

If a subcomponent refuses to following one or both flags, then it should provide a compile-time error saying so

Are you imagining an assert checking that each process knows what these flags mean? I can't quite visualize how this would work. I would think that if a component didn't implement the non-bfb mode, we'd quickly see that the model wasn't bfb any more?

I'm also a bit unclear what needs to be done to implement this desired flag. Just wait for https://github.com/E3SM-Project/E3SM/pull/5031, then add EAMXX_BFB... flag and put the above diffs in "if EAMXX_BFB" blocks?

ambrad commented 2 years ago
  1. Re: "since fully-optimized GPU runs can't be BFB": That may or may not be true. It wouldn't surprise me if the changes to rrtmgp have little effect on performance. P3, SHOC, and the dycore all by design are BFB run to run.
  2. The dycore currently only is BFB run to run because team sizes depend on the number of elements on the GPU. Fixing the team sizes independently of number of elements per GPU would then make the dycore also invariant to PE layout.
  3. This use case is not that important, so I agree that supporting just EAMXX_BFB_PE_LAYOUT is fine.
  4. Re: throwing a build error, I just meant that if we find a subcomponent doesn't support the mode and isn't BFB run to run (or PE-layout invariant if that's all we have a flag for), then we'd insert code along these lines:
    #ifdef EAMXX_BFB_PE_LAYOUT
    "Build time error in subcomponent X: Cannot build in BFB mode."
    #endif

    So then there would be no mystery in the future. But as long as we're in charge of this stuff, I believe we can always help the subcomponent devs get that mode implemented.

  5. PR 5031 is orthogonal. I'm just using dycore diagnostic output as a means of assessing BFBness rather than netcdf file output.
  6. We need to do this:
    1. Ask to put a commit in the external rrtmgp repo for the diff in the first comment in this issue, based on a flag whose name is consistent with falgs in that repo.
    2. Commit the second diff to our repo using the EAMXX_BFB_PE_LAYOUT flag.
    3. Add a CMake flag to the SCREAM CMake system.
    4. Make some changes to the team policies in the dycore to support EAMXX_BFB_PE_LAYOUT in addition to the current implicit support for the (not to be used) EAMXX_BFB_RUN_TO_RUN flag.
bartgol commented 2 years ago

I think the two config options that Andrew proposed make a lot of sense, and are very verbose. I think having CMake vars that can be enabled/disabled make things much clearer. In view of EAMxx (and v4 development), this will allow parametrizations developers to clearly state whether they support BFB-ness for different PE layouts. In that regard, I would move the check from build to config time (i.e., when cmake runs, rather than make/ninja).

ambrad commented 2 years ago

I'm going to take this. Details after meeting discussion: