Closed ambrad closed 2 years ago
The source of nondeterminism seems to be rrtmgp, even though it's not the atmoicAdds above. If I use the following diff that turns off rrtmgp:
diff --git a/components/scream/src/physics/rrtmgp/atmosphere_radiation.cpp b/components/scream/src/physics/rrtmgp/atmosphere_radiation.cpp
index c35cf94e2e..35c0998275 100644
--- a/components/scream/src/physics/rrtmgp/atmosphere_radiation.cpp
+++ b/components/scream/src/physics/rrtmgp/atmosphere_radiation.cpp
@@ -301,6 +301,7 @@ void RRTMGPRadiation::initialize_impl(const RunType /* run_type */) {
// =========================================================================================
void RRTMGPRadiation::run_impl (const int dt) {
+ return;
using PF = scream::PhysicsFunctions<DefaultDevice>;
using PC = scream::physics::Constants<Real>;
using CO = scream::ColumnOps<DefaultDevice,Real>;
I then get BFB runs:
ambradl@login3.summit run]$ zgrep "qv( 1)=" homme_atm.log.2178405.220619-160325.gz
qv( 1)= 0.1490436289941499E-06 0.2293270640075207E-01 0.3130485544141958E+04
qv( 1)= 0.2113742820193549E-06 0.2262474993502463E-01 0.3130485544145857E+04
qv( 1)= 0.2432119407492651E-06 0.2236962195304372E-01 0.3130485544150322E+04
qv( 1)= 0.4206572532992378E-06 0.2220408944867455E-01 0.3130485544155032E+04
qv( 1)= 0.4283653630334010E-06 0.2123625399428080E-01 0.3106513559797811E+04
qv( 1)= 0.4497757249465867E-06 0.2153580860254133E-01 0.3085763340436063E+04
qv( 1)= 0.4738402559871806E-06 0.2187106294372843E-01 0.3071734455223180E+04
qv( 1)= 0.6171933688445528E-06 0.2232847960930057E-01 0.3064740058628764E+04
qv( 1)= 0.6190056492908593E-06 0.2238082453929494E-01 0.3059894092129151E+04
qv( 1)= 0.6206682027816043E-06 0.2250325424226907E-01 0.3054581920565226E+04
qv( 1)= 0.6032523849546201E-06 0.2290730840012553E-01 0.3053110248009231E+04
qv( 1)= 0.6245767689662035E-06 0.2357709032664381E-01 0.3053225955153207E+04
qv( 1)= 0.6274433955481660E-06 0.2364467303280582E-01 0.3054660628148934E+04
qv( 1)= 0.6065759292104155E-06 0.2421251672077098E-01 0.3052568069941964E+04
qv( 1)= 0.5373375583983446E-06 0.2440863011897972E-01 0.3049473040753835E+04
qv( 1)= 0.4162637041925488E-06 0.2466026805815246E-01 0.3051294380822904E+04
[ambradl@login3.summit run]$ zgrep "qv( 1)=" homme_atm.log.2178413.220619-162124.gz
qv( 1)= 0.1490436289941499E-06 0.2293270640075207E-01 0.3130485544141958E+04
qv( 1)= 0.2113742820193549E-06 0.2262474993502463E-01 0.3130485544145857E+04
qv( 1)= 0.2432119407492651E-06 0.2236962195304372E-01 0.3130485544150322E+04
qv( 1)= 0.4206572532992378E-06 0.2220408944867455E-01 0.3130485544155032E+04
qv( 1)= 0.4283653630334010E-06 0.2123625399428080E-01 0.3106513559797811E+04
qv( 1)= 0.4497757249465867E-06 0.2153580860254133E-01 0.3085763340436063E+04
qv( 1)= 0.4738402559871806E-06 0.2187106294372843E-01 0.3071734455223180E+04
qv( 1)= 0.6171933688445528E-06 0.2232847960930057E-01 0.3064740058628764E+04
qv( 1)= 0.6190056492908593E-06 0.2238082453929494E-01 0.3059894092129151E+04
qv( 1)= 0.6206682027816043E-06 0.2250325424226907E-01 0.3054581920565226E+04
qv( 1)= 0.6032523849546201E-06 0.2290730840012553E-01 0.3053110248009231E+04
qv( 1)= 0.6245767689662035E-06 0.2357709032664381E-01 0.3053225955153207E+04
qv( 1)= 0.6274433955481660E-06 0.2364467303280582E-01 0.3054660628148934E+04
qv( 1)= 0.6065759292104155E-06 0.2421251672077098E-01 0.3052568069941964E+04
qv( 1)= 0.5373375583983446E-06 0.2440863011897972E-01 0.3049473040753835E+04
qv( 1)= 0.4162637041925488E-06 0.2466026805815246E-01 0.3051294380822904E+04
Ah, there are some atomicAdds in the scream interface to rrtmgp. I'll see if I can replace these and will report back.
Ok, if I use this diff in addition to the one to the external rrtmgp posted in the first comment:
diff --git a/components/scream/src/physics/rrtmgp/scream_rrtmgp_interface.cpp b/components/scream/src/physics/rrtmgp/scream_rrtmgp_interface.cpp
index 78464e97c8..4145ff3df4 100644
--- a/components/scream/src/physics/rrtmgp/scream_rrtmgp_interface.cpp
+++ b/components/scream/src/physics/rrtmgp/scream_rrtmgp_interface.cpp
@@ -161,8 +161,8 @@ namespace scream {
// Threshold between visible and infrared is 0.7 micron, or 14286 cm^-1.
const real visible_wavenumber_threshold = 14286;
auto wavenumber_limits = k_dist_sw.get_band_lims_wavenumber();
- parallel_for(Bounds<2>(nswbands, ncol), YAKL_DEVICE_LAMBDA(const int ibnd, const int icol) {
-
+ parallel_for(Bounds<1>(ncol), YAKL_DEVICE_LAMBDA(const int icol) {
+ for (int ibnd=1; ibnd<=nswbands; ++ibnd) {
// Wavenumber is in the visible if it is above the visible wavenumber
// threshold, and in the infrared if it is below the threshold
const bool is_visible_wave1 = (wavenumber_limits(1, ibnd) > visible_wavenumber_threshold ? true : false);
@@ -171,23 +171,24 @@ namespace scream {
if (is_visible_wave1 && is_visible_wave2) {
// Entire band is in the visible
- yakl::atomicAdd(sfc_flux_dir_vis(icol), sw_bnd_flux_dir(icol,ktop,ibnd));
- yakl::atomicAdd(sfc_flux_dif_vis(icol), sw_bnd_flux_dif(icol,ktop,ibnd));
+ /*yakl::atomicAdd(*/sfc_flux_dir_vis(icol)/*, */+=sw_bnd_flux_dir(icol,ktop,ibnd);//);
+ /*yakl::atomicAdd(*/sfc_flux_dif_vis(icol)/*, */+=sw_bnd_flux_dif(icol,ktop,ibnd);//);
} else if (!is_visible_wave1 && !is_visible_wave2) {
// Entire band is in the longwave (near-infrared)
- yakl::atomicAdd(sfc_flux_dir_nir(icol), sw_bnd_flux_dir(icol,ktop,ibnd));
- yakl::atomicAdd(sfc_flux_dif_nir(icol), sw_bnd_flux_dif(icol,ktop,ibnd));
+ /*yakl::atomicAdd(*/sfc_flux_dir_nir(icol)/*, */+=sw_bnd_flux_dir(icol,ktop,ibnd);//);
+ /*yakl::atomicAdd(*/sfc_flux_dif_nir(icol)/*, */+=sw_bnd_flux_dif(icol,ktop,ibnd);//);
} else {
// Band straddles the visible to near-infrared transition, so put half
// the flux in visible and half in near-infrared fluxes
- yakl::atomicAdd(sfc_flux_dir_vis(icol), 0.5 * sw_bnd_flux_dir(icol,ktop,ibnd));
- yakl::atomicAdd(sfc_flux_dif_vis(icol), 0.5 * sw_bnd_flux_dif(icol,ktop,ibnd));
- yakl::atomicAdd(sfc_flux_dir_nir(icol), 0.5 * sw_bnd_flux_dir(icol,ktop,ibnd));
- yakl::atomicAdd(sfc_flux_dif_nir(icol), 0.5 * sw_bnd_flux_dif(icol,ktop,ibnd));
+ /*yakl::atomicAdd(*/sfc_flux_dir_vis(icol)/*, */+=0.5 * sw_bnd_flux_dir(icol,ktop,ibnd);//);
+ /*yakl::atomicAdd(*/sfc_flux_dif_vis(icol)/*, */+=0.5 * sw_bnd_flux_dif(icol,ktop,ibnd);//);
+ /*yakl::atomicAdd(*/sfc_flux_dir_nir(icol)/*, */+=0.5 * sw_bnd_flux_dir(icol,ktop,ibnd);//);
+ /*yakl::atomicAdd(*/sfc_flux_dif_nir(icol)/*, */+=0.5 * sw_bnd_flux_dif(icol,ktop,ibnd);//);
+ }
}
});
}
then the runs are deterministic on Summit with the optimized build.
IMO, we should provide a configuration flag that mandates determinism. In fact, I suggest two flags, as follows:
Here I've emphasized optimized build. Of course the debug build will provide the same. A subcomponent is free to implement only the second, since it implies the first. If a subcomponent refuses to following one or both flags, then it should provide a compile-time error saying so. That way, in the future, we can at least support builds with a subset of subcomponents that follow these flags. IMO, this will be important for finding and debugging subtle bugs, just like in mainstream E3SM.
Thoughts, @bartgol, @AaronDonahue, @brhillman, @PeterCaldwell?
Thanks for this sleuthing, @ambrad . Yes, I agree that maintaining BFB reproducibility is important and since fully-optimized GPU runs can't be BFB we will need a flag to enable this mode. I had the vague notion that we would tie BFB with the debug-compile flag, but I see now that would be a weak form of protection - BFB optimized builds would leave less room for our opt builds to have problems. I'm unclear why we would want RUN_TO_RUN BFB instead of always just testing PE_LAYOUT BFBness. Can you provide a use case of this type?
If a subcomponent refuses to following one or both flags, then it should provide a compile-time error saying so
Are you imagining an assert checking that each process knows what these flags mean? I can't quite visualize how this would work. I would think that if a component didn't implement the non-bfb mode, we'd quickly see that the model wasn't bfb any more?
I'm also a bit unclear what needs to be done to implement this desired flag. Just wait for https://github.com/E3SM-Project/E3SM/pull/5031, then add EAMXX_BFB... flag and put the above diffs in "if EAMXX_BFB" blocks?
#ifdef EAMXX_BFB_PE_LAYOUT
"Build time error in subcomponent X: Cannot build in BFB mode."
#endif
So then there would be no mystery in the future. But as long as we're in charge of this stuff, I believe we can always help the subcomponent devs get that mode implemented.
I think the two config options that Andrew proposed make a lot of sense, and are very verbose. I think having CMake vars that can be enabled/disabled make things much clearer. In view of EAMxx (and v4 development), this will allow parametrizations developers to clearly state whether they support BFB-ness for different PE layouts. In that regard, I would move the check from build to config time (i.e., when cmake runs, rather than make/ninja).
I'm going to take this. Details after meeting discussion:
We should understand the sources of nondeterminism between runs with the same PE layout on the GPU. The dycore, SHOC, and P3 are all meant to be BFB deterministic. SHOC and P3 should be invariant to PE layout, and the dycore can be made so easily by keeping the team sizes fixed independently of number of elements on the GPU. rrtmgp is known to have some nondeterminism.
To start this characterization, I ran SCRAM at dd6e3983cf27fb5298f4418731c318a0aa1f72fe with the following rrtmgp diffs, the only known sources of nondeterminism, on Summit:
Despite these modifications, the runs are not deterministic:
This diff will work for others once https://github.com/E3SM-Project/E3SM/pull/5031 goes into the upstream repo and then is merged to SCREAM.