Closed: @worleyph closed this issue 6 years ago.
We should also check to see if this is a problem on other platforms. @ndkeen can you check cori-knl and @amametjanov can you check theta? You could try PET_Ln9_PS.ne30_oECv3_ICG.A_WCYCL1950S_CMIP6_LR .
No problems on Theta (with git master v1.0.0-beta.3-2399-g03b8e4ad3):
> ./cs.status.20180402_173214_g7dhdr
PET_Ln9_PS.ne30_oECv3_ICG.A_WCYCL1950S_CMIP6_LR.theta_intel (Overall: PASS) details:
PASS PET_Ln9_PS.ne30_oECv3_ICG.A_WCYCL1950S_CMIP6_LR.theta_intel CREATE_NEWCASE
PASS PET_Ln9_PS.ne30_oECv3_ICG.A_WCYCL1950S_CMIP6_LR.theta_intel XML
PASS PET_Ln9_PS.ne30_oECv3_ICG.A_WCYCL1950S_CMIP6_LR.theta_intel SETUP
PASS PET_Ln9_PS.ne30_oECv3_ICG.A_WCYCL1950S_CMIP6_LR.theta_intel SHAREDLIB_BUILD time=337
PASS PET_Ln9_PS.ne30_oECv3_ICG.A_WCYCL1950S_CMIP6_LR.theta_intel MODEL_BUILD time=508
PASS PET_Ln9_PS.ne30_oECv3_ICG.A_WCYCL1950S_CMIP6_LR.theta_intel SUBMIT
PASS PET_Ln9_PS.ne30_oECv3_ICG.A_WCYCL1950S_CMIP6_LR.theta_intel RUN time=528
PASS PET_Ln9_PS.ne30_oECv3_ICG.A_WCYCL1950S_CMIP6_LR.theta_intel COMPARE_base_single_thread
PASS PET_Ln9_PS.ne30_oECv3_ICG.A_WCYCL1950S_CMIP6_LR.theta_intel MEMLEAK insufficient data for memleak test
Result from Cori-KNL is also a PASS (with git master v1.0.0-beta.3-2399-g03b8e4ad3):
> ./cs.status.20180402_103710_derqdb
PET_Ln9_PS.ne30_oECv3_ICG.A_WCYCL1950S_CMIP6_LR.cori-knl_intel (Overall: PASS) details:
PASS PET_Ln9_PS.ne30_oECv3_ICG.A_WCYCL1950S_CMIP6_LR.cori-knl_intel CREATE_NEWCASE
PASS PET_Ln9_PS.ne30_oECv3_ICG.A_WCYCL1950S_CMIP6_LR.cori-knl_intel XML
PASS PET_Ln9_PS.ne30_oECv3_ICG.A_WCYCL1950S_CMIP6_LR.cori-knl_intel SETUP
PASS PET_Ln9_PS.ne30_oECv3_ICG.A_WCYCL1950S_CMIP6_LR.cori-knl_intel SHAREDLIB_BUILD time=682
PASS PET_Ln9_PS.ne30_oECv3_ICG.A_WCYCL1950S_CMIP6_LR.cori-knl_intel MODEL_BUILD time=1440
PASS PET_Ln9_PS.ne30_oECv3_ICG.A_WCYCL1950S_CMIP6_LR.cori-knl_intel SUBMIT
PASS PET_Ln9_PS.ne30_oECv3_ICG.A_WCYCL1950S_CMIP6_LR.cori-knl_intel RUN time=631
PASS PET_Ln9_PS.ne30_oECv3_ICG.A_WCYCL1950S_CMIP6_LR.cori-knl_intel COMPARE_base_single_thread
PASS PET_Ln9_PS.ne30_oECv3_ICG.A_WCYCL1950S_CMIP6_LR.cori-knl_intel MEMLEAK insufficient data for memleak test
For Titan/PGI, just checked both
--compset FC5AV1C-L --res ne4_ne4
and
--compset A_WCYCL1850S --res ne30_oECv3_ICG
and they were both reproducible with the same changes to env_mach_pes.xml for which
--compset A_WCYCL1950S_CMIP6_LR --res ne30_oECv3_ICG
was not reproducible (changing the ATM threading from 1 to 2, with BUILD_THREADED=TRUE). So something particular to this compset is causing the problem.
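For reference, the env_mach_pes.xml changes are just edits to `<entry id="..." value="...">` elements, so checking that two cases differ only in the intended threading settings can be sketched in a few lines of Python. This is a sketch, not a CIME tool; the helper names are hypothetical, and entry ids like NTHRDS_ATM follow the usual CIME conventions:

```python
import xml.etree.ElementTree as ET

def entry_values(xml_text):
    """Map entry id -> value for a CIME env_*.xml document."""
    root = ET.fromstring(xml_text)
    return {e.get("id"): e.get("value") for e in root.iter("entry")}

def diff_entries(a_xml, b_xml):
    """Return {id: (value_in_a, value_in_b)} for entries that differ."""
    a, b = entry_values(a_xml), entry_values(b_xml)
    return {k: (a.get(k), b.get(k))
            for k in set(a) | set(b) if a.get(k) != b.get(k)}
```

Run over the one-thread and two-thread cases, this should report only the NTHRDS_* (and directly related) entries as differing.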
Hi @worleyph - this is alarming. Another compset to try would be A_WCYCL1850S_CMIP6. This is what we're using for the CMIP6 "deck" runs; it also exercised code identical to the 1950 compset until changes that went onto master last Tuesday added some more differences.
@PeterCaldwell , this is just a Titan issue, and only for PGI. I would not be concerned. But thanks for the tip.
@PeterCaldwell , since you are "listening", when comparing env_build.xml for the two compsets (A_WCYCL1850S and A_WCYCL1950S_CMIP6_LR), some of the entries for A_WCYCL1950S_CMIP6_LR look weird:
<entry id="CLM_CONFIG_OPTS" value="-phys clm4_5 -phys clm4_5 -cppdefs -DMODAL_AER -phys clm4_5 -cppdefs '-DMODAL_AER -DAPPLY_POST_DECK_BUGFIXES'">
and
<entry id="CAM_CONFIG_OPTS" value="-phys cam5 -clubb_sgs -microphys mg2 -chem linoz_mam4_resus_mom_soag -rain_evap_to_coarse_aero -nlev 72 -clubb_sgs -microphys mg2 -chem linoz_mam4_resus_mom_soag -rain_evap_to_coarse_aero -nlev 72 -clubb_sgs -microphys mg2 -chem linoz_mam4_resus_mom_soag -rain_evap_to_coarse_aero -cppdefs -DAPPLY_POST_DECK_BUGFIXES -nlev 72 -bc_dep_to_snow_updates">
I guess that they are okay, but a lot of duplicate or triplicate entries, e.g. -clubb_sgs (3 times!), -microphys mg2 (3 times), etc.
The only thing I see at the moment is that A_WCYCL1850S has both
-chem linoz_mam4_resus_mom
-chem linoz_mam4_resus_mom_soag
in CAM_CONFIG_OPTS, but A_WCYCL1950S_CMIP6_LR only has
-chem linoz_mam4_resus_mom_soag
(and -DAPPLY_POST_DECK_BUGFIXES). Guess that I will look at the APPLY_POST_DECK_BUGFIXES ifdefs.
@worleyph - before looking at APPLY_POST_DECK_BUGFIXES, I'd ask whether you saw non-BFB behavior before last Tuesday because that's when APPLY_POST_DECK_BUGFIXES was added. First, I'd check whether A_WCYCL1850S_CMIP6 has the same problems, because it should have identical code choices to 1950, but without APPLY_POST_DECK_BUGFIXES.
In terms of repeated C?M_CONFIG_OPTS: this seems like something that is annoying but harmless... I'd suggest waiting until after the release to clean this up.
Having both chem=linoz_mam4_resus_mom and linoz_mam4_resus_mom_soag seems like a problem since (I imagine) chem can only be one or the other of these choices. Which one is chosen? Is this behavior robust? Pinging @cameronsmith1 because he's the linoz expert.
@PeterCaldwell : Too late! But adding the istanbul CPU target to the files containing the APPLY_POST_DECK_BUGFIXES token did not eliminate the loss of reproducibility. I'll try your suggestion next. Thanks.
Buildnamelist is ugly. It contains many IF statements that can create or duplicate CONFIG_OPTS. This should be harmless, as later options should supersede earlier options. Cleaning up the mess is complicated, especially if we don't want to change previous compsets. Personally, I think we need a deep dive on this, but not until after the release.
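Since later options supersede earlier ones, the duplicated CONFIG_OPTS strings could in principle be collapsed mechanically. A minimal sketch of such a cleanup (a hypothetical helper, not part of buildnamelist; it uses the heuristic that tokens starting with -D, including a quoted -cppdefs payload, are values rather than new option keys):

```python
import shlex

def dedup_opts(opts: str) -> str:
    """Collapse repeated config options, keeping the last occurrence of each.

    Heuristic: a token starting with '-' opens a new option, except '-D...'
    tokens (cpp defines, including quoted -cppdefs payloads), which are values.
    """
    tokens = shlex.split(opts)
    pairs = []
    i = 0
    while i < len(tokens):
        key = tokens[i]
        i += 1
        vals = []
        while i < len(tokens) and (not tokens[i].startswith("-")
                                   or tokens[i].startswith("-D")):
            vals.append(tokens[i])
            i += 1
        pairs.append((key, vals))
    last = {key: vals for key, vals in pairs}            # later wins
    out = []
    for key in dict.fromkeys(key for key, _ in pairs):   # first-seen order
        out.append(key)
        out.extend(last[key])
    return " ".join(out)
```

Applied to the CLM_CONFIG_OPTS value quoted above, this would reduce the three -phys clm4_5 entries to one and keep only the final -cppdefs payload, matching the "later supersedes earlier" behavior.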
@cameronsmith1 , thanks. I did compare build logs, and ATM for A_WCYCL1850S_CMIP6 and A_WCYCL1850S are being compiled the same, so the duplicates/triplicates are not causing any problems.
@PeterCaldwell , A_WCYCL1850S_CMIP6 is not reproducible while A_WCYCL1850S is. Got to be a runtime difference then? The differences in env_run.xml are minimal and do not provide much guidance. More digging ...
The main differences between A_WCYCL1850S_CMIP6 and A_WCYCL1850S involve different datasets and different parameter values, and it is hard to see how that could cause non-reproducibility with different PE layouts. The one thing I can think of that changed and affected the code is the stratospheric aerosols. This was a late addition because there was no good way to simply modify the dataset. The new stratospheric aerosol code does involve horizontal regridding, so it is conceivable that there could be a threading issue.
I checked with @golaz , and he confirmed that the CMIP6 version is BFB with different PE layouts, but that was only tested on Edison, and we think all the PE_layouts used nthreads=1.
To switch the volcanic aerosol code between the CMIP6 type and the pre-CMIP6 type, use the prescribed_volcaero_X options in the CAM namelist (I assume the differences are obvious from comparing the namelists from your CMIP6 and pre-CMIP6 simulations).
I am pinging @singhbalwinder , who wrote the new strat-aerosol coding.
@cameronsmith1 - sorry - this keeps scaring people. THIS IS ONLY AN ISSUE ON TITAN AND ONLY FOR PGI. I know how to solve it - reintroduce the istanbul CPU target when building ATM. This is overkill, and we have been trying to get away from this. I just need to identify which additional files need to be compiled in this special way.
@cameronsmith1 , the differences I see in atm_in are
135,138c135
< &cospsimulator_nl
< cosp_lite = .true.
246,247d242
< ieflx_opt = 2
< l_ieflx_fix = .true.
280,286d274
< &prescribed_volcaero_nl
< prescribed_volcaero_cycle_yr = 1
< prescribed_volcaero_datapath = '/lustre/atlas1/cli900/world-shared/cesm/inputdata/atm/cam/volc'
< prescribed_volcaero_file = 'CMIP_DOE-ACME_radiation_average_1850-2014_v3_c20171204.nc'
< prescribed_volcaero_filetype = 'VOLC_CMIP6'
< prescribed_volcaero_type = 'CYCLICAL'
< /
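For anyone repeating this comparison, the atm_in differences above can be pulled out with a small namelist-aware diff rather than raw diff. A rough sketch (a toy "&group ... /" parser, not a full Fortran namelist reader; the helper names are hypothetical):

```python
def parse_namelist(text):
    """Tiny &group ... / parser: returns {group: {key: value}} as strings."""
    groups, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("&"):
            current = line[1:]
            groups[current] = {}
        elif line == "/":
            current = None
        elif "=" in line and current is not None:
            key, _, val = line.partition("=")
            groups[current][key.strip()] = val.strip()
    return groups

def diff_namelists(a, b):
    """Settings present in a but missing from, or different in, b."""
    ga, gb = parse_namelist(a), parse_namelist(b)
    out = {}
    for grp, settings in ga.items():
        for key, val in settings.items():
            if gb.get(grp, {}).get(key) != val:
                out.setdefault(grp, {})[key] = val
    return out
```

Running diff_namelists on the CMIP6 and pre-CMIP6 atm_in files would report the cospsimulator_nl, ieflx/energy-fixer, and prescribed_volcaero_nl settings shown above.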
I'll look at the new stratospheric aerosol code then (thanks).
OK. Those other namelist changes are to (a) activate the COSP simulator and to (b) activate the new energy fixer. Those are also candidates for giving the problem you see. Both could easily be turned off for testing purposes.
Thanks @cameronsmith1 for pinging me. Like @cameronsmith1 mentioned, we did do BFB testing with those codes. I ran a few tests on the PNNL cluster Constance as well. I did all the testing using the Intel compiler, and I think @golaz also used Intel on Edison.
@singhbalwinder , where would I find the new strat-aerosol coding? I looked at check_energy.F90 and prescribed_volcaero.F90, and neither of them was sufficient to eliminate the non-reproducibility (on Titan, using PGI, without using the istanbul CPU target globally). The new stratospheric aerosol code is my last idea before going back to bisecting all of ATM. Thanks.
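If it does come to bisecting all of ATM, the search over source files can at least be scripted as a binary search over which files get the istanbul target. A sketch, assuming a single culprit file and a hypothetical run_is_reproducible callback that rebuilds with the special target applied only to the flagged files and reruns the reproducibility test:

```python
def bisect_culprit(files, run_is_reproducible):
    """Binary search for the one file whose special compilation fixes the run.

    Assumes `files` is non-empty and contains exactly one culprit.
    `run_is_reproducible(flagged)` is a hypothetical callback: rebuild with
    the istanbul CPU target applied to `flagged`, rerun, return True/False.
    """
    lo, hi = 0, len(files)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        # Flag the first half of the remaining range; if that restores
        # reproducibility, the culprit is in that half.
        if run_is_reproducible(files[lo:mid]):
            hi = mid
        else:
            lo = mid
    return files[lo]
```

This needs only about log2(N) rebuild-and-run cycles instead of one per file, which matters when each cycle is a full model build plus a batch job.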
The code is in file: components/cam/src/chemistry/utils/read_volc_radiation_data.F90.
@singhbalwinder (and @cameronsmith1 and @PeterCaldwell), Thank you! Adding the istanbul CPU target to the compilation of read_volc_radiation_data.F90 solved the problem. I need to back out my other modifications now, to verify that this is the only change that is needed.
@worleyph - do you understand what about volc_radiation_data.F90 requires the istanbul CPU target? Is there some best-practices coding approach we're not following for that file? I'm just trying to turn this finding into a teachable moment.
@PeterCaldwell - this is a PGI-specific issue that only shows up on Titan. I've looked at the other files that require this in the past, and nothing was obvious. I've stopped looking and just determine this empirically. Since Intel is being used for production at the moment, and since pgiacc is not useful at the moment (it works, but is slow, due also to a compiler bug), maintaining PGI correctness is simply a holding operation, keeping PGI as one more test platform for the code.
The only thing I learned is that new code needs to be included in tests that are run on Titan/pgi. Tests with CMIP6 are being run, but apparently have not made it to Titan yet.
Ok, sounds reasonable. Thanks @worleyph .
Thanks @worleyph . I will take a look at that file once more to see if anything stands out. Do you have a list of the files which require this flag? It might help in finding out what might be causing this.
@singhbalwinder , while I feel that is not a good way to spend your time ... please look at Depends.titan.pgi in cime/config/e3sm/machines/
Thanks @worleyph . I just took a quick look at the list and it also has the file tracer_data.F90. I borrowed some of its functionality in read_volc_radiation_data.F90, so I am now not that surprised that this file would also need that special flag. Another file there was interpolate_data.F90, which contains interpolation routines, so this PGI-specific bug might be related to a combination of reading netcdf and interpolating data (just a guess).
Update: I needed to add a special rule (to add the istanbul CPU target) for both check_energy.F90 and read_volc_radiation_data.F90 . I have not run with -cosp in env_build.xml. I want to add this, and then also check the high resolution case again. If all of these pass, I'll submit a bug fix PR to address (and close) this issue.
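For reference, the entries in the CIME Depends.* files are ordinary make pattern rules. A sketch of what the added rules might look like (the variable names follow the usual CIME Macros conventions, and the exact contents of Depends.titan.pgi are not reproduced here, so treat this as illustrative only):

```makefile
# Sketch: compile these two files with the istanbul CPU target (PGI -tp flag)
# to restore BFB reproducibility across thread counts on Titan/PGI.
ISTANBUL_OBJS := check_energy.o read_volc_radiation_data.o

$(ISTANBUL_OBJS): %.o: %.F90
	$(FC) -c $(INCLDIR) $(INCS) $(FFLAGS) -tp=istanbul $<
```

The static-pattern rule overrides the default Fortran compile rule for just those objects, so the rest of ATM keeps the normal optimization target.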
For the case described above, changing the ATM threading in env_mach_pes.xml from 1 thread to 2 (even when BUILD_THREADED=TRUE in both runs) will change the output in atm.log (and many other fields). This is not true for the other components. This lack of reproducibility can be eliminated by adding the istanbul CPU target in env_mach_specific.xml (though it is really only required for ATM). The current Depends.titan.pgi file adds this CPU target to specific files when compiling, but apparently additional files are needed for this case, or perhaps recent updates to the PGI compiler version require this. In any case, we need to either quickly find the offending file or reapply this CPU target more globally.