E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM
Other
341 stars 348 forks source link

Some compsets not B4B wrt thread count changes in ATM on Titan/PGI #2227

Closed worleyph closed 6 years ago

worleyph commented 6 years ago

For the case

 -compset A_WCYCL1950S_CMIP6_LR -res ne30_oECv3_ICG -mach titan -compiler pgi

Changing

  <entry id="NTHRDS">
  ...
    <value compclass="ATM">1</value>

to

  <entry id="NTHRDS">
  ...
    <value compclass="ATM">2</value>

even when

  <entry id="BUILD_THREADED" value="TRUE">

will change the output (atm.log):

 5874c5874
 <   omega =  -0.577105429346257E-04  0.103056972035999E-03  0.109216360956877E-01
 ---
 >   omega =  -0.577105429346257E-04  0.103056972035999E-03  0.109216360956909E-01

(and many others), and then

 6078,6124c6078,6124
 <  nstep, te        2   0.33533902292320843E+10   0.33533895810875187E+10  -0.35840054712975032E-04   0.98531199320322747E+05
 ...
 >  nstep, te        2   0.33533902292249117E+10   0.33533895810776653E+10  -0.35840202950360601E-04   0.98531199320290354E+05

This is not true for the other components. This lack of reproducibility can be elimianted by adding

<env name="CRAY_CPU_TARGET">istanbul</env>

to env_mach_speciifc.xml (though it is really only required for ATM). The current Depends.titan.pgi file add this CPU target to specific files when compiling, but apparently additional files are needed for this case, or perhaps recent updates to the PGI compiler version require this. In any case, we need to either quickly find the offending file or reapply this CPU target more globally.

rljacob commented 6 years ago

We should also check to see if this is a problem on other platforms. @ndkeen can you check cori-knl and @amametjanov can you check theta? You could try PET_Ln9_PS.ne30_oECv3_ICG.A_WCYCL1950S_CMIP6_LR .

amametjanov commented 6 years ago

No problems on Theta (with git master v1.0.0-beta.3-2399-g03b8e4ad3):

> ./cs.status.20180402_173214_g7dhdr 
  PET_Ln9_PS.ne30_oECv3_ICG.A_WCYCL1950S_CMIP6_LR.theta_intel (Overall: PASS) details:
    PASS PET_Ln9_PS.ne30_oECv3_ICG.A_WCYCL1950S_CMIP6_LR.theta_intel CREATE_NEWCASE
    PASS PET_Ln9_PS.ne30_oECv3_ICG.A_WCYCL1950S_CMIP6_LR.theta_intel XML
    PASS PET_Ln9_PS.ne30_oECv3_ICG.A_WCYCL1950S_CMIP6_LR.theta_intel SETUP
    PASS PET_Ln9_PS.ne30_oECv3_ICG.A_WCYCL1950S_CMIP6_LR.theta_intel SHAREDLIB_BUILD time=337
    PASS PET_Ln9_PS.ne30_oECv3_ICG.A_WCYCL1950S_CMIP6_LR.theta_intel MODEL_BUILD time=508
    PASS PET_Ln9_PS.ne30_oECv3_ICG.A_WCYCL1950S_CMIP6_LR.theta_intel SUBMIT
    PASS PET_Ln9_PS.ne30_oECv3_ICG.A_WCYCL1950S_CMIP6_LR.theta_intel RUN time=528
    PASS PET_Ln9_PS.ne30_oECv3_ICG.A_WCYCL1950S_CMIP6_LR.theta_intel COMPARE_base_single_thread
    PASS PET_Ln9_PS.ne30_oECv3_ICG.A_WCYCL1950S_CMIP6_LR.theta_intel MEMLEAK insuffiencient data for memleak test
amametjanov commented 6 years ago

Result from Cori-KNL is also a PASS (with git master v1.0.0-beta.3-2399-g03b8e4ad3)

> ./cs.status.20180402_103710_derqdb 
  PET_Ln9_PS.ne30_oECv3_ICG.A_WCYCL1950S_CMIP6_LR.cori-knl_intel (Overall: PASS) details:
    PASS PET_Ln9_PS.ne30_oECv3_ICG.A_WCYCL1950S_CMIP6_LR.cori-knl_intel CREATE_NEWCASE
    PASS PET_Ln9_PS.ne30_oECv3_ICG.A_WCYCL1950S_CMIP6_LR.cori-knl_intel XML
    PASS PET_Ln9_PS.ne30_oECv3_ICG.A_WCYCL1950S_CMIP6_LR.cori-knl_intel SETUP
    PASS PET_Ln9_PS.ne30_oECv3_ICG.A_WCYCL1950S_CMIP6_LR.cori-knl_intel SHAREDLIB_BUILD time=682
    PASS PET_Ln9_PS.ne30_oECv3_ICG.A_WCYCL1950S_CMIP6_LR.cori-knl_intel MODEL_BUILD time=1440
    PASS PET_Ln9_PS.ne30_oECv3_ICG.A_WCYCL1950S_CMIP6_LR.cori-knl_intel SUBMIT
    PASS PET_Ln9_PS.ne30_oECv3_ICG.A_WCYCL1950S_CMIP6_LR.cori-knl_intel RUN time=631
    PASS PET_Ln9_PS.ne30_oECv3_ICG.A_WCYCL1950S_CMIP6_LR.cori-knl_intel COMPARE_base_single_thread
    PASS PET_Ln9_PS.ne30_oECv3_ICG.A_WCYCL1950S_CMIP6_LR.cori-knl_intel MEMLEAK insuffiencient data for memleak test
worleyph commented 6 years ago

For Titan/PGI, just checked both

 --compset FC5AV1C-L --res ne4_ne4 

and

 --compset A_WCYCL1850S --res ne30_oECv3_ICG

and they were both reproducible with the same changes to env_mach_pes.xml for which

 -compset A_WCYCL1950S_CMIP6_LR -res ne30_oECv3_ICG 

was not reproducible (changing threading from 1 to 2 in ATM only, with BUILD_THREADED=TRUE). So, something particular to this compset is causing the problem.

PeterCaldwell commented 6 years ago

Hi @worleyph - this is alarming. Another compset to try would be A_WCYCL1850S_CMIP6. This is what we're using for the CMIP6 "deck" runs and also exercised identical code to the 1950 compset until changes went onto master last Tuesday which added some more differences.

worleyph commented 6 years ago

@PeterCaldwell , this is just a Titan issue, and only for PGI. I would not be concerned. But thanks for the tip.

worleyph commented 6 years ago

@PeterCaldwell , since you are "listening", when comparing env_build.xml for the two compsets (A_WCYCL1850S and A_WCYCL1950S_CMIP6_LR), some of the entries for A_WCYCL1950S_CMIP6_LR look weird:

  <entry id="CLM_CONFIG_OPTS" value="-phys clm4_5 -phys clm4_5 -cppdefs -DMODAL_AER -phys clm4_5 -cppdefs '-DMODAL_AER -DAPPLY_POST_DECK_BUGFIXES'">

and

 <entry id="CAM_CONFIG_OPTS" value="-phys cam5 -clubb_sgs -microphys mg2 -chem linoz_mam4_resus_mom_soag -rain_evap_to_coarse_aero -nlev 72 -clubb_sgs -microphys mg2 -chem linoz_mam4_resus_mom_soag -rain_evap_to_coarse_aero -nlev 72 -clubb_sgs -microphys mg2 -chem linoz_mam4_resus_mom_soag -rain_evap_to_coarse_aero -cppdefs -DAPPLY_POST_DECK_BUGFIXES -nlev 72 -bc_dep_to_snow_updates">

I guess that they are okay, but a lot of duplicate or triplicate entries, e.g. -clubb_sgs (3 times!), -microphys mg2 (3 times), etc.

worleyph commented 6 years ago

Only thing I see at the moment is A_WCYCL1850S has both

 -chem linoz_mam4_resus_mom
 -chem linoz_mam4_resus_mom_soag

in CAM_CONFIG_OPTS, but A_WCYCL1950S_CMIP6_LR only has

 -chem linoz_mam4_resus_mom_soag

(and -DAPPLY_POST_DECK_BUGFIXES). Guess that I will look at the APPLY_POST_DECK_BUGFIXES ifdefs.

PeterCaldwell commented 6 years ago

@worleyph - before looking at APPLY_POST_DECK_BUGFIXES, I'd ask whether you saw non-BFB behavior before last Tuesday because that's when APPLY_POST_DECK_BUGFIXES was added. First, I'd check whether A_WCYCL1850S_CMIP6 has the same problems, because it should have identical code choices to 1950, but without APPLY_POST_DECK_BUGFIXES.

PeterCaldwell commented 6 years ago

In terms of repeated C?M_CONFIG_OPTS: this seems like something that is annoying but harmless... I'd suggest waiting until after the release to clean this up.

Having both chem=linoz_mam4_resus_mom and linoz_mam4_resus_mom_soag seems like a problem since (I imagine) chem can only be one or the other of these choices. Which one is chosen? Is this behavior robust? Pinging @cameronsmith1 because he's the linoz expert.

worleyph commented 6 years ago

@PeterCaldwell : Too late! But adding the istanbul cpu target to the files containing APPLY_POST_DECK_BUGFIXES token did not eliminate the loss of reproducibility. I'll try your suggestion next. Thanks.

cameronsmith1 commented 6 years ago

Buildnamelist is ugly. It contains many IF statements that can create or duplicate CONFIG_OPTS. This should be harmless, as later options should supersede earlier options. Cleaning up the mess is complicated, especially if we don't want to change previous compsets. Personally, I think we need a deep dive on this, but not until after the release.

worleyph commented 6 years ago

@cameronsmith1 , thanks. I did compare build logs, and ATM for A_WCYCL1850S_CMIP6 and A_WCYCL1850S are being compiled the same, so the duplicates/triplicates are not causing any problems.

@PeterCaldwell , A_WCYCL1850S_CMIP6 is not reproducible while A_WCYCL1850S is. Got to be a runtime difference then? The differences in env_run.xml are minimal and do not provide much guidance. More digging ...

cameronsmith1 commented 6 years ago

The main differences between A_WCYCL1850S_CMIP6 and A_WCYCL1850S involve different datasets and different parameter values, and it is hard to see how that could cause non-reproducibility with different PE layouts. The one thing I can think of that changed and affected the code is the stratospheric aerosols. This was a late addition because there was no good way to simply modify the dataset. The new stratospheric aerosol code does involve horizontal regridding, so it is conceivable that there could be a threading issue.

I checked with @golaz , and he confirmed that the CMIP6 version is BFB with different PE layouts, but that was only tested on Edison, and we think all the PE_layouts used nthreads=1.

To switch the volcanic aerosol code between the CMIP6 type and the pre-CMIP6 type, use the prescribed_volcaero_X options in the CAM namelist (I assume the differences are obvious from comparing the namelists from your CMIP6 and pre-CMIP6simulations).

I am pinging @singhbalwinder , who wrote the new strat-aerosol coding.

worleyph commented 6 years ago

@cameronsmith1 - sorry - this keeps scaring people. THIS IS ONLY AN ISSUE ON TITAN AND ONLY FOR PGI. I know how to solve it - reintroduce the istanbul CPU target when building ATM. This is overkill, and we have been trying to get away from this. I just need to identify which additional files need to compiled in this special way.

worleyph commented 6 years ago

@cameronsmith1 , the differences I see in atm_in are

 135,138c135
 < &cospsimulator_nl
 <  cosp_lite       = .true.
 246,247d242
 <  ieflx_opt       =  2     
 <  l_ieflx_fix     =  .true.
 280,286d274
 < &prescribed_volcaero_nl
 <  prescribed_volcaero_cycle_yr        = 1                                                        
 <  prescribed_volcaero_datapath        = '/lustre/atlas1/cli900/world-shared/cesm/inputdata/atm/cam/volc'
 <  prescribed_volcaero_file        = 'CMIP_DOE-ACME_radiation_average_1850-2014_v3_c20171204.nc'
 <  prescribed_volcaero_filetype        = 'VOLC_CMIP6'
 <  prescribed_volcaero_type        = 'CYCLICAL'
 < /

I'll look at the new stratospheric aerosol code then (thanks).

cameronsmith1 commented 6 years ago

OK. Those other namelist changes are to (a) activate the COSP simulator and to (b) activate the new energy fixer. Those are also candidates for giving the problem you see. Both could easily be turned off for testing purposes.

singhbalwinder commented 6 years ago

Thanks @cameronsmith1 for pinging me. Like @cameronsmith1 mentioned, we did do BFB testing with those codes. I ran few tests on PNNL cluster Constance as well. I did all the testing using Intel compiler and I think @golaz also used Intel on Edison.

worleyph commented 6 years ago

I am pinging @singhbalwinder , who wrote the new strat-aerosol coding.

@singhbalwinder , where would I find the new strat-aerosol coding? I looked at check_energy.F90 and prescribed_volcaero.F90, and neither of them was sufficient to eliminate the non-reproducibility (on Titan, using PGI, without using the istanbul CPU target globally). The new stratospheric aerosol code is my last idea before going back to bisecting all of ATM. Thanks.

singhbalwinder commented 6 years ago

The code is in file: components/cam/src/chemistry/utils/read_volc_radiation_data.F90.

worleyph commented 6 years ago

@singhbalwinder (and @cameronsmith1 and @PeterCaldwell), Thank you! Adding the istanbul CPU target to the compilation of read_volc_radiation_data.F90 solved the problem. I need to back out my other modifications now, to verify that this is the only change that is needed.

PeterCaldwell commented 6 years ago

@worleyph - do you understand what about volc_radiation_data.F90 requires the istanbul CPU target? Is there some best-practices coding approach we're not following for that file? I'm just trying to turn this finding into a teachable moment.

worleyph commented 6 years ago

@PeterCaldwell - this is a PGI-specific issue that only shows up on Titan. I've looked at the other files that require this in the past, and nothing was obvious. I've stopped looking, and just determine this empirically. SInce intel is being used for production at the moment, and since pgiacc is not useful at the moment (works, but is slow), due also to a compiler bug, maintaining PGI correctness is simply a holding operation and as one more test platform for the code.

The only thing I learned is that new code needs to be included in tests that are run on Titan/pgi. Tests with CMIP6 are being run, but apparently have not made it to Titan yet.

PeterCaldwell commented 6 years ago

Ok, sounds reasonable. Thanks @worleyph .

singhbalwinder commented 6 years ago

Thanks @worleyph . I will take a look a that file once more see if anything stands out. Do you have a list of the files which require this flag? It might help in finding out what might be causing this.

worleyph commented 6 years ago

@singhbalwinder , while I feel that is not a good way to spend your time ... please look at Depends.titan.pgi in cime/config/e3sm/machines/

singhbalwinder commented 6 years ago

Thanks @worleyph . I just took a quick look at the list and it also has file tracer_data.F90. I borrowed some of its functionality in read_volc_radiation_data.F90, so I am now not that surprised that this file would also need that special flag. Other file there was interpolate_data.F90 which contain interpolation routines, so this PGI specific bug might be related to a combination of reading netcdf and interpolating data (just a guess).

worleyph commented 6 years ago

Update: I needed to add a special rule (to add the istanbul CPU target) for both check_energy.F90 and read_volc_radiation_data.F90 . I have not run with -cosp in env_build.xml. I want to add this, and then also check the high resolution case again. If all of these pass, I'll submit a bug fix PR to address (and close) this issue.