E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM

New ZM sensitive to optimization #5702

Open rljacob opened 1 year ago

rljacob commented 1 year ago

The new ZM features in https://github.com/E3SM-Project/E3SM/pull/5478 are sensitive to optimization on some machines/compilers. Sensitivity is seen both with new features on and off.

With new features OFF: On anlgce/gnu (gcc 11.1): zm_conv.F90 had to be reduced from -O2 to no optimization to avoid an internal compiler error (ICE). See https://github.com/E3SM-Project/E3SM/pull/5478/commits/fe44772cbe4deb313106a3a8f81156da4d1ae2d0

On pm-cpu/gnu (gcc 11.2): Have to add zm_conv.F90 to NOOPT to ensure BFB with master even when features are turned off. Testing against baselines: answers changed for all cases with EAM EXCEPT those with MMF.

When new features are ON: pm-cpu/nvidia ERS test failed. ERS_D.ne4pg2_oQU480.F2010.pm-cpu_nvidia.eam-zm_enhancements. A fix was made.

On chrysalis/intel: optimization had to be reduced from -O3 to -O2 for two files (zm_conv.F90 and zm_microphysics.F90); otherwise RESTOM changed significantly compared to a baseline on compy. Testing against baselines: no change to answers?

These problems were found in testing before PR #5478 was merged to master.

crterai commented 1 year ago

The ERS fail actually required a code change: https://github.com/E3SM-Project/E3SM/pull/5478/commits/ebc87e5e1dd4ffefbef7338aad35cbb2e484561e This page has more details on why the optimization needed to be reduced: https://acme-climate.atlassian.net/wiki/spaces/NGDAP/pages/3764420634/Tracking+non-BFB-ness+across+machines+with+ZM+enhancement+PR

rljacob commented 1 year ago

Thanks @crterai. Looks like that was for pm-cpu with nvidia compiler. The gnu change we were looking at was for anlgce. I updated the description.

rljacob commented 1 year ago

Was the "ERS fail" a build fail, a run fail or the test itself failed?

sarats commented 1 year ago

We should also add:

Lowering the opt level on zm_conv.F90 on pm-cpu_gnu ensured that results were BFB with original master when the features were turned off.

I added to overall description, edit as needed.
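As background on why an optimization level alone can break BFB (bit-for-bit) agreement: floating-point addition is not associative, so a compiler that reorders or fuses arithmetic at a higher optimization level can legitimately change the last bits of a result. The sketch below is only a generic Python illustration of that effect; it is not taken from zm_conv.F90, and the helper name `bits` is made up for this example.

```python
import struct

def bits(x):
    """Return the raw 64-bit pattern of a double as a hex string."""
    return struct.pack(">d", x).hex()

a, b, c = 0.1, 0.2, 0.3

# Same three numbers, two evaluation orders. An optimizer may pick either
# order when it is allowed to reassociate floating-point arithmetic.
left = (a + b) + c
right = a + (b + c)

print(bits(left))
print(bits(right))
print("BFB" if bits(left) == bits(right) else "not BFB")
```

A difference like this in one column of a convection scheme is enough to make a whole simulation diverge from its baseline, which is why pinning the optimization level of a single file can restore BFB results.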

crterai commented 1 year ago

The ERS fail on pm-cpu_gnu was while comparing base with restart.

rljacob commented 1 year ago

That would be "the test itself". Do you recall which test exactly?

crterai commented 1 year ago

The test was when I turned on the new features. Here it is:

  ERS.ne4pg2_oQU480.F2010.pm-cpu_gnu.eam-zm_enhancements (Overall: FAIL) details:
    PASS ERS.ne4pg2_oQU480.F2010.pm-cpu_gnu.eam-zm_enhancements CREATE_NEWCASE
    PASS ERS.ne4pg2_oQU480.F2010.pm-cpu_gnu.eam-zm_enhancements XML
    PASS ERS.ne4pg2_oQU480.F2010.pm-cpu_gnu.eam-zm_enhancements SETUP
    PASS ERS.ne4pg2_oQU480.F2010.pm-cpu_gnu.eam-zm_enhancements SHAREDLIB_BUILD time=73
    PASS ERS.ne4pg2_oQU480.F2010.pm-cpu_gnu.eam-zm_enhancements MODEL_BUILD time=85
    PASS ERS.ne4pg2_oQU480.F2010.pm-cpu_gnu.eam-zm_enhancements SUBMIT
    PASS ERS.ne4pg2_oQU480.F2010.pm-cpu_gnu.eam-zm_enhancements RUN time=118
    FAIL ERS.ne4pg2_oQU480.F2010.pm-cpu_gnu.eam-zm_enhancements COMPARE_base_rest
    PASS ERS.ne4pg2_oQU480.F2010.pm-cpu_gnu.eam-zm_enhancements MEMLEAK
    PASS ERS.ne4pg2_oQU480.F2010.pm-cpu_gnu.eam-zm_enhancements SHORT_TERM_ARCHIVER
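
To pull the failing phase out of a status listing like the one above, a small script can help. This is a minimal sketch that assumes each phase line has the form `STATUS TEST_NAME PHASE [extra]`, as in the pasted output; the function name `failed_phases` is hypothetical and not part of CIME.

```python
def failed_phases(status_text):
    """Return (test_name, phase) pairs for every line whose status is FAIL."""
    failures = []
    for line in status_text.splitlines():
        parts = line.split()
        # Phase lines start with the status; the "(Overall: ...)" summary
        # line starts with the test name, so it is skipped here.
        if len(parts) >= 3 and parts[0] == "FAIL":
            failures.append((parts[1], parts[2]))
    return failures

sample = """\
  ERS.ne4pg2_oQU480.F2010.pm-cpu_gnu.eam-zm_enhancements (Overall: FAIL) details:
    PASS ERS.ne4pg2_oQU480.F2010.pm-cpu_gnu.eam-zm_enhancements RUN time=118
    FAIL ERS.ne4pg2_oQU480.F2010.pm-cpu_gnu.eam-zm_enhancements COMPARE_base_rest
    PASS ERS.ne4pg2_oQU480.F2010.pm-cpu_gnu.eam-zm_enhancements MEMLEAK
"""

for test_name, phase in failed_phases(sample):
    print(f"{test_name}: {phase}")
```

Here only COMPARE_base_rest is reported, i.e. the build and run succeeded and the restart comparison itself failed.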

rljacob commented 1 year ago

Never mind, I see it's in the page you pointed to.

dqwu commented 1 year ago

On anlgce, the ICE build error is something like:

[ 94%] Building Fortran object cmake/atm/CMakeFiles/atm.dir/__/__/eam/src/chemistry/modal_aero/aero_model.F90.o
cd /scratch/jenkins-slave/workspace/E3SM_DEVELOPER_TESTS/SMS_Ln5.ne4pg2_oQU480.F2010.anlgce_gnu.C.20230518_004031_773doc/bld/cmake-bld/cmake/atm && python3 /scratch/jenkins-slave/workspace/E3SM_DEVELOPER_TESTS/test_root/SMS_Ln5.ne4pg2_oQU480.F2010.anlgce_gnu.C.20230518_004031_773doc/Tools/e3sm_compile_wrap.py  /nfs/gce/projects/climate/software/linux-ubuntu20.04-x86_64/mpich/4.0/gcc-11.1.0/bin/mpif90 -DBIT64 -DCAM -DCLUBB_CAM -DCLUBB_REAL_TYPE=dp -DCLUBB_SGS -DCO2A -DCPRGNU -DFORTRANUNDERSCORE -DHAVE_COMM_F2C -DHAVE_F2003_PTR_BND_REMAP -DHAVE_GETTIMEOFDAY -DHAVE_MPI -DHAVE_NANOTIME -DHAVE_SLASHPROC -DHAVE_TIMES -DHAVE_VPRINTF -DHOMME_ENABLE_COMPOSE -DLINUX -DLSMLAT=1 -DLSMLON=1 -DMAXPATCH_PFT=numpft+1 -DMCT_INTERFACE -DMODAL_AER -DMODAL_AERO -DMODAL_AERO_4MODE_MOM -DMODEL_THETA_L -DNC=4 -DNDEBUG -DNO_LAPACK_ISNAN -DNO_R16 -DNP=4 -DNPG=2 -DN_RAD_CNST=30 -DPCNST=40 -DPCOLS=16 -DPLAT=1 -DPLEV=72 -DPLON=384 -DPSUBCOLS=1 -DPTRK=1 -DPTRM=1 -DPTRN=1 -DRAIN_EVAP_TO_COARSE_AERO -DSPMD -D_MPI -D_PNETCDF -D_PRIM -I/scratch/jenkins-slave/workspace/E3SM_DEVELOPER_TESTS/E3SM/components/cmake/atm/. 
-I/scratch/jenkins-slave/workspace/E3SM_DEVELOPER_TESTS/SMS_Ln5.ne4pg2_oQU480.F2010.anlgce_gnu.C.20230518_004031_773doc/bld/gnu/mpich/nodebug/nothreads/mct/mct/noesmf/c1a1l1i1o1r1g1w1i1e1/include -I/nfs/gce/projects/climate/software/linux-ubuntu20.04-x86_64/netcdf/4.8.0c-4.3.1cxx-4.5.3f-parallel/mpich-4.0/gcc-11.1.0/include -I/nfs/gce/projects/climate/software/linux-ubuntu20.04-x86_64/pnetcdf/1.12.2/mpich-4.0/gcc-11.1.0/include -I/scratch/jenkins-slave/workspace/E3SM_DEVELOPER_TESTS/test_root/SMS_Ln5.ne4pg2_oQU480.F2010.anlgce_gnu.C.20230518_004031_773doc/SourceMods/src.eam -I/scratch/jenkins-slave/workspace/E3SM_DEVELOPER_TESTS/E3SM/components/eam/src/chemistry/pp_linoz_mam4_resus_mom_soag -I/scratch/jenkins-slave/workspace/E3SM_DEVELOPER_TESTS/E3SM/components/eam/src/chemistry/modal_aero -I/scratch/jenkins-slave/workspace/E3SM_DEVELOPER_TESTS/E3SM/components/eam/src/chemistry/aerosol -I/scratch/jenkins-slave/workspace/E3SM_DEVELOPER_TESTS/E3SM/components/eam/src/chemistry/mozart -I/scratch/jenkins-slave/workspace/E3SM_DEVELOPER_TESTS/E3SM/components/eam/src/chemistry/utils -I/scratch/jenkins-slave/workspace/E3SM_DEVELOPER_TESTS/E3SM/components/eam/src/physics/rrtmg -I/scratch/jenkins-slave/workspace/E3SM_DEVELOPER_TESTS/E3SM/components/eam/src/physics/rrtmg/ext/rrtmg_mcica -I/scratch/jenkins-slave/workspace/E3SM_DEVELOPER_TESTS/E3SM/components/eam/src/physics/rrtmg/ext/rrtmg_lw -I/scratch/jenkins-slave/workspace/E3SM_DEVELOPER_TESTS/E3SM/components/eam/src/physics/rrtmg/ext/rrtmg_sw -I/scratch/jenkins-slave/workspace/E3SM_DEVELOPER_TESTS/E3SM/components/eam/src/physics/cam -I/scratch/jenkins-slave/workspace/E3SM_DEVELOPER_TESTS/E3SM/components/eam/src/physics/clubb -I/scratch/jenkins-slave/workspace/E3SM_DEVELOPER_TESTS/E3SM/components/eam/src/physics/p3/eam -I/scratch/jenkins-slave/workspace/E3SM_DEVELOPER_TESTS/E3SM/components/eam/src/dynamics/se -I/scratch/jenkins-slave/workspace/E3SM_DEVELOPER_TESTS/E3SM/components/homme/src/share 
-I/scratch/jenkins-slave/workspace/E3SM_DEVELOPER_TESTS/E3SM/components/homme/src/theta-l -I/scratch/jenkins-slave/workspace/E3SM_DEVELOPER_TESTS/E3SM/components/homme/src/theta-l/share -I/scratch/jenkins-slave/workspace/E3SM_DEVELOPER_TESTS/E3SM/components/homme/src/share/compose -I/scratch/jenkins-slave/workspace/E3SM_DEVELOPER_TESTS/E3SM/components/eam/src/cpl -I/scratch/jenkins-slave/workspace/E3SM_DEVELOPER_TESTS/E3SM/components/eam/src/control -I/scratch/jenkins-slave/workspace/E3SM_DEVELOPER_TESTS/E3SM/components/eam/src/utils -I/scratch/jenkins-slave/workspace/E3SM_DEVELOPER_TESTS/SMS_Ln5.ne4pg2_oQU480.F2010.anlgce_gnu.C.20230518_004031_773doc/bld/lnd/obj -I/scratch/jenkins-slave/workspace/E3SM_DEVELOPER_TESTS/SMS_Ln5.ne4pg2_oQU480.F2010.anlgce_gnu.C.20230518_004031_773doc/bld/gnu/mpich/nodebug/nothreads/mct/include -I/usr/include   -mcmodel=medium -fconvert=big-endian -ffree-line-length-none -ffixed-line-length-none -fallow-argument-mismatch -O -O2 -fallow-argument-mismatch -fallow-invalid-boz   -ffree-form -DUSE_CONTIGUOUS=  -c /scratch/jenkins-slave/workspace/E3SM_DEVELOPER_TESTS/E3SM/components/eam/src/chemistry/modal_aero/aero_model.F90 -o CMakeFiles/atm.dir/__/__/eam/src/chemistry/modal_aero/aero_model.F90.o
gfortran: internal compiler error: Segmentation fault signal terminated program f951
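
Note that the command above passes both `-O` and `-O2`. With GCC, when several `-O` options are given, the last one is the one that takes effect. The following is a rough sketch for checking which level a logged command actually ends up with; the function name `effective_opt_level` is hypothetical, and the regular expression covers only the common GCC spellings.

```python
import re

# Matches -O, -O0..-O3, -Os, -Og, -Oz and -Ofast (but not the lowercase -o).
OPT_FLAG = re.compile(r"-O([0-3sgz]|fast)?")

def effective_opt_level(command):
    """Return the last -O flag in a compile command, or None if there is none."""
    flags = [tok for tok in command.split() if OPT_FLAG.fullmatch(tok)]
    return flags[-1] if flags else None

# Shortened stand-in for the logged command line (assumption: same flag order).
cmd = "mpif90 -mcmodel=medium -O -O2 -ffree-form -c aero_model.F90 -o aero_model.F90.o"
print(effective_opt_level(cmd))
```

Running the same check on a build log after a per-file override is applied is one way to confirm that the intended level really reached the compiler.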

Expected to be fixed by https://github.com/E3SM-Project/E3SM/commit/fe44772cbe4deb313106a3a8f81156da4d1ae2d0

For GNU compilers, this ICE seems to be reproducible only with GCC 11 or higher: it is reproducible on anlgce (11.1) but not on mappy (8.1 or 9.2).

Besides anlgce, we also need to look at other E3SM machines that use GCC 11 or higher. As mentioned by @sarats, Frontier/Crusher use gcc 11.2.

rljacob commented 1 year ago

And Google Cloud has 12.2 and doesn't see this problem. We can conclude 11.1 is buggy and anlgce should upgrade.

grnydawn commented 1 year ago

It seems that the Cray Fortran compiler (15.0.1) on Crusher has this issue too. Most of the build errors shown in the CDash Crusher tests (https://my.cdash.org/viewTest.php?onlyfailed&buildid=2337708) have ZM_CONV error messages.

ambrad commented 1 year ago

> It seems that the Cray Fortran compiler (15.0.1) on Crusher has this issue too. Most of the build errors shown in the CDash Crusher tests (https://my.cdash.org/viewTest.php?onlyfailed&buildid=2337708) have ZM_CONV error messages.

The remaining errors are related to jctop and jcbot. These were not modified in #5724.