Open rljacob opened 1 year ago
The ERS fail was actually a code change that needed to be made - https://github.com/E3SM-Project/E3SM/pull/5478/commits/ebc87e5e1dd4ffefbef7338aad35cbb2e484561e This page has more details on why the optimization needed to be reduced. https://acme-climate.atlassian.net/wiki/spaces/NGDAP/pages/3764420634/Tracking+non-BFB-ness+across+machines+with+ZM+enhancement+PR
Thanks @crterai. Looks like that was for pm-cpu with nvidia compiler. The gnu change we were looking at was for anlgce. I updated the description.
Was the "ERS fail" a build fail, a run fail or the test itself failed?
We should also add: lowering the opt level on zm_conv.F90 on pm-cpu_gnu made sure that results were BFB with original master when the features were turned off.
I added to overall description, edit as needed.
The ERS fail on pm-cpu_gnu was while comparing base with restart.
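For readers unfamiliar with what "BFB" means in a base-versus-restart comparison, here is a minimal sketch. It is not how the test framework compares history files; the function names and sample values are hypothetical and only illustrate that the comparison is on raw bits, not on a tolerance.

```python
import struct


def bits(value: float) -> bytes:
    """Return the raw IEEE-754 double representation of a float."""
    return struct.pack("<d", value)


def first_non_bfb(base, rest):
    """Return the index of the first element whose bits differ, or None if BFB."""
    if len(base) != len(rest):
        raise ValueError("fields have different lengths")
    for i, (a, b) in enumerate(zip(base, rest)):
        if bits(a) != bits(b):
            return i
    return None


# Hypothetical values: 0.1 + 0.2 and 0.3 agree to ~1e-16 but are not bit-identical.
base_field = [1.0, 0.1 + 0.2, 3.5]
rest_field = [1.0, 0.3, 3.5]

print(first_non_bfb(base_field, base_field))  # identical fields
print(first_non_bfb(base_field, rest_field))  # differs at one element
```

A difference in the last bit, such as one introduced by a different floating-point evaluation order under optimization, is enough to fail this kind of check.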
That would be "the test itself". Do you recall which test exactly?
The test was when I turned on the new features. Here it is:
ERS.ne4pg2_oQU480.F2010.pm-cpu_gnu.eam-zm_enhancements (Overall: FAIL) details:
PASS ERS.ne4pg2_oQU480.F2010.pm-cpu_gnu.eam-zm_enhancements CREATE_NEWCASE
PASS ERS.ne4pg2_oQU480.F2010.pm-cpu_gnu.eam-zm_enhancements XML
PASS ERS.ne4pg2_oQU480.F2010.pm-cpu_gnu.eam-zm_enhancements SETUP
PASS ERS.ne4pg2_oQU480.F2010.pm-cpu_gnu.eam-zm_enhancements SHAREDLIB_BUILD time=73
PASS ERS.ne4pg2_oQU480.F2010.pm-cpu_gnu.eam-zm_enhancements MODEL_BUILD time=85
PASS ERS.ne4pg2_oQU480.F2010.pm-cpu_gnu.eam-zm_enhancements SUBMIT
PASS ERS.ne4pg2_oQU480.F2010.pm-cpu_gnu.eam-zm_enhancements RUN time=118
FAIL ERS.ne4pg2_oQU480.F2010.pm-cpu_gnu.eam-zm_enhancements COMPARE_base_rest
PASS ERS.ne4pg2_oQU480.F2010.pm-cpu_gnu.eam-zm_enhancements MEMLEAK
PASS ERS.ne4pg2_oQU480.F2010.pm-cpu_gnu.eam-zm_enhancements SHORT_TERM_ARCHIVER
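To answer "build fail, run fail, or test fail" quickly from a listing like the one above, a small parser can pull out the failing phases. This is a sketch that assumes only the line layout visible in this thread (status, test name, phase, optional extras). The helper names are hypothetical and not part of any E3SM or CIME tool.

```python
from typing import NamedTuple, Optional


class PhaseResult(NamedTuple):
    status: str
    test: str
    phase: str
    extra: Optional[str]


def parse_status_line(line: str) -> Optional[PhaseResult]:
    """Parse 'STATUS test_name PHASE [extras]'; return None for other lines."""
    parts = line.split()
    # Only the two statuses that appear in the listing above are recognized.
    if len(parts) < 3 or parts[0] not in {"PASS", "FAIL"}:
        return None
    extra = " ".join(parts[3:]) or None
    return PhaseResult(parts[0], parts[1], parts[2], extra)


def failed_phases(text: str) -> list:
    """Return the phase names of every FAIL line in the listing."""
    results = (parse_status_line(line) for line in text.splitlines())
    return [r.phase for r in results if r and r.status == "FAIL"]


sample = """\
PASS ERS.ne4pg2_oQU480.F2010.pm-cpu_gnu.eam-zm_enhancements RUN time=118
FAIL ERS.ne4pg2_oQU480.F2010.pm-cpu_gnu.eam-zm_enhancements COMPARE_base_rest
PASS ERS.ne4pg2_oQU480.F2010.pm-cpu_gnu.eam-zm_enhancements MEMLEAK
"""
print(failed_phases(sample))
```

Here the build and run phases pass and only COMPARE_base_rest fails, which is the "test itself" case.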
Never mind, I see it's in that page you pointed to.
On anlgce, the ICE build error is something like:
[ 94%] Building Fortran object cmake/atm/CMakeFiles/atm.dir/__/__/eam/src/chemistry/modal_aero/aero_model.F90.o
cd /scratch/jenkins-slave/workspace/E3SM_DEVELOPER_TESTS/SMS_Ln5.ne4pg2_oQU480.F2010.anlgce_gnu.C.20230518_004031_773doc/bld/cmake-bld/cmake/atm && python3 /scratch/jenkins-slave/workspace/E3SM_DEVELOPER_TESTS/test_root/SMS_Ln5.ne4pg2_oQU480.F2010.anlgce_gnu.C.20230518_004031_773doc/Tools/e3sm_compile_wrap.py /nfs/gce/projects/climate/software/linux-ubuntu20.04-x86_64/mpich/4.0/gcc-11.1.0/bin/mpif90 -DBIT64 -DCAM -DCLUBB_CAM -DCLUBB_REAL_TYPE=dp -DCLUBB_SGS -DCO2A -DCPRGNU -DFORTRANUNDERSCORE -DHAVE_COMM_F2C -DHAVE_F2003_PTR_BND_REMAP -DHAVE_GETTIMEOFDAY -DHAVE_MPI -DHAVE_NANOTIME -DHAVE_SLASHPROC -DHAVE_TIMES -DHAVE_VPRINTF -DHOMME_ENABLE_COMPOSE -DLINUX -DLSMLAT=1 -DLSMLON=1 -DMAXPATCH_PFT=numpft+1 -DMCT_INTERFACE -DMODAL_AER -DMODAL_AERO -DMODAL_AERO_4MODE_MOM -DMODEL_THETA_L -DNC=4 -DNDEBUG -DNO_LAPACK_ISNAN -DNO_R16 -DNP=4 -DNPG=2 -DN_RAD_CNST=30 -DPCNST=40 -DPCOLS=16 -DPLAT=1 -DPLEV=72 -DPLON=384 -DPSUBCOLS=1 -DPTRK=1 -DPTRM=1 -DPTRN=1 -DRAIN_EVAP_TO_COARSE_AERO -DSPMD -D_MPI -D_PNETCDF -D_PRIM -I/scratch/jenkins-slave/workspace/E3SM_DEVELOPER_TESTS/E3SM/components/cmake/atm/. 
-I/scratch/jenkins-slave/workspace/E3SM_DEVELOPER_TESTS/SMS_Ln5.ne4pg2_oQU480.F2010.anlgce_gnu.C.20230518_004031_773doc/bld/gnu/mpich/nodebug/nothreads/mct/mct/noesmf/c1a1l1i1o1r1g1w1i1e1/include -I/nfs/gce/projects/climate/software/linux-ubuntu20.04-x86_64/netcdf/4.8.0c-4.3.1cxx-4.5.3f-parallel/mpich-4.0/gcc-11.1.0/include -I/nfs/gce/projects/climate/software/linux-ubuntu20.04-x86_64/pnetcdf/1.12.2/mpich-4.0/gcc-11.1.0/include -I/scratch/jenkins-slave/workspace/E3SM_DEVELOPER_TESTS/test_root/SMS_Ln5.ne4pg2_oQU480.F2010.anlgce_gnu.C.20230518_004031_773doc/SourceMods/src.eam -I/scratch/jenkins-slave/workspace/E3SM_DEVELOPER_TESTS/E3SM/components/eam/src/chemistry/pp_linoz_mam4_resus_mom_soag -I/scratch/jenkins-slave/workspace/E3SM_DEVELOPER_TESTS/E3SM/components/eam/src/chemistry/modal_aero -I/scratch/jenkins-slave/workspace/E3SM_DEVELOPER_TESTS/E3SM/components/eam/src/chemistry/aerosol -I/scratch/jenkins-slave/workspace/E3SM_DEVELOPER_TESTS/E3SM/components/eam/src/chemistry/mozart -I/scratch/jenkins-slave/workspace/E3SM_DEVELOPER_TESTS/E3SM/components/eam/src/chemistry/utils -I/scratch/jenkins-slave/workspace/E3SM_DEVELOPER_TESTS/E3SM/components/eam/src/physics/rrtmg -I/scratch/jenkins-slave/workspace/E3SM_DEVELOPER_TESTS/E3SM/components/eam/src/physics/rrtmg/ext/rrtmg_mcica -I/scratch/jenkins-slave/workspace/E3SM_DEVELOPER_TESTS/E3SM/components/eam/src/physics/rrtmg/ext/rrtmg_lw -I/scratch/jenkins-slave/workspace/E3SM_DEVELOPER_TESTS/E3SM/components/eam/src/physics/rrtmg/ext/rrtmg_sw -I/scratch/jenkins-slave/workspace/E3SM_DEVELOPER_TESTS/E3SM/components/eam/src/physics/cam -I/scratch/jenkins-slave/workspace/E3SM_DEVELOPER_TESTS/E3SM/components/eam/src/physics/clubb -I/scratch/jenkins-slave/workspace/E3SM_DEVELOPER_TESTS/E3SM/components/eam/src/physics/p3/eam -I/scratch/jenkins-slave/workspace/E3SM_DEVELOPER_TESTS/E3SM/components/eam/src/dynamics/se -I/scratch/jenkins-slave/workspace/E3SM_DEVELOPER_TESTS/E3SM/components/homme/src/share 
-I/scratch/jenkins-slave/workspace/E3SM_DEVELOPER_TESTS/E3SM/components/homme/src/theta-l -I/scratch/jenkins-slave/workspace/E3SM_DEVELOPER_TESTS/E3SM/components/homme/src/theta-l/share -I/scratch/jenkins-slave/workspace/E3SM_DEVELOPER_TESTS/E3SM/components/homme/src/share/compose -I/scratch/jenkins-slave/workspace/E3SM_DEVELOPER_TESTS/E3SM/components/eam/src/cpl -I/scratch/jenkins-slave/workspace/E3SM_DEVELOPER_TESTS/E3SM/components/eam/src/control -I/scratch/jenkins-slave/workspace/E3SM_DEVELOPER_TESTS/E3SM/components/eam/src/utils -I/scratch/jenkins-slave/workspace/E3SM_DEVELOPER_TESTS/SMS_Ln5.ne4pg2_oQU480.F2010.anlgce_gnu.C.20230518_004031_773doc/bld/lnd/obj -I/scratch/jenkins-slave/workspace/E3SM_DEVELOPER_TESTS/SMS_Ln5.ne4pg2_oQU480.F2010.anlgce_gnu.C.20230518_004031_773doc/bld/gnu/mpich/nodebug/nothreads/mct/include -I/usr/include -mcmodel=medium -fconvert=big-endian -ffree-line-length-none -ffixed-line-length-none -fallow-argument-mismatch -O -O2 -fallow-argument-mismatch -fallow-invalid-boz -ffree-form -DUSE_CONTIGUOUS= -c /scratch/jenkins-slave/workspace/E3SM_DEVELOPER_TESTS/E3SM/components/eam/src/chemistry/modal_aero/aero_model.F90 -o CMakeFiles/atm.dir/__/__/eam/src/chemistry/modal_aero/aero_model.F90.o
gfortran: internal compiler error: Segmentation fault signal terminated program f951
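When scanning long build logs for this kind of failure, a short script can pair each "internal compiler error" line with the most recently announced Fortran object. This is a sketch under the assumption that the log uses the "Building Fortran object ..." wording shown above. The function name is hypothetical. Note that in a parallel build the nearest preceding line is only a candidate: the excerpt above shows aero_model.F90 right before the error, while the fix discussed in this thread targets zm_conv.F90.

```python
import re

# Matches the CMake progress line shown in the excerpt and captures the
# object path without its trailing ".o".
BUILD_RE = re.compile(r"Building Fortran object (\S+)\.o\s*$")
ICE_MARKER = "internal compiler error"


def ice_candidates(log_text: str) -> list:
    """Return the last-announced source path before each ICE line."""
    last_built = None
    candidates = []
    for line in log_text.splitlines():
        match = BUILD_RE.search(line)
        if match:
            last_built = match.group(1)
        elif ICE_MARKER in line:
            candidates.append(last_built)
    return candidates


sample_log = """\
[ 94%] Building Fortran object cmake/atm/CMakeFiles/atm.dir/__/__/eam/src/chemistry/modal_aero/aero_model.F90.o
gfortran: internal compiler error: Segmentation fault signal terminated program f951
"""
print(ice_candidates(sample_log))
```

To confirm the culprit, it is safer to check which compile command actually returned the error, or to rebuild serially.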
Expected to be fixed by https://github.com/E3SM-Project/E3SM/commit/fe44772cbe4deb313106a3a8f81156da4d1ae2d0
For GNU compilers, this ICE issue seems to be reproducible only with GCC 11 or higher:
Reproducible on anlgce (11.1)
Not reproducible on mappy (8.1 or 9.2)
Besides anlgce, we also need to look at other E3SM machines that use GCC 11 or higher. As mentioned by @sarats, Frontier/Crusher use gcc 11.2.
And Google Cloud has 12.2 and doesn't see this problem. We can conclude 11.1 is buggy and anlgce should upgrade.
It seems that the Cray Fortran compiler (15.0.1) on Crusher has this issue too. Most of the build errors shown in the CDash Crusher tests (https://my.cdash.org/viewTest.php?onlyfailed&buildid=2337708) have ZM_CONV error messages.
The remaining errors are related to jctop and jcbot. These were not modified in #5724.
The new ZM features in https://github.com/E3SM-Project/E3SM/pull/5478 are sensitive to optimization on some machines/compilers. Sensitivity is seen both with new features on and off.
With new features OFF:
On anlgce/gnu (gcc 11.1): Have to reduce from -O2 to no optimization on zm_conv.F90 to avoid ICE. See https://github.com/E3SM-Project/E3SM/pull/5478/commits/fe44772cbe4deb313106a3a8f81156da4d1ae2d0
On pm-cpu/gnu (gcc 11.2): Have to add zm_conv.F90 to NOOPT to ensure BFB with master even when features are turned off.
Testing against baselines: answers changed for all cases with EAM EXCEPT those with MMF.
When new features are ON:
pm-cpu/nvidia ERS test failed: ERS_D.ne4pg2_oQU480.F2010.pm-cpu_nvidia.eam-zm_enhancements. A fix was made.
On chrysalis/intel: have to reduce opt for two files (zm_conv.F90 and zm_microphysics.F90) from -O3 to -O2, or RESTOM was changed significantly (compared to a baseline on compy).
Testing against baselines: no change to answers?
These problems were found in testing before PR #5478 was merged to master.