E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM

ftn-2116 compiler internal error from "optcg" of Cray compiler CCE/14.0.0 on Crusher #4997

Closed grnydawn closed 1 year ago

grnydawn commented 2 years ago

This error occurred during compilation of "E3SM/components/mpas-framework/src/core_ocean/gotm/src/turbulence/turbulence.F90"

The error message is:

ftn-2116 ftn: INTERNAL "/opt/cray/pe/cce/14.0.0/cce/x86_64/bin/optcg" was terminated due to receipt of signal 013: Segmentation fault (core dumped).

Test case name is "ERS_Ld5.T62_oQU120.CMPASO-NYF.crusher_crayclang"

Entire command line is:

cd /gpfs/alpine/cli133/proj-shared/grnydawn/e3sm_tests/crusher_crayclang_cce14_debug3/ERS_Ld5.T62_oQU120.CMPASO-NYF.crusher_crayclang.20220531_115159_5psojv/bld/cmake-bld/mpas-framework/src && python3 /gpfs/alpine/cli133/proj-shared/grnydawn/e3sm_tests/crusher_crayclang_cce14_debug3/ERS_Ld5.T62_oQU120.CMPASO-NYF.crusher_crayclang.20220531_115159_5psojv/Tools/e3sm_compile_wrap.py ftn -DCORE_OCEAN -DCPRCRAY -DEXCLUDE_INIT_MODE -DFORTRANUNDERSCORE -DLINUX -DMPAS_ESM_SHR_CONST -DMPAS_EXE_NAME="" -DMPAS_NAMELIST_SUFFIX="" -DMPAS_NO_ESMF_INIT -DMPAS_NO_LOG_REDIRECT -DMPAS_PERF_MOD_TIMERS -DNO_R16 -DOFFSET64BIT -DUSE_LAPACK -DUSE_PIO2 -D_MPI -I/gpfs/alpine/cli133/proj-shared/grnydawn/e3sm_tests/crusher_crayclang_cce14_debug3/ERS_Ld5.T62_oQU120.CMPASO-NYF.crusher_crayclang.20220531_115159_5psojv/bld/crayclang/mpich/nodebug/nothreads/mct/include -I/gpfs/alpine/cli133/proj-shared/grnydawn/e3sm_tests/crusher_crayclang_cce14_debug3/ERS_Ld5.T62_oQU120.CMPASO-NYF.crusher_crayclang.20220531_115159_5psojv/bld/crayclang/mpich/nodebug/nothreads/mct/mct/noesmf/c1a1l1i1o1r1g1w1i1e1/csm_share -I/gpfs/alpine/cli133/proj-shared/grnydawn/e3sm_tests/crusher_crayclang_cce14_debug3/ERS_Ld5.T62_oQU120.CMPASO-NYF.crusher_crayclang.20220531_115159_5psojv/bld/crayclang/mpich/nodebug/nothreads/mct/pio -I/opt/cray/pe/parallel-netcdf/1.12.1.7/crayclang/10.0/include -I/autofs/nccs-svm1_home1/grnydawn/repos/github/E3SM/components/mpas-framework/src/external/ezxml -I/gpfs/alpine/cli133/proj-shared/grnydawn/e3sm_tests/crusher_crayclang_cce14_debug3/ERS_Ld5.T62_oQU120.CMPASO-NYF.crusher_crayclang.20220531_115159_5psojv/bld/cmake-bld/framework -I/gpfs/alpine/cli133/proj-shared/grnydawn/e3sm_tests/crusher_crayclang_cce14_debug3/ERS_Ld5.T62_oQU120.CMPASO-NYF.crusher_crayclang.20220531_115159_5psojv/bld/cmake-bld/operators -I/opt/cray/pe/netcdf-hdf5parallel/4.8.1.1/crayclang/10.0/include 
-I/gpfs/alpine/cli133/proj-shared/grnydawn/e3sm_tests/crusher_crayclang_cce14_debug3/ERS_Ld5.T62_oQU120.CMPASO-NYF.crusher_crayclang.20220531_115159_5psojv/bld/cmake-bld/core_ocean/shared -I/autofs/nccs-svm1_home1/grnydawn/repos/github/E3SM/components/mpas-framework/src/core_ocean/gotm/include -f free -N 255 -h byteswapio -em -M1077 -em -J. -c /autofs/nccs-svm1_home1/grnydawn/repos/github/E3SM/components/mpas-framework/src/core_ocean/gotm/src/turbulence/turbulence.F90 -o CMakeFiles/ocn.dir/core_ocean/gotm/src/turbulence/turbulence.F90.o

sarats commented 2 years ago

Try adding turbulence.F90 to the NOOPT list.

grnydawn commented 2 years ago

I will try this and share the result here.

sarats commented 2 years ago

Status: this error is encountered for 10 tests in the e3sm_integration test suite, and the build error occurs in core_ocean/shared/mpas_ocn_config.f90. Reproduced by HPE folks. Known workaround: turn off interprocedural analysis (IPA) optimization.
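As a rough illustration of the workaround, the sketch below appends the Cray Fortran option `-h ipa0` (IPA level 0) to a set of flags and prints the resulting compile line as a dry run. The helper name `noipa_flags` and the shortened flag list are hypothetical; how the flag is actually wired into the E3SM build system is not shown in this thread.

```shell
# Hypothetical helper: append the CCE option that disables
# interprocedural analysis to an existing flag string.
noipa_flags() {
    printf '%s -h ipa0\n' "$1"
}

# Dry run: print the compile line instead of invoking ftn.
# The flag list is abbreviated from the command line quoted in this issue.
echo "ftn $(noipa_flags "-f free -N 255 -h byteswapio") -c mpas_ocn_config.f90"
```

Printing the command first makes it easy to confirm the flag lands where expected before touching the real build configuration.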

Details at https://docs.google.com/spreadsheets/d/1I0J9zUXCJufdnlxLSkXaww-w81ME4RLI_0cxKRh6a5U/edit#gid=1084235759

Reproducing: use the https://github.com/E3SM-Project/E3SM/tree/ykim/crusher/craydebug branch.
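The reproduction commands rely on several shell variables that are not defined in this comment. A minimal sketch of setting them is below. `crusher` and `crayclang` are inferred from the test-name suffix `crusher_crayclang`; the checkout path, project name, and output directory are placeholders to replace with your own values.

```shell
# Placeholder values; only MACH and COMP are inferred from the test names.
E3SM_HOME="${E3SM_HOME:-$HOME/E3SM}"       # assumed location of the E3SM checkout
MACH="${MACH:-crusher}"                    # machine, from "crusher_crayclang"
COMP="${COMP:-crayclang}"                  # compiler, from "crusher_crayclang"
ACCOUNT="${ACCOUNT:-abc123}"               # hypothetical project/allocation name
OUTDIR="${OUTDIR:-$HOME/e3sm_tests}"       # hypothetical test output root

echo "machine=${MACH} compiler=${COMP} project=${ACCOUNT} output=${OUTDIR}"
```

Each assignment keeps a value already present in the environment and only falls back to the placeholder when the variable is unset or empty.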

cd ${E3SM_HOME}/cime/scripts

TESTCASE="ERS.f09_g16.I1850ELMCN.crusher_crayclang ERS.f19_f19.I1850ELMCN.crusher_crayclang"

./create_test ${TESTCASE} --machine ${MACH} --compiler ${COMP} --project ${ACCOUNT} --output-root ${OUTDIR}

abbotts commented 2 years ago

I see an HPE ticket for OLCFDEV-977, which I think is this issue. Could we confirm this is OLCFDEV-977? If so, I'll make a note in our internal ticket and keep an eye on it.

grnydawn commented 2 years ago

@abbotts Yes, this issue is the same as OLCFDEV-977.

abbotts commented 1 year ago

@grnydawn, we think this issue should be fixed with CCE 15.0.0, which is available on Crusher now.

grnydawn commented 1 year ago

@abbotts, @sarats: CCE 15.0.0 does seem to resolve this issue, as the test case above passed with CCE 15. I will run more test cases with CCE 15 and will close this issue if I don't see it anymore.

grnydawn commented 1 year ago

This issue is resolved in CCE 15. Closing the issue.
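For anyone scripting around this, a small guard like the sketch below could decide whether the workaround is still worth applying. It treats any CCE major version below 15 as possibly affected, which is an assumption: this thread only confirms the failure on 14.0.0 and the fix on 15.0.0. The function name is hypothetical, and `CRAY_FTN_VERSION` is assumed to be set by the loaded `cce` module; if it is not, pass the version in some other way.

```shell
# Hypothetical guard: succeeds (exit status 0) when the given CCE version
# has a major number below 15, i.e. predates the release reported as fixed.
cce_needs_workaround() {
    major="${1%%.*}"          # strip everything after the first dot
    [ "$major" -lt 15 ]
}

# Assumption: the cce module exports CRAY_FTN_VERSION; default to 14.0.0 otherwise.
version="${CRAY_FTN_VERSION:-14.0.0}"
if cce_needs_workaround "$version"; then
    echo "CCE $version: keep the workaround"
else
    echo "CCE $version: workaround should not be needed"
fi
```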