ESCOMP / FMS

CESM fork of FMS
GNU Lesser General Public License v3.0
intel-oneapi/2024.0.2 ICE #3

Closed jedwards4b closed 3 weeks ago

jedwards4b commented 4 weeks ago

Opening an issue here to track the problem. It first appeared in a CESM build using the dev/ncar_0.0.3 tag, but since I can reproduce it on master, that is what I am reporting against.

Here is a reproducer:

Currently Loaded Modules:
  1) cesmdev/1.0 (H,S)
  2) ncarenv/23.09 (S)
  3) craype/2.7.31
  4) cmake/3.26.3
  5) intel-oneapi/2024.0.2
  6) mkl/2024.0.0
  7) ncarcompilers/1.0.0
  8) cray-mpich/8.1.27
  9) hdf5-mpi/1.14.3
 10) netcdf-mpi/4.9.2
 11) parallel-netcdf/1.12.3
 12) parallelio/2.6.2
 13) esmf/8.6.0

Where:
  S: Module is Sticky, requires --force to unload or purge
  H: Hidden Module

git clone https://github.com/ESCOMP/FMS
mkdir bld
cd bld
cmake ../FMS
make

cmake ../FMS/
-- The C compiler identification is IntelLLVM 2024.0.2
-- The Fortran compiler identification is IntelLLVM 2024.0.2
-- Cray Programming Environment 2.7.31 C
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /glade/u/apps/derecho/23.09/spack/opt/spack/ncarcompilers/1.0.0/oneapi/2024.0.2/3szf/bin/icx - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Cray Programming Environment 2.7.31 Fortran
-- Detecting Fortran compiler ABI info
-- Detecting Fortran compiler ABI info - done
-- Check for working Fortran compiler: /glade/u/apps/derecho/23.09/spack/opt/spack/ncarcompilers/1.0.0/oneapi/2024.0.2/3szf/bin/ifx - skipped
-- Setting build type to 'Release' as none was specified.
-- Found MPI_C: /glade/u/apps/derecho/23.09/spack/opt/spack/ncarcompilers/1.0.0/oneapi/2024.0.2/3szf/bin/icx (found version "3.1")
-- Found MPI_Fortran: /glade/u/apps/derecho/23.09/spack/opt/spack/ncarcompilers/1.0.0/oneapi/2024.0.2/3szf/bin/ifx (found version "3.1")
-- Found MPI: TRUE (found version "3.1") found components: C Fortran
-- Found NetCDF: /glade/u/apps/derecho/23.09/spack/opt/spack/netcdf/4.9.2/cray-mpich/8.1.27/oneapi/2024.0.2/fmc5/include (found version "4.9.2") found components: C Fortran
-- FindNetCDF defines targets:
-- - NetCDF_VERSION [4.9.2]
-- - NetCDF_PARALLEL [TRUE]
-- - NetCDF_C_CONFIG_EXECUTABLE [/glade/u/apps/derecho/23.09/spack/opt/spack/netcdf/4.9.2/cray-mpich/8.1.27/oneapi/2024.0.2/fmc5/bin/nc-config]
-- - NetCDF::NetCDF_C [SHARED] [Root: /glade/u/apps/derecho/23.09/spack/opt/spack/netcdf/4.9.2/cray-mpich/8.1.27/oneapi/2024.0.2/fmc5] Lib: /glade/u/apps/derecho/23.09/spack/opt/spack/netcdf/4.9.2/cray-mpich/8.1.27/oneapi/2024.0.2/fmc5/lib/libnetcdf.so
-- - NetCDF_Fortran_CONFIG_EXECUTABLE [/glade/u/apps/derecho/23.09/spack/opt/spack/netcdf/4.9.2/cray-mpich/8.1.27/oneapi/2024.0.2/fmc5/bin/nf-config]
-- - NetCDF::NetCDF_Fortran [SHARED] [Root: /glade/u/apps/derecho/23.09/spack/opt/spack/netcdf/4.9.2/cray-mpich/8.1.27/oneapi/2024.0.2/fmc5] Lib: /glade/u/apps/derecho/23.09/spack/opt/spack/netcdf/4.9.2/cray-mpich/8.1.27/oneapi/2024.0.2/fmc5/lib/libnetcdff.so
-- Looking for gettid
-- Looking for gettid - found
-- Configuring done (14.2s)
-- Generating done (1.0s)
-- Build files have been written to: /glade/derecho/scratch/jedwards/fmsbug/bld
derecho3: /glade/derecho/scratch/jedwards/fmsbug/bld :) make
[ 1%] Building C object CMakeFiles/fms_r4_c.dir/affinity/affinity.c.o
[ 2%] Building C object CMakeFiles/fms_r4_c.dir/fms/fms_stacksize.c.o
[ 3%] Building C object CMakeFiles/fms_r4_c.dir/mosaic/create_xgrid.c.o
/glade/derecho/scratch/jedwards/fmsbug/FMS/mosaic/create_xgrid.c:42:1: warning: '/*' within block comment [-Wcomment]
   42 | /***
      | ^
1 warning generated.
[ 4%] Building C object CMakeFiles/fms_r4_c.dir/mosaic/gradient_c2l.c.o
[ 5%] Building C object CMakeFiles/fms_r4_c.dir/mosaic/interp.c.o
[ 6%] Building C object CMakeFiles/fms_r4_c.dir/mosaic/mosaic_util.c.o
[ 7%] Building C object CMakeFiles/fms_r4_c.dir/mosaic/read_mosaic.c.o
[ 8%] Building C object CMakeFiles/fms_r4_c.dir/mpp/mpp_memuse.c.o
[ 9%] Building C object CMakeFiles/fms_r4_c.dir/parser/yaml_parser_binding.c.o
[ 10%] Building C object CMakeFiles/fms_r4_c.dir/parser/yaml_output_functions.c.o
[ 11%] Building C object CMakeFiles/fms_r4_c.dir/string_utils/fms_string_utils_binding.c.o
[ 11%] Built target fms_r4_c
[ 12%] Building Fortran object CMakeFiles/fms_r4_f.dir/platform/platform.F90.o
Using 8-byte addressing
Using pure routines.
Using allocatable derived type array members.
Using cray pointers.
[ 13%] Building Fortran object CMakeFiles/fms_r4_f.dir/mpp/mpp_parameter.F90.o
[ 14%] Building Fortran object CMakeFiles/fms_r4_f.dir/mpp/mpp_data.F90.o
[ 15%] Building Fortran object CMakeFiles/fms_r4_f.dir/mpp/mpp.F90.o
[ 16%] Building Fortran object CMakeFiles/fms_r4_f.dir/constants/fmsconstants.F90.o
[ 17%] Building Fortran object CMakeFiles/fms_r4_f.dir/constants/constants.F90.o
[ 18%] Building Fortran object CMakeFiles/fms_r4_f.dir/string_utils/fms_string_utils.F90.o
[ 19%] Building Fortran object CMakeFiles/fms_r4_f.dir/mpp/mpp_efp.F90.o
[ 20%] Building Fortran object CMakeFiles/fms_r4_f.dir/mpp/mpp_memutils.F90.o
[ 21%] Building Fortran object CMakeFiles/fms_r4_f.dir/mpp/mpp_domains.F90.o
[ 22%] Building Fortran object CMakeFiles/fms_r4_f.dir/fms2_io/fms_io_utils.F90.o
[ 23%] Building Fortran object CMakeFiles/fms_r4_f.dir/fms2_io/netcdf_io.F90.o

      #0 0x000000000232d4ea
      #1 0x0000000002394d07
      #2 0x0000000002394e30
      #3 0x00007f568f22fd50
      #4 0x00000000028270fa
      #5 0x000000000282243c
      #6 0x0000000002821a05
      #7 0x000000000281e9af
      #8 0x00000000027db423
      #9 0x00000000026a36dd
     #10 0x00000000026aa9df
     #11 0x00000000026a4e49
     #12 0x00000000022c9ffb
     #13 0x00000000022c7c67
     #14 0x0000000002270263
     #15 0x0000000002452dbe
     #16 0x00007f568f21a29d __libc_start_main + 239
     #17 0x00000000020ab129

/glade/derecho/scratch/jedwards/tmp/ifx1589109924PeUwvK/ifxGFgVwM.i90: error #5633: Internal compiler error: segmentation violation signal raised
Please report this error along with the circumstances in which it occurred in a Software Problem Report.
Note: File and line given may not be explicit cause of this error.
compilation aborted for /glade/derecho/scratch/jedwards/fmsbug/FMS/fms2_io/netcdf_io.F90 (code 3)
make[2]: *** [CMakeFiles/fms_r4_f.dir/build.make:699: CMakeFiles/fms_r4_f.dir/fms2_io/netcdf_io.F90.o] Error 3
make[1]: *** [CMakeFiles/Makefile2:113: CMakeFiles/fms_r4_f.dir/all] Error 2
make: *** [Makefile:136: all] Error 2

jedwards4b commented 4 weeks ago

I found a workaround for CESM, in Depends.intel-oneapi:

# FMS objects that ICE with -O2
REDUCED_OPT_OBJS=\
netcdf_io.o \
fms_netcdf_domain_io.o \
fms_netcdf_unstructured_domain_io.o

$(REDUCED_OPT_OBJS): %.o: %.F90
	$(FC) -c $(INCLDIR) $(INCS) $(FFLAGS) $(FREEFLAGS) -O0 $<

dphow commented 4 weeks ago

Copying from the HPC Jira ticket response

Was this a recent observation that originated in the latest commit, i.e. https://github.com/ESCOMP/FMS/commit/18cb810fbb313609c2d769015c03d3f968fb3ecf

or has this likely been an ongoing problem when using the intel-oneapi LLVM compilers?

I see the suggested workaround is to compile with -O0 (i.e. no optimization). Was there a previous commit where -O2 did work with intel-oneapi? Have you tried any other oneAPI versions to test this as well?

I am not sure to what extent the Intel compiler folks will want a report that uses the whole model as a reproducer. If there is a smaller test case, or if the issue lies solely within a netCDF interface, as the error currently suggests, then perhaps we can share a smaller MRE (minimal reproducible example) with them?
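
One possible way to get closer to a smaller reproducer, sketched here under assumptions rather than taken from this thread: rebuild with the CMake Makefile generator's `VERBOSE=1` option so that the full `ifx` command line is logged, then pull out the line that compiles `netcdf_io.F90`. The helper name `extract_compile_cmd` and the log file name `build.log` are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the last command in a verbose build log that
# compiles the given source file. Assumes the log was captured with
# something like
#   make VERBOSE=1 > build.log 2>&1
# in the bld directory of the reproducer.
extract_compile_cmd() {
    log_file=$1      # path to the captured build log
    source_name=$2   # source file name, e.g. netcdf_io.F90
    # Keep only lines that pass "-c <dir>/<source_name> " to the compiler;
    # the leading "/" avoids matching files whose names merely end the same way.
    grep -- "-c .*/${source_name} " "$log_file" | tail -n 1
}

# Example call (assumes build.log exists in the current directory):
#   extract_compile_cmd build.log netcdf_io.F90
```

Running the extracted command by hand in the build directory, and then removing flags or trimming the source, is one route toward a test case small enough to send to Intel.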

That is assuming no code change can be explicitly identified as the cause. Nonetheless, these types of errors are often best reported to the Intel compiler folks. Hopefully, we can find better workarounds in the interim.

jedwards4b commented 4 weeks ago

@dphow It's been an ongoing problem; the same issue occurs in intel-oneapi/2023.2.1. This is not an entire model, it is one support library.

dphow commented 3 weeks ago

Copying here from Jira ticket...

Can you try the new intel/2024.2.1 module to see if this issue remains in this updated version?
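
A minimal sketch of such a retest, assuming an Lmod environment like the module list in the original report. The helper name `fresh_build` and the build directory name are hypothetical, and whether a plain `module swap` is enough on this system (or whether dependent modules need reloading) is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: configure and build in a fresh directory so that no
# object files or CMake cache entries from the old compiler are reused.
# $1 = absolute path to the FMS checkout, $2 = new build directory.
fresh_build() {
    src_dir=$1
    build_dir=$2
    mkdir -p "$build_dir" || return 1
    # Run in a subshell so the caller's working directory is unchanged.
    ( cd "$build_dir" && cmake "$src_dir" && make )
}

# Possible usage on the system from the report (assumed, not run here):
#   module swap intel-oneapi/2024.0.2 intel/2024.2.1
#   fresh_build "$PWD/FMS" "$PWD/bld-2024.2.1"
```

A fresh directory matters here because CMake caches the detected compiler, so reusing `bld` could silently keep building with the old `ifx`.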

We thought that newer compilers released by Intel may have already addressed this, as in this past issue of theirs: https://community.intel.com/t5/Intel-Fortran-Compiler/Internal-compiler-error-segmentation-violation-signal-raised-WRF/td-p/1575801

jedwards4b commented 3 weeks ago

@dphow It does seem to have solved the problem. Thanks!