ESMCI / ccs_config_cesm

CESM CIME Case Control System configuration files

Builds of mpi-serial case with intel and DEBUG on are failing on Derecho #130

Closed ekluzek closed 6 months ago

ekluzek commented 7 months ago

Builds of mpi-serial cases with the intel compiler and DEBUG on are failing on Derecho with ccs_config_cesm0.0.84 and ctsm5.1.dev156-43-g84bab54dc, in what will become ctsm5.1.dev157 (https://github.com/ESCOMP/CTSM/pull/2269).

Two test cases that fail are:

ERS_D_Mmpi-serial_Ld5.1x1_brazil.I2000Clm50FatesCruRsGs.derecho_intel.clm-FatesCold
SMS_Lm3_D_Mmpi-serial.1x1_brazil.I2000Clm50FatesCruRsGs.derecho_intel.clm-FatesColdHydro

The build fails at the link step, as shown below, with undefined references coming from the mpich library. That is odd because this case is built with mpi-serial, so mpich shouldn't be involved at all.

model_only is True
         - Building atm Library
Building atm with output to /glade/derecho/scratch/erik/tests_ctsm51d155derechofs/ERS_D_Mmpi-serial_Ld5.1x1_brazil.I2000Clm50FatesCruRsGs.derecho_intel.clm-FatesCold.GC.ctsm51d155derechofs_int/bld/atm.bldlog.231201-010530
datm built in 0.957645 seconds
Building cesm from /glade/work/erik/ctsm_worktrees/external_updates/components/cmeps/cime_config/buildexe with output to /glade/derecho/scratch/erik/tests_ctsm51d155derechofs/ERS_D_Mmpi-serial_Ld5.1x1_brazil.I2000Clm50FatesCruRsGs.derecho_intel.clm-FatesCold.GC.ctsm51d155derechofs_int/bld/cesm.bldlog.231201-010530
Component cesm exe build complete with 43 warnings
Building test for ERS in directory /glade/derecho/scratch/erik/tests_ctsm51d155derechofs/ERS_D_Mmpi-serial_Ld5.1x1_brazil.I2000Clm50FatesCruRsGs.derecho_intel.clm-FatesCold.GC.ctsm51d155derechofs_int
ld: /opt/cray/pe/mpich/8.1.25/ofi/intel/19.0/lib/libmpi_intel.so.12: undefined reference to `fi_strerror@FABRIC_1.0'
ld: /opt/cray/pe/mpich/8.1.25/ofi/intel/19.0/lib/libmpi_intel.so.12: undefined reference to `fi_fabric@FABRIC_1.1'
ld: /opt/cray/pe/mpich/8.1.25/ofi/intel/19.0/lib/libmpi_intel.so.12: undefined reference to `fi_getinfo@FABRIC_1.3'
ld: /opt/cray/pe/mpich/8.1.25/ofi/intel/19.0/lib/libmpi_intel.so.12: undefined reference to `fi_dupinfo@FABRIC_1.3'
ld: /opt/cray/pe/mpich/8.1.25/ofi/intel/19.0/lib/libmpi_intel.so.12: undefined reference to `fi_freeinfo@FABRIC_1.3'
ld: /opt/cray/pe/mpich/8.1.25/ofi/intel/19.0/lib/libmpi_intel.so.12: undefined reference to `fi_version@FABRIC_1.0'

I can see references to mpich in the software_environment.txt for my case, which seems odd:

software_environment.txt:LMOD_SYSTEM_DEFAULT_MODULES=ncarenv/23.09:craype/2.7.23:intel/2023.2.1:ncarcompilers/1.0.0:cray-mpich/8.1.27:netcdf/4.9.2
software_environment.txt:PBS_O_PATH=/glade/u/apps/derecho/23.06/spack/opt/spack/netcdf/4.9.2/oneapi/2023.0.0/iijr/bin:/glade/u/apps/derecho/23.06/spack/opt/spack/hdf5/1.12.2/oneapi/2023.0.0/d6xa/bin:/glade/u/apps/derecho/23.06/spack/opt/spack/ncarcompilers/1.0.0/oneapi/2023.0.0/ec7b/bin/mpi:/opt/cray/pe/pals/1.2.11/bin:/opt/cray/libfabric/1.15.2.0/bin:/opt/cray/pe/mpich/8.1.25/ofi/intel/19.0/bin:/opt/cray/pe/mpich/8.1.25/bin:/glade/u/apps/derecho/23.06/spack/opt/spack/ncarcompilers/1.0.0/oneapi/2023.0.0/ec7b/bin:/glade/u/apps/common/23.04/spack/opt/spack/intel-oneapi-compilers/2023.0.0/compiler/2023.0.0/linux/lib/oclfpga/bin:/glade/u/apps/common/23.04/spack/opt/spack/intel-oneapi-compilers/2023.0.0/compiler/2023.0.0/linux/bin/intel64:/glade/u/apps/common/23.04/spack/opt/spack/intel-oneapi-compilers/2023.0.0/compiler/2023.0.0/linux/bin:/opt/cray/pe/craype/2.7.20/bin:/glade/u/apps/derecho/23.06/opt/utils/bin:/opt/clmgr/sbin:/opt/clmgr/bin:/opt/sgi/sbin:/opt/sgi/bin:/glade/u/home/erik/bin:/usr/sbin:/opt/c3/bin:/usr/lib/mit/bin:/usr/lib/mit/sbin:/opt/pbs/bin:/glade/u/apps/derecho/23.06/opt/bin:/usr/local/bin:/usr/bin:/sbin:/bin:/opt/cray/pe/bin
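A quick way to see which entries of a PATH-style value from software_environment.txt point at mpich or libfabric is to split the value on colons and filter. The following is a minimal Python sketch, not part of CIME; the function name `find_entries` and the shortened sample value (a few entries taken from the PBS_O_PATH line above) are illustrative assumptions.

```python
def find_entries(path_value, needles=("mpich", "libfabric")):
    """Return the colon-separated entries that mention any needle.

    Matching is a plain case-insensitive substring test, which is
    enough for spotting stray MPI directories in a PATH-like value.
    """
    hits = []
    for entry in path_value.split(":"):
        lowered = entry.lower()
        if any(needle in lowered for needle in needles):
            hits.append(entry)
    return hits


# Shortened sample: a handful of entries from the PBS_O_PATH value above.
sample = (
    "/opt/cray/pe/pals/1.2.11/bin"
    ":/opt/cray/libfabric/1.15.2.0/bin"
    ":/opt/cray/pe/mpich/8.1.25/ofi/intel/19.0/bin"
    ":/opt/cray/pe/mpich/8.1.25/bin"
    ":/usr/bin"
)

for entry in find_entries(sample):
    print(entry)
```

The same function can be pointed at any other colon-separated variable in the file, such as LMOD_SYSTEM_DEFAULT_MODULES, by passing that value instead.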
ekluzek commented 7 months ago

I thought the mpich entries might have come from the modules I had loaded before running the case. But that doesn't seem to be it, as both cesmdev and ncarenv seem to add mpich, at least to the LMOD_SYSTEM_DEFAULT_MODULES environment variable. Unsetting that variable beforehand doesn't help, since they both set it for you.

The module purge in env_mach_specific.xml doesn't completely clear the user's environment: modules the user loaded that are sticky stay loaded.

ekluzek commented 7 months ago

OK, I ran a production case that worked and a debug one that failed. In comparing the link step between the two I think the key difference is the PIO library here...

< -L/glade/u/apps/derecho/23.09/spack/opt/spack/parallelio/2.6.2/mpi-serial/2.3.0/oneapi/2023.2.1/mdpq/lib
---
> -L/glade/u/apps/cseg/derecho/23.09/spack/opt/spack/linux-sles15-x86_64_v3/oneapi-2023.2.1/parallelio-2.6.2-ztld6j4qg5warlaaek3eql6bo2mlq4bm/lib

The first path includes an explicit mpi-serial directory, while the second does not. So I hacked the Makefile to use the PIO library from the working build, but that still failed. When I also hacked the Makefile to use the ESMF library from the non-debug build, it worked.

So using the non-debug ESMF and PIO versions allows the code to build.

Here's the diff of the hacked Makefile to show what I did to make it work:
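The comparison of link steps described above can be automated by pulling the `-L` directories out of each link command and reporting the ones that appear on only one side. This is a minimal Python sketch under stated assumptions: the helper names `lib_dirs` and `diff_lib_dirs` are made up for illustration, and the sample commands are hypothetical (the `ftn -o cesm.exe` prefix and `/shared/lib` are placeholders); only the two PIO paths come from the diff above.

```python
import shlex


def lib_dirs(link_command):
    """Return the -L directories of a link command, in order.

    Handles both the joined form (-L/path) and the split form (-L /path).
    """
    tokens = shlex.split(link_command)
    dirs = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token == "-L" and i + 1 < len(tokens):
            dirs.append(tokens[i + 1])
            i += 2
            continue
        if token.startswith("-L") and len(token) > 2:
            dirs.append(token[2:])
        i += 1
    return dirs


def diff_lib_dirs(working, failing):
    """Return (only_in_working, only_in_failing) library directories."""
    w, f = lib_dirs(working), lib_dirs(failing)
    return [d for d in w if d not in f], [d for d in f if d not in w]


# Hypothetical link commands; only the PIO paths are from the issue.
working = (
    "ftn -o cesm.exe -L/shared/lib "
    "-L/glade/u/apps/derecho/23.09/spack/opt/spack/parallelio/2.6.2/"
    "mpi-serial/2.3.0/oneapi/2023.2.1/mdpq/lib"
)
failing = (
    "ftn -o cesm.exe -L/shared/lib "
    "-L/glade/u/apps/cseg/derecho/23.09/spack/opt/spack/"
    "linux-sles15-x86_64_v3/oneapi-2023.2.1/"
    "parallelio-2.6.2-ztld6j4qg5warlaaek3eql6bo2mlq4bm/lib"
)

only_working, only_failing = diff_lib_dirs(working, failing)
print("only in working build:", only_working)
print("only in failing build:", only_failing)
```

Directories shared by both builds drop out, so what remains is the short list of libraries worth inspecting, here the two PIO installs.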

 diff -c Tools/Makefile.orig Tools/Makefile
*** Tools/Makefile.orig 2023-12-02 13:12:40.356436000 -0700
--- Tools/Makefile      2023-12-02 15:21:35.517732412 -0700
***************
*** 260,265 ****
--- 260,266 ----
     SLIBS += -L$(LIB_PNETCDF) -lpnetcdf
  endif

+ ESMFMKFILE := /glade/u/apps/cseg/derecho/23.09/spack/opt/spack/linux-sles15-x86_64_v3/oneapi-2023.2.1/esmf-8.6.0b04-kvqb7p62vw5d6dgsbyhnh6j2esucma2t/lib/esmf.mk
  # Set esmf.mk location with ESMF_LIBDIR having precedence over ESMFMKFILE
  CIME_ESMFMKFILE := undefined_ESMFMKFILE
  ifdef ESMFMKFILE
***************
*** 446,451 ****
--- 447,453 ----
    MCT_LIBDIR=$(INSTALL_SHAREDPATH)/lib
  endif

+ PIO_LIBDIR := /glade/u/apps/derecho/23.09/spack/opt/spack/parallelio/2.6.2/mpi-serial/2.3.0/oneapi/2023.2.1/mdpq/lib
  ifdef PIO_LIBDIR
    ifeq ($(PIO_VERSION),$(PIO_VERSION_MAJOR))
      INCLDIR += -I$(PIO_INCDIR)

So it sounds like the ESMF and PIO libraries selected for intel with DEBUG on have a problem and aren't really using mpi-serial, at least as provided through the module environment.

jedwards4b commented 7 months ago

mpi-serial is an installed module on Derecho, but the install is broken: mpi.mod is missing. I have opened https://github.com/NCAR/spack-derecho/issues/18 for CISL to correct that problem. This will also require a cime PR and a ccs_config PR, coming soon.

jedwards4b commented 7 months ago

After fixing the mpi-serial install I am still getting the error:

ld: /opt/cray/pe/mpich/8.1.25/ofi/intel/19.0/lib/libmpi_intel.so.12: undefined reference to `fi_strerror@FABRIC_1.0'
ld: /opt/cray/pe/mpich/8.1.25/ofi/intel/19.0/lib/libmpi_intel.so.12: undefined reference to `fi_fabric@FABRIC_1.1'
ld: /opt/cray/pe/mpich/8.1.25/ofi/intel/19.0/lib/libmpi_intel.so.12: undefined reference to `fi_getinfo@FABRIC_1.3'
ld: /opt/cray/pe/mpich/8.1.25/ofi/intel/19.0/lib/libmpi_intel.so.12: undefined reference to `fi_dupinfo@FABRIC_1.3'
ld: /opt/cray/pe/mpich/8.1.25/ofi/intel/19.0/lib/libmpi_intel.so.12: undefined reference to `fi_freeinfo@FABRIC_1.3'
ld: /opt/cray/pe/mpich/8.1.25/ofi/intel/19.0/lib/libmpi_intel.so.12: undefined reference to `fi_version@FABRIC_1.0'

Still trying to understand why.

jedwards4b commented 7 months ago

If I simplify to SMS_Mmpi-serial.f19_g17.X.derecho_intel it works.
SMS_Mmpi-serial.f19_g17.A.derecho_intel also builds correctly. But when I tried SMS_Mmpi-serial.f19_g17.2000_DATM%CRUv7_CLM50%BGC_SICE_SOCN_SROF_SGLC_SWAV_SESP.derecho_intel, it failed.

ekluzek commented 7 months ago

I tried in CTSM with

ccs_config_cesm0.0.85
cime6.0.193

and it's still failing for me. What set of externals did you use in ESMCI/cime#4533 that you got to work?

Also note that DEBUG-off tests were working for me; it's the DEBUG-on tests that fail. So do the debug tests for the X and A compsets work? That is, do these two pass:

SMS_D_Mmpi-serial.f19_g17.X.derecho_intel
SMS_D_Mmpi-serial.f19_g17.A.derecho_intel
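Since the distinction here hinges on the `_D` modifier in the test name, it can help to split test names into their fields and check for it mechanically. This is a simplified Python sketch, not CIME's own parser; `parse_test_name` is a hypothetical helper, and it assumes the dot-separated layout seen in the names in this thread (test type plus modifiers, grid, compset, machine_compiler, optional test mods) with no dots inside the grid or compset.

```python
def parse_test_name(name):
    """Split a test name into its fields (simplified, see assumptions)."""
    fields = name.split(".")
    if len(fields) < 4:
        raise ValueError(f"expected at least 4 dot-separated fields: {name!r}")
    head, grid, compset, mach_comp = fields[:4]
    # Anything after the machine_compiler field is treated as test mods.
    testmods = ".".join(fields[4:]) or None
    test_type, *modifiers = head.split("_")
    # The compiler is taken to be everything after the last underscore.
    machine, compiler = mach_comp.rsplit("_", 1)
    # An M<name> modifier selects the MPI library, e.g. Mmpi-serial.
    mpilib = next((m[1:] for m in modifiers if m.startswith("M")), None)
    return {
        "test_type": test_type,
        "modifiers": modifiers,
        "debug": "D" in modifiers,
        "mpilib": mpilib,
        "grid": grid,
        "compset": compset,
        "machine": machine,
        "compiler": compiler,
        "testmods": testmods,
    }


names = [
    "SMS_Mmpi-serial.f19_g17.X.derecho_intel",
    "SMS_D_Mmpi-serial.f19_g17.X.derecho_intel",
    "ERS_D_Mmpi-serial_Ld5.1x1_brazil.I2000Clm50FatesCruRsGs"
    ".derecho_intel.clm-FatesCold",
]
for name in names:
    info = parse_test_name(name)
    print(info["test_type"], info["compset"], "debug:", info["debug"])
```

Grouping a list of passing and failing tests by the `debug` and `mpilib` fields makes it easier to confirm whether failures track the DEBUG setting rather than the compset.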

ekluzek commented 6 months ago

I tried again with the following updated externals and it's still failing:

@@ -34,7 +34,7 @@ hash = 34723c2
 required = True

 [ccs_config]
-tag = ccs_config_cesm0.0.84
+tag = ccs_config_cesm0.0.87
 protocol = git
 repo_url = https://github.com/ESMCI/ccs_config_cesm.git
 local_path = ccs_config
@@ -44,11 +44,11 @@ required = True
 local_path = cime
 protocol = git
 repo_url = https://github.com/ESMCI/cime
-tag = cime6.0.175
+tag = cime6.0.198
 required = True

 [cmeps]
-tag = cmeps0.14.43
+tag = cmeps0.14.47
 protocol = git
 repo_url = https://github.com/ESCOMP/CMEPS.git
 local_path = components/cmeps