ekluzek closed this 6 months ago
I thought the mpich part of this might have come from my module environment before running a case. But it looks like that isn't the case, as both cesmdev and ncarenv seem to add mpich, at least to the LMOD_SYSTEM_DEFAULT_MODULES environment variable. Unsetting that variable beforehand doesn't help, as they both set it for you.
The module purge in env_mach_specific.xml doesn't completely clear the user's environment: any sticky modules they loaded stay loaded.
OK, I ran a production case that worked and a debug one that failed. In comparing the link step between the two I think the key difference is the PIO library here...
< -L/glade/u/apps/derecho/23.09/spack/opt/spack/parallelio/2.6.2/mpi-serial/2.3.0/oneapi/2023.2.1/mdpq/lib
---
> -L/glade/u/apps/cseg/derecho/23.09/spack/opt/spack/linux-sles15-x86_64_v3/oneapi-2023.2.1/parallelio-2.6.2-ztld6j4qg5warlaaek3eql6bo2mlq4bm/lib
The first path explicitly includes an mpi-serial directory, while the second does not. So I hacked the Makefile to use the PIO library from the working build, but that still failed. When I also hacked the Makefile to use the ESMF library from the non-debug version, I got it to work.
So using the non-debug ESMF and PIO versions allows the code to build.
Here's the diff of the hacked Makefile, showing what I did to make it work:
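The comparison of the two link steps can be sketched in a few lines of Python. This is only an illustration, not part of CIME: the helper names (`lib_dirs`, `diff_lib_dirs`, `mentions_mpi_serial`) are hypothetical, and the link lines in the example are cut down to the two PIO `-L` flags quoted above plus a made-up shared directory.

```python
import shlex


def lib_dirs(link_line):
    """Return the -L directories from a link command, in order."""
    dirs = []
    tokens = shlex.split(link_line)
    for i, tok in enumerate(tokens):
        if tok == "-L" and i + 1 < len(tokens):
            # Form: -L <dir>
            dirs.append(tokens[i + 1])
        elif tok.startswith("-L") and len(tok) > 2:
            # Form: -L<dir>
            dirs.append(tok[2:])
    return dirs


def diff_lib_dirs(working, failing):
    """Return (only_in_working, only_in_failing) lists of -L directories."""
    w, f = lib_dirs(working), lib_dirs(failing)
    return [d for d in w if d not in f], [d for d in f if d not in w]


def mentions_mpi_serial(path):
    """True if one of the path components is exactly 'mpi-serial'."""
    return "mpi-serial" in path.split("/")


if __name__ == "__main__":
    # Reduced link lines: only the PIO -L flags come from the thread;
    # /hypothetical/common/lib is a placeholder for everything shared.
    working = (
        "-L/glade/u/apps/derecho/23.09/spack/opt/spack/parallelio/2.6.2/"
        "mpi-serial/2.3.0/oneapi/2023.2.1/mdpq/lib -L/hypothetical/common/lib"
    )
    failing = (
        "-L/glade/u/apps/cseg/derecho/23.09/spack/opt/spack/"
        "linux-sles15-x86_64_v3/oneapi-2023.2.1/"
        "parallelio-2.6.2-ztld6j4qg5warlaaek3eql6bo2mlq4bm/lib "
        "-L/hypothetical/common/lib"
    )
    only_working, only_failing = diff_lib_dirs(working, failing)
    for d in only_working:
        print("working only:", d, "| mpi-serial in path:", mentions_mpi_serial(d))
    for d in only_failing:
        print("failing only:", d, "| mpi-serial in path:", mentions_mpi_serial(d))
```

Note that a path without an mpi-serial component is only a hint. It does not prove which MPI the library was built against.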
diff -c Tools/Makefile.orig Tools/Makefile
*** Tools/Makefile.orig 2023-12-02 13:12:40.356436000 -0700
--- Tools/Makefile 2023-12-02 15:21:35.517732412 -0700
***************
*** 260,265 ****
--- 260,266 ----
SLIBS += -L$(LIB_PNETCDF) -lpnetcdf
endif
+ ESMFMKFILE := /glade/u/apps/cseg/derecho/23.09/spack/opt/spack/linux-sles15-x86_64_v3/oneapi-2023.2.1/esmf-8.6.0b04-kvqb7p62vw5d6dgsbyhnh6j2esucma2t/lib/esmf.mk
# Set esmf.mk location with ESMF_LIBDIR having precedence over ESMFMKFILE
CIME_ESMFMKFILE := undefined_ESMFMKFILE
ifdef ESMFMKFILE
***************
*** 446,451 ****
--- 447,453 ----
MCT_LIBDIR=$(INSTALL_SHAREDPATH)/lib
endif
+ PIO_LIBDIR := /glade/u/apps/derecho/23.09/spack/opt/spack/parallelio/2.6.2/mpi-serial/2.3.0/oneapi/2023.2.1/mdpq/lib
ifdef PIO_LIBDIR
ifeq ($(PIO_VERSION),$(PIO_VERSION_MAJOR))
INCLDIR += -I$(PIO_INCDIR)
So it sounds like the ESMF and PIO libraries built with DEBUG on for intel must have issues and aren't really using mpi-serial. At least in the module environment, maybe?
mpi-serial is an installed module on derecho, but there is a problem with the install: mpi.mod is missing. I have opened https://github.com/NCAR/spack-derecho/issues/18 for CISL to correct that problem. This will also require a cime PR and a ccs_config PR, coming soon.
After fixing the mpi-serial install I am still getting this error:
ld: /opt/cray/pe/mpich/8.1.25/ofi/intel/19.0/lib/libmpi_intel.so.12: undefined reference to `fi_strerror@FABRIC_1.0'
ld: /opt/cray/pe/mpich/8.1.25/ofi/intel/19.0/lib/libmpi_intel.so.12: undefined reference to `fi_fabric@FABRIC_1.1'
ld: /opt/cray/pe/mpich/8.1.25/ofi/intel/19.0/lib/libmpi_intel.so.12: undefined reference to `fi_getinfo@FABRIC_1.3'
ld: /opt/cray/pe/mpich/8.1.25/ofi/intel/19.0/lib/libmpi_intel.so.12: undefined reference to `fi_dupinfo@FABRIC_1.3'
ld: /opt/cray/pe/mpich/8.1.25/ofi/intel/19.0/lib/libmpi_intel.so.12: undefined reference to `fi_freeinfo@FABRIC_1.3'
ld: /opt/cray/pe/mpich/8.1.25/ofi/intel/19.0/lib/libmpi_intel.so.12: undefined reference to `fi_version@FABRIC_1.0'
Still trying to understand why.
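One way to read the errors above is to group the unresolved symbols by the object that asks for them. The sketch below does that, assuming the `ld: <object>: undefined reference to ...` line format shown in this thread. The function names are hypothetical, and `looks_like_libfabric` is only a heuristic based on the `fi_` prefix and `FABRIC_` version tags that libfabric uses for its exported symbols.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   ld: /path/libfoo.so: undefined reference to `sym@VERSION'
UNDEF_RE = re.compile(
    r"^ld: (?P<obj>\S+): undefined reference to "
    r"[`'](?P<sym>[^@']+)(?:@(?P<ver>[^']+))?'"
)


def group_undefined(log_text):
    """Map each object file or library to the (symbol, version) pairs it lacks."""
    grouped = defaultdict(list)
    for line in log_text.splitlines():
        m = UNDEF_RE.match(line.strip())
        if m:
            grouped[m.group("obj")].append((m.group("sym"), m.group("ver")))
    return dict(grouped)


def looks_like_libfabric(sym, ver):
    """Heuristic: libfabric symbols start with fi_ and carry FABRIC_ versions."""
    return sym.startswith("fi_") and (ver or "").startswith("FABRIC_")


if __name__ == "__main__":
    # Two of the lines from the failing link step.
    lib = "/opt/cray/pe/mpich/8.1.25/ofi/intel/19.0/lib/libmpi_intel.so.12"
    log = (
        f"ld: {lib}: undefined reference to `fi_strerror@FABRIC_1.0'\n"
        f"ld: {lib}: undefined reference to `fi_getinfo@FABRIC_1.3'\n"
    )
    for obj, symbols in group_undefined(log).items():
        fabric = all(looks_like_libfabric(s, v) for s, v in symbols)
        print(obj)
        print("  unresolved:", ", ".join(s for s, _ in symbols))
        print("  all libfabric-style symbols:", fabric)
```

For the log in this thread, every unresolved symbol is requested by the Cray MPICH library itself, not by model code. That fits the observation that mpich is getting onto the link line of an mpi-serial build.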
If I simplify to SMS_Mmpi-serial.f19_g17.X.derecho_intel it works.
SMS_Mmpi-serial.f19_g17.A.derecho_intel also builds correctly.
I tried
SMS_Mmpi-serial.f19_g17.2000_DATM%CRUv7_CLM50%BGC_SICE_SOCN_SROF_SGLC_SWAV_SESP.derecho_intel
and it also fails.
I tried in CTSM with
ccs_config_cesm0.0.85 cime6.0.193
and it's still failing for me. Which set of externals did you use to get ESMCI/cime#4533 working?
Also note that DEBUG-off tests were working for me; it's the DEBUG-on tests that fail. So do the debug tests for the X and A compsets work? That is, do these two pass:
SMS_D_Mmpi-serial.f19_g17.X.derecho_intel
and
SMS_D_Mmpi-serial.f19_g17.A.derecho_intel
I tried again with the following latest externals, and it's still failing:
@@ -34,7 +34,7 @@ hash = 34723c2
required = True
[ccs_config]
-tag = ccs_config_cesm0.0.84
+tag = ccs_config_cesm0.0.87
protocol = git
repo_url = https://github.com/ESMCI/ccs_config_cesm.git
local_path = ccs_config
@@ -44,11 +44,11 @@ required = True
local_path = cime
protocol = git
repo_url = https://github.com/ESMCI/cime
-tag = cime6.0.175
+tag = cime6.0.198
required = True
[cmeps]
-tag = cmeps0.14.43
+tag = cmeps0.14.47
protocol = git
repo_url = https://github.com/ESCOMP/CMEPS.git
local_path = components/cmeps
Builds of mpi-serial cases with the intel compiler and DEBUG on are failing on Derecho in ccs_config_cesm0.0.84 with ctsm5.1.dev156-43-g84bab54dc, in what will become ctsm5.1.dev157 (https://github.com/ESCOMP/CTSM/pull/2269).
Two test cases that fail are:
ERS_D_Mmpi-serial_Ld5.1x1_brazil.I2000Clm50FatesCruRsGs.derecho_intel.clm-FatesCold
SMS_Lm3_D_Mmpi-serial.1x1_brazil.I2000Clm50FatesCruRsGs.derecho_intel.clm-FatesColdHydro
The build fails at the link step with undefined references to MPI symbols from mpich, which is odd because this is built with mpi-serial, so mpich shouldn't be anywhere in here.
I can see references for mpich in my software_env.txt for my case, which seems odd...
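A quick way to see exactly which variables mention mpich is to scan the environment dump. The sketch below assumes software_env.txt holds one `NAME=value` pair per line. That format, the `find_mentions` helper, and the sample values are all assumptions for illustration, not taken from a real case directory.

```python
def find_mentions(env_text, needle="mpich"):
    """Return (name, value) pairs whose name or value mentions `needle`."""
    hits = []
    needle = needle.lower()
    for line in env_text.splitlines():
        if "=" not in line:
            continue  # skip anything that is not a NAME=value line
        name, _, value = line.partition("=")
        if needle in name.lower() or needle in value.lower():
            hits.append((name.strip(), value.strip()))
    return hits


if __name__ == "__main__":
    # Hypothetical excerpt; real values will differ from case to case.
    sample = (
        "LMOD_SYSTEM_DEFAULT_MODULES=ncarenv:intel:cray-mpich\n"
        "HOME=/hypothetical/home\n"
        "LD_LIBRARY_PATH=/hypothetical/mpich/lib:/usr/lib64\n"
    )
    for name, value in find_mentions(sample):
        print(f"{name} -> {value}")
```

To use it on a real case, read the file contents into a string and pass that to `find_mentions`.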