ESCOMP / CTSM

Community Terrestrial Systems Model (includes the Community Land Model of CESM)
http://www.cesm.ucar.edu/models/cesm2.0/land/

Nag compiler mpi-serial tests are failing on Izumi after the hardware rebuild #2861

Open ekluzek opened 2 weeks ago

ekluzek commented 2 weeks ago

Brief summary of bug

The Nag compiler mpi-serial tests are now failing on Izumi after the hardware rebuild. The cause is that the executable cannot find the Nag NetCDF shared library at runtime.

General bug information

CTSM version you are using: ctsm5.2.009 (likely applies to all other baselines before it as well)

Does this bug cause significantly incorrect results in the model's science? No

Configurations affected: nag mpi-serial cases (use mvapich for MPI to get around this)

Details of bug

The Nag compiler mpi-serial tests now fail when run after the hardware update. The original baselines ran fine, but the same tests now fail with a runtime error.

Here's the list of tests that now fail:

ERS_D_Ld5_Mmpi-serial.1x1_vancouverCAN.I1PtClm50SpRs.izumi_nag.clm-CLM1PTStartDate (RUN)
ERS_D_Mmpi-serial_Ld5.1x1_brazil.I2000Clm50FatesRs.izumi_nag.clm-FatesCold (RUN)
SMS_D_Ld1_Mmpi-serial.f45_f45_mg37.I2000Clm50SpRs.izumi_nag.clm-ptsRLA (RUN)
SMS_D_Mmpi-serial_Ld5.5x5_amazon.I2000Clm60FatesRs.izumi_nag.clm-FatesCold (RUN)
SMS_Ld10_D_Mmpi-serial.CLM_USRDAT.I1PtClm60SpRs.izumi_nag.clm-default--clm-NEON-TOOL (RUN)

Important output or errors that show the problem

Here's the cesm log:

cat /scratch/cluster/erik/ERS_D_Ld5_Mmpi-serial.1x1_vancouverCAN.I1PtClm50SpRs.izumi_nag.clm-CLM1PTStartDate.20241106_110939_baypw7/run/cesm.log.698815.izumi.cgd.ucar.edu.241106-121226/scratch/cluster/erik/ERS_D_Ld5_Mmpi-serial.1x1_vancouverCAN.I1PtClm50SpRs.izumi_nag.clm-CLM1PTStartDate.20241106_110939_baypw7/run/cesm.log.698815.izumi.cgd.ucar.edu.241106-121226
cat: /scratch/cluster/erik/ERS_D_Ld5_Mmpi-serial.1x1_vancouverCAN.I1PtClm50SpRs.izumi_nag.clm-CLM1PTStartDate.20241106_110939_baypw7/run/cesm.log.698815.izumi.cgd.ucar.edu.241106-121226/scratch/cluster/erik/ERS_D_Ld5_Mmpi-serial.1x1_vancouverCAN.I1PtClm50SpRs.izumi_nag.clm-CLM1PTStartDate.20241106_110939_baypw7/run/cesm.log.698815.izumi.cgd.ucar.edu.241106-121226: Not a directory
(ctsm_pylib) [erik@izumi ERS_D_Ld5_Mmpi-serial.1x1_vancouverCAN.I1PtClm50SpRs.izumi_nag.clm-CLM1PTStartDate.20241106_110939_baypw7]$ cat /scratch/cluster/erik/ERS_D_Ld5_Mmpi-serial.1x1_vancouverCAN.I1PtClm50SpRs.izumi_nag.clm-CLM1PTStartDate.20241106_110939_baypw7/run/cesm.log.698815.izumi.cgd.ucar.edu.241106-121226 
/scratch/cluster/erik/ERS_D_Ld5_Mmpi-serial.1x1_vancouverCAN.I1PtClm50SpRs.izumi_nag.clm-CLM1PTStartDate.20241106_110939_baypw7/bld/cesm.exe: error while loading shared libraries: libnetcdf.so.13: cannot open shared object file: No such file or directory
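
The loader error above names a single library, but other libraries may be unresolved as well. A minimal sketch of how one might list every unresolved shared library from `ldd` output is shown here. The function name `missing_shared_libs` and the sample text are hypothetical illustrations and not part of CTSM or CIME. Only the `libnetcdf.so.13` name comes from the log above.

```python
import re

# Matches ldd lines of the form "<tab>libname.so.N => not found".
_NOT_FOUND = re.compile(r"^\s*(\S+)\s+=>\s+not found\s*$")


def missing_shared_libs(ldd_output: str) -> list[str]:
    """Return the library names that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        match = _NOT_FOUND.match(line)
        if match:
            missing.append(match.group(1))
    return missing


# Hypothetical ldd output for illustration only; the addresses and the
# libc path are made up, and only libnetcdf.so.13 comes from the log.
sample = (
    "\tlinux-vdso.so.1 (0x00007ffd3a5f2000)\n"
    "\tlibnetcdf.so.13 => not found\n"
    "\tlibc.so.6 => /lib64/libc.so.6 (0x00007f1c2a000000)\n"
)
print(missing_shared_libs(sample))
```

On a real case, the input could come from `subprocess.run(["ldd", path_to_cesm_exe], capture_output=True, text=True).stdout`, run in the same environment as the job so the loader search path matches.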

olyson commented 2 weeks ago

@ekluzek, this test also fails for me due to the Nag NetCDF library problem. I didn't see it in your list above.

SMS_Ld10_D_Mmpi-serial.CLM_USRDAT.I1PtClm60Bgc.izumi_nag.clm-default--clm-NEON-HARV (RUN)

ekluzek commented 2 weeks ago

Hmmm, I don't see that one in my ctsm5.3.009 testing. Maybe that test is only on b4b-dev? In any case, anything that's izumi_nag with mpi-serial is going to fail, so it belongs under this issue.
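
The rule stated here (izumi_nag plus mpi-serial) can be sketched as a small filter over test names. This assumes the dot-separated layout seen in the names in this thread, with the test type and modifiers first and the machine_compiler field fourth. The helper `is_nag_mpi_serial_test` is a hypothetical name, not a CIME function.

```python
def is_nag_mpi_serial_test(test_name: str) -> bool:
    """True if the name has the Mmpi-serial modifier and targets izumi_nag.

    Assumes the layout TYPE_MODS.GRID.COMPSET.MACHINE_COMPILER[.TESTMODS],
    inferred from the test names quoted in this issue.
    """
    fields = test_name.split(".")
    if len(fields) < 4:
        return False
    # The first field is the test type followed by "_"-separated modifiers.
    modifiers = fields[0].split("_")[1:]
    return "Mmpi-serial" in modifiers and fields[3] == "izumi_nag"


tests = [
    "ERS_D_Ld5_Mmpi-serial.1x1_vancouverCAN.I1PtClm50SpRs.izumi_nag.clm-CLM1PTStartDate",
    "SMS_Ld10_D_Mmpi-serial.CLM_USRDAT.I1PtClm60Bgc.izumi_nag.clm-default--clm-NEON-HARV",
    # Hypothetical name without the mpi-serial modifier, for contrast.
    "SMS_D_Ld1.f45_f45_mg37.I2000Clm50SpRs.izumi_nag.clm-ptsRLA",
]
for name in tests:
    print(is_nag_mpi_serial_test(name), name)
```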

ekluzek commented 1 week ago

@jedwards4b has a fix to ccs_config that will solve this for the latest tags once we bring in the submodule update that includes it. So the latest dev tags will be OK.

For older tags, I worked with Joseph and we pushed a simple, safe symlink change for the netcdf mpi-serial library that allows these tests to work.

I'll do more tests with older tags to confirm, but this should handle the problem for both old and new tags.