CABLE-LSM / benchcab

Tool for evaluation of CABLE land surface model
https://benchcab.readthedocs.io/en/latest/
Apache License 2.0

CABLE linked against incorrect libraries when run from the hh5 conda environment #279

Closed SeanBryan51 closed 5 months ago

SeanBryan51 commented 5 months ago

CABLE is linked against incorrect libraries for netcdf and MPI when running benchcab (v4.0.2) from the hh5 conda environment.

The following behaviour only occurs when running from the hh5 conda environment and not when running from the benchcab-dev environment.

Steps to reproduce:

module use /g/data/hh5/public/modules
module load conda/analysis3-unstable
git clone https://github.com/CABLE-LSM/bench_example.git
cd bench_example
cat > config.yaml << EOL
project: $PROJECT
realisations:
  - repo:
      git:
        branch: main
        commit: 46830b3773f1932680af158cab27ae223fd8685a
fluxsite:
  experiment: AU-Tum
modules: [
  intel-compiler/2021.1.1,
  netcdf/4.7.4,
  openmpi/4.1.0
]
EOL
benchcab checkout -v && benchcab build --mpi -v

Running the above causes the build to fail when compiling the MPI executable:

[ 96%] Building Fortran object CMakeFiles/cable-mpi.dir/src/science/pop/pop_mpi.F90.o
/scratch/tm70/sb8430/bench_example/src/main/src/science/pop/pop_mpi.F90(25): error #7013: This module file was not generated by any release of this compiler.   [MPI]
    USE MPI
--------^
...
compilation aborted for /scratch/tm70/sb8430/bench_example/src/main/src/science/pop/pop_mpi.F90 (code 1)

Output from CMake shows that we are linking against an MPI library found in the conda environment:

-- Found MPI_Fortran: /g/data/hh5/public/apps/miniconda3/envs/analysis3-24.01/lib/libmpi_usempif08.so (found version "3.1")

when the MPI_Fortran path should instead be pointing to /apps/openmpi/4.1.0/lib/....

The serial executable compiles successfully but is linked against the netcdf-fortran library found in the conda environment:

$ ldd src/main/bin/cable | grep netcdff
    libnetcdff.so.7 => /g/data/hh5/public/apps/miniconda3/envs/analysis3-24.01/lib/libnetcdff.so.7 (0x00007fe324c17000)

when this should point to /apps/netcdf/4.7.4/lib/....
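
The ldd check above can be automated. The following is a minimal sketch, not part of benchcab: the helper names `libs_under_prefix` and `check_executable` are hypothetical. It assumes the usual `name => path (address)` layout of ldd output and reports every shared library resolved under a given prefix, such as the conda environment directory.

```python
import subprocess


def libs_under_prefix(ldd_output, prefix):
    """Return {library name: resolved path} for libraries resolved under prefix.

    Assumes the usual ldd line layout: 'name => /path/to/lib (0xaddress)'.
    """
    found = {}
    for line in ldd_output.splitlines():
        name, arrow, rest = line.strip().partition(" => ")
        if not arrow:
            # Lines without ' => ' (e.g. linux-vdso) carry no resolved path.
            continue
        parts = rest.split()
        if parts and parts[0].startswith(prefix):
            found[name] = parts[0]
    return found


def check_executable(path, prefix):
    """Run ldd on an executable and report libraries resolved under prefix."""
    result = subprocess.run(
        ["ldd", path], check=True, capture_output=True, text=True
    )
    return libs_under_prefix(result.stdout, prefix)
```

For example, `check_executable("src/main/bin/cable", "/g/data/hh5")` would be expected to return an empty dict for a correctly linked executable.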

Running the serial executable crashes due to undefined symbols from the netcdf library:

module use /g/data/hh5/public/modules
module load conda/analysis3-unstable
git clone https://github.com/CABLE-LSM/bench_example.git
cd bench_example
cat > config.yaml << EOL
project: $PROJECT
realisations:
  - repo:
      git:
        branch: main
        commit: 46830b3773f1932680af158cab27ae223fd8685a
fluxsite:
  experiment: AU-Tum
modules: [
  intel-compiler/2021.1.1,
  netcdf/4.7.4,
  openmpi/4.1.0
]
EOL
benchcab fluxsite -v

The PBS job script outputs:

2024-04-11 12:02:06,251 - DEBUG - fluxsite.fluxsite.py:242 - Error: CABLE returned an error for task AU-Tum_2002-2017_OzFlux_Met_R0_S0

Inspecting the standard output from CABLE:

$ cat runs/fluxsite/tasks/AU-Tum_2002-2017_OzFlux_Met_R0_S0/out.txt
./cable: symbol lookup error: ./cable: undefined symbol: netcdf_mp_nf90_inquire_variable_

ccarouge commented 5 months ago

Sounds like when we think of a release of CABLE, we may want to release benchcab independently of hh5 environments... Still need to think on this one but this is annoying.

SeanBryan51 commented 5 months ago

The issue is due to environment variables being set which affect the behaviour of the build, notably LDFLAGS and CMAKE_PREFIX_PATH:

$ module load conda/analysis3-unstable
$ echo $LDFLAGS
-Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,--disable-new-dtags -Wl,--gc-sections -Wl,--allow-shlib-undefined -Wl,-rpath,/g/data/hh5/public/apps/miniconda3/envs/analysis3-24.01/lib -Wl,-rpath-link,/g/data/hh5/public/apps/miniconda3/envs/analysis3-24.01/lib -L/g/data/hh5/public/apps/miniconda3/envs/analysis3-24.01/lib
$ echo $CMAKE_PREFIX_PATH
/g/data/hh5/public/apps/miniconda3/envs/analysis3-24.01:/g/data/hh5/public/apps/miniconda3/envs/analysis3-24.01/x86_64-conda-linux-gnu/sysroot/usr

A quick fix would be to unset these variables before invoking CMake.
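
As a rough illustration of that quick fix, the sketch below builds a copy of the environment without the two variables and passes it to `subprocess.run` through its `env` argument. The helper names (`clean_build_env`, `run_build_command`) are hypothetical and do not come from the benchcab source. The command passed in is a placeholder for whatever invokes CMake.

```python
import os
import subprocess
import sys

# Variables identified in this issue as affecting the build.
VARS_TO_UNSET = ("LDFLAGS", "CMAKE_PREFIX_PATH")


def clean_build_env(environ=None, unset=VARS_TO_UNSET):
    """Return a copy of the environment without the given variables."""
    source = os.environ if environ is None else environ
    return {key: value for key, value in source.items() if key not in unset}


def run_build_command(cmd, environ=None):
    """Run a build command (a list of arguments) with the cleaned environment.

    The parent process environment is left untouched; only the child
    process sees the reduced set of variables.
    """
    return subprocess.run(
        cmd,
        env=clean_build_env(environ),
        check=True,
        capture_output=True,
        text=True,
    )
```

Building a new dict rather than deleting keys from `os.environ` keeps the change local to the build step, so the rest of the process still sees the conda environment.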

SeanBryan51 commented 5 months ago

@dsroberts I noticed there are other environment variables being set when loading conda/analysis3-unstable which may impact build systems (e.g. CC, CFLAGS, CPPFLAGS, ...). It seems strange that these variables are being exported to the user environment. Do you know where these variables are coming from?

dsroberts commented 5 months ago

Hi @SeanBryan51, yep. These come from the environment activation script for gcc_linux-64, found here: /g/data/hh5/public/apps/miniconda3/envs/analysis3-24.01/etc/conda/activate.d/activate-gcc_linux-64.sh. gcc_linux-64 is brought in by parcels, which is a dependency of some COSIMA recipes, so it can't just be removed. The conda module works by running conda activate in a 'blank' environment and parsing the output of the env command. There is some level of filtering, but I'm not sure we can assume that no one ever wants to build against the analysis3 environments. What you've done with passing env to subprocess.run is probably the most sensible solution, though I think rather than removing the LDFLAGS and CMAKE_PREFIX_PATH environment variables entirely, you could create copies of them with references to /g/data/hh5/... removed, then pass those to subprocess.run.
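
A minimal sketch of that filtering approach follows. It assumes LDFLAGS is a whitespace-separated list of flags and CMAKE_PREFIX_PATH is a colon-separated list of directories, as in the output shown earlier in this thread. The function names and the `HH5_PREFIX` constant are hypothetical, not taken from benchcab.

```python
import os

# Assumed prefix for all paths belonging to the hh5 conda environments.
HH5_PREFIX = "/g/data/hh5"


def strip_flags(value, prefix=HH5_PREFIX):
    """Drop whitespace-separated flags (as in LDFLAGS) that mention prefix."""
    return " ".join(flag for flag in value.split() if prefix not in flag)


def strip_paths(value, prefix=HH5_PREFIX, sep=":"):
    """Drop entries of a path list (as in CMAKE_PREFIX_PATH) under prefix."""
    return sep.join(
        path for path in value.split(sep) if path and not path.startswith(prefix)
    )


def filtered_env(environ=None, prefix=HH5_PREFIX):
    """Return a copy of the environment with hh5 references removed
    from LDFLAGS and CMAKE_PREFIX_PATH; other variables are unchanged."""
    env = dict(os.environ if environ is None else environ)
    if "LDFLAGS" in env:
        env["LDFLAGS"] = strip_flags(env["LDFLAGS"], prefix)
    if "CMAKE_PREFIX_PATH" in env:
        env["CMAKE_PREFIX_PATH"] = strip_paths(env["CMAKE_PREFIX_PATH"], prefix)
    return env
```

The result of `filtered_env()` can be passed as the `env` argument of `subprocess.run`. Flags and paths that do not reference the prefix are preserved, so any linker options a user sets themselves still reach CMake.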