CABLE-LSM / CABLE

Home to the CABLE land surface model and its documentation
https://cable.readthedocs.io/en/latest/

CABLE offline spatial runs are not reproducible #463

Open SeanBryan51 opened 2 weeks ago

SeanBryan51 commented 2 weeks ago

Currently, for MPI configurations (run via `benchcab spatial`), running the same version of CABLE against itself does not always produce bitwise-identical output.

Benchcab currently tests four science configurations for each given CABLE version, labelled S0, S1, S2 and S3. Sometimes one or more of these configurations reproduces the same results bitwise; however, it is unlikely that all configurations reproduce reliably.

Where differences occur, many variables have relative differences greater than 10% throughout the time series.

My guess is that an uninitialised memory access somewhere (e.g. #395, #396, #397) is causing non-deterministic behaviour. Currently the MPI executable crashes when it is run under ddt with the balanced memory debugging settings enabled.

Steps to reproduce (Gadi):

CABLE version used: main c125ede1eb9e7881e8bce72563992e1d43c685fe

Benchcab version used: 4.1.0

  1. Clone the bench_example repository into /scratch:
  2. Change directory into bench_example and set the configuration file as follows:

    cat << EOF > config.yaml
    realisations:
      - repo:
          git:
            branch: main
      - repo:
          git:
            branch: main
        name: main-2

    modules: [ intel-compiler/2021.1.1, netcdf/4.7.4, openmpi/4.1.0 ]
    EOF

  3. Load hh5 modules and run `benchcab spatial`:

    module load conda/analysis3-24.04
    benchcab spatial

  4. Wait for CABLE jobs to finish.
  5. Load `nccmp` and compare outputs:

    module load nccmp
    nccmp -d runs/spatial/tasks/crujra_access_R_S0/archive/output000/cable_out.nc
    nccmp -d runs/spatial/tasks/crujra_access_R_S1/archive/output000/cable_out.nc
    nccmp -d runs/spatial/tasks/crujra_access_R_S2/archive/output000/cable_out.nc
    nccmp -d runs/spatial/tasks/crujra_access_R_S3/archive/output000/cable_out.nc
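The comparison in the last step can be scripted as a loop over the four science configurations. The following is only a rough sketch, not part of benchcab: the function name `compare_configs`, its two task-name prefix arguments, and the `TASKS_DIR` and `COMPARE_CMD` variables are all hypothetical names introduced here for illustration. It assumes the comparison command (by default `nccmp -d`) exits with a non-zero status when the two files differ.

```shell
# Sketch: compare cable_out.nc between two realisations for S0..S3.
# ASSUMPTIONS (not from the issue): the task-name prefixes passed as $1 and $2,
# and the TASKS_DIR / COMPARE_CMD variables, are hypothetical conveniences.
# The body runs in a subshell "( ... )" so its variables do not leak out.
compare_configs() (
    prefix_a=$1
    prefix_b=$2
    tasks_dir=${TASKS_DIR:-runs/spatial/tasks}
    compare_cmd=${COMPARE_CMD:-nccmp -d}
    n_diff=0

    for cfg in S0 S1 S2 S3; do
        file_a=$tasks_dir/${prefix_a}_${cfg}/archive/output000/cable_out.nc
        file_b=$tasks_dir/${prefix_b}_${cfg}/archive/output000/cable_out.nc
        # $compare_cmd is deliberately unquoted so it splits into command + flags.
        # Any failure of the command (including a missing file) counts as DIFFERS.
        if $compare_cmd "$file_a" "$file_b" > /dev/null 2>&1; then
            echo "$cfg: identical"
        else
            echo "$cfg: DIFFERS"
            n_diff=$((n_diff + 1))
        fi
    done

    # Exit status is 0 only when every configuration reproduced.
    [ "$n_diff" -eq 0 ]
)
```

After `module load nccmp`, this would be called from the bench_example directory as `compare_configs PREFIX_A PREFIX_B`, where the two placeholders stand for the task-name prefixes of the two realisations. The one-line-per-configuration summary makes it easier to see which of S0 to S3 failed to reproduce across repeated runs.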