Closed: ccarouge closed this issue 3 months ago.
Running one of the tasks using a serial compilation of main and #335 branch done outside benchcab returned no differences between the outputs. These tests were done using the build.bash script from the CABLE repository and ensuring we loaded the same versions of netcdf and intel compiler modules.
I wasn't able to reproduce this - I found differences in the output when building and running CABLE outside of benchcab.
I narrowed down the differences in output to the following commit: https://github.com/CABLE-LSM/CABLE/pull/346/commits/0a69346fc8ed8cda575ce612481bf3929ddea2a3.
See here for the commit history of 335-facilitate-output-of-potential-evaporation-directly-from-the-offline-code-base.
This does not make any sense. The commit you highlighted changes the calculation of canopy%epot. This variable is only used for canopy%wetfac_cs, which is not used in standalone, only in the coupled model. So changing the equation to calculate canopy%epot should not change the results at all in standalone!
And the potential evaporation (epot) was not part of the outputs before that branch so the only change in the output we should see is an additional variable in the file. All other variables should be the same.
I tried running a debugger on the AU-Tum fluxsite configuration and I now seem to be getting a floating point overflow error:
forrtl: error (72): floating overflow
Image PC Routine Line Source
cable 0000000000B357D4 Unknown Unknown Unknown
libpthread-2.28.s 00007FFFEC28DD20 Unknown Unknown Unknown
cable 0000000000717C6D cable_canopy_modu 497 cable_canopy.F90
cable 00000000006BC2FE cable_cbm_module_ 169 cbl_model_driver_offline.F90
cable 000000000041864D MAIN__ 798 cable_driver.F90
cable 000000000040CBA2 Unknown Unknown Unknown
libc-2.28.so 00007FFFEBEDF7E5 __libc_start_main Unknown Unknown
cable 000000000040CAAE Unknown Unknown Unknown
Still not sure if this is related to the original problem, investigating further.
Investigating the above error further, I found that the debug build and release build in the main branch are not bit reproducible in model output for fluxsite tests (tested with commit https://github.com/CABLE-LSM/CABLE/commit/860094b).
The floating point overflow errors were due to uninitialised variables: 1. canopy%DvLitt and 2. sum_rad_gradis. Fixing 1 does not change results. Fixing 2 does change results (see https://github.com/CABLE-LSM/CABLE/issues/351).
Fixing the floating point errors restores bit reproducibility between release and debug builds. Applying the fix to commits https://github.com/CABLE-LSM/CABLE/commit/860094b and https://github.com/CABLE-LSM/CABLE/commit/0a69346fc8ed8cda575ce612481bf3929ddea2a3 and comparing the results shows that the two commits now differ in model output only in the PotEvap variable, which is expected.
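For anyone wanting to repeat this kind of check, here is a minimal sketch of a bitwise comparison between two sets of outputs. It is not the comparison benchcab performs. It assumes each output has already been read into a plain dictionary mapping variable names to lists of floats (the real outputs are netCDF files), and the function names and numbers are made up for illustration.

```python
import struct


def bits(x: float) -> bytes:
    """Return the IEEE 754 double-precision byte pattern of x."""
    return struct.pack("<d", x)


def compare_outputs(ref: dict, new: dict) -> dict:
    """Classify each variable as identical, different, or present on one side only.

    Values are compared bit for bit, so any last-digit difference is reported.
    """
    report = {"identical": [], "different": [], "only_in_ref": [], "only_in_new": []}
    for name in sorted(set(ref) | set(new)):
        if name not in new:
            report["only_in_ref"].append(name)
        elif name not in ref:
            report["only_in_new"].append(name)
        elif len(ref[name]) == len(new[name]) and all(
            bits(a) == bits(b) for a, b in zip(ref[name], new[name])
        ):
            report["identical"].append(name)
        else:
            report["different"].append(name)
    return report


# Hypothetical values standing in for the two model outputs.
main_out = {"Qle": [101.25, 98.5], "Qh": [40.0, 42.75]}
branch_out = {"Qle": [101.25, 98.5], "Qh": [40.0, 42.75], "PotEvap": [3.5, 3.25]}

report = compare_outputs(main_out, branch_out)
print(report)
```

With these stand-in values, the expected outcome is the one described above: every shared variable is identical and PotEvap appears only in the branch output.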
When working on issue #335 in CABLE, the associated benchcab simulations returned numerical precision differences in all variables for the fluxsite experiments, see here.
After investigation, it turns out this is due to loading openmpi when doing serial compilation.
Tests performed
Running benchcab with main and the #335 branch returned differences in fluxsite outputs between realisations. Running one of the tasks using a serial compilation of main and the #335 branch done outside benchcab returned no differences between the outputs. These tests were done using the build.bash script from the CABLE repository and ensuring we loaded the same versions of netcdf and intel compiler modules. These outputs are identical to the outputs of main using benchcab.
It turns out that if we compile CABLE serially using the build.bash script but with an openmpi module loaded (3 versions were tested), then the #335 branch gives slightly different results to the main branch. This happens even though the compilation does not use the openmpi module directly; it's probably a difference in some environment variable.
What do we want to do?
This is annoying as it may result in false negative results from benchcab.
Do we want to investigate further to identify where the difference in the environment actually is? Is that useful?
Do we want to fix that in benchcab? Would that mean only loading the necessary modules at compilation time or is there another solution?
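On the first question, one low-cost way to look for the difference might be to dump the environment (for example with `env`) once with and once without the openmpi module loaded, then diff the two dumps. The sketch below assumes the dumps are available as text. The variable names and paths in the example are hypothetical and do not come from an actual module.

```python
def parse_env(dump: str) -> dict:
    """Parse `env`-style text (one NAME=value per line) into a dict.

    Simplification: values containing newlines are not handled.
    """
    env = {}
    for line in dump.splitlines():
        name, sep, value = line.partition("=")
        if sep:
            env[name] = value
    return env


def diff_env(before: dict, after: dict) -> dict:
    """List variables added, removed, or changed between two environments."""
    shared = set(before) & set(after)
    return {
        "added": sorted(set(after) - set(before)),
        "removed": sorted(set(before) - set(after)),
        "changed": sorted(k for k in shared if before[k] != after[k]),
    }


# Hypothetical dumps, for illustration only.
without_mpi = "PATH=/apps/intel/bin:/usr/bin\nFC=ifort\n"
with_mpi = (
    "PATH=/apps/openmpi/bin:/apps/intel/bin:/usr/bin\n"
    "FC=ifort\n"
    "OMPI_EXAMPLE_VAR=1\n"
)

result = diff_env(parse_env(without_mpi), parse_env(with_mpi))
print(result)
```

Any variable listed under "added" or "changed" would then be a candidate to test one at a time against the serial build.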
@SeanBryan51 @bschroeter @abhaasgoyal @Whyborn mentioning you since I'd appreciate some discussion here.