CABLE-LSM / benchcab

Tool for evaluation of CABLE land surface model
https://benchcab.readthedocs.io/en/latest/
Apache License 2.0

Artificial differences seen in fluxsite results when openmpi is loaded #304

Closed ccarouge closed 3 months ago

ccarouge commented 3 months ago

While I was working on issue #335 in CABLE, the associated benchcab simulations showed numerical precision differences in all variables for the fluxsite experiments, see here.

After investigation, it turns out this is due to loading openmpi when doing serial compilation.

Tests performed

Running benchcab with the main and #335 branches returned differences in fluxsite outputs between realisations. Running one of the tasks with serial builds of the main and #335 branches compiled outside benchcab returned no differences between the outputs. These builds used the build.bash script from the CABLE repository, and we made sure to load the same versions of the netcdf and Intel compiler modules. These outputs are identical to the outputs of main using benchcab.

It turns out that if we compile CABLE serially using the build.bash script but with an openmpi module loaded (3 versions were tested), then the #335 branch gives slightly different results from the main branch. This happens even though the compilation does not use the openmpi module directly. It is probably caused by a difference in some environment variable.
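One way to test the environment-variable hypothesis is to save the output of `env` once with and once without the openmpi module loaded, then diff the two. The sketch below is a minimal illustration using only the Python standard library. The helper names and the sample dumps (including `EXAMPLE_MPI_HOME`) are made up for the example and are not the actual environment from this issue. It also assumes every variable value fits on a single line.

```python
def parse_env(dump: str) -> dict[str, str]:
    """Parse the text output of `env` into a dict (single-line values only)."""
    env = {}
    for line in dump.splitlines():
        name, sep, value = line.partition("=")
        if sep:  # ignore lines that contain no '='
            env[name] = value
    return env


def diff_env(before: dict[str, str], after: dict[str, str]) -> dict[str, dict]:
    """Return the variables added, removed or changed between two environments."""
    return {
        "added": {k: after[k] for k in after.keys() - before.keys()},
        "removed": {k: before[k] for k in before.keys() - after.keys()},
        "changed": {
            k: (before[k], after[k])
            for k in before.keys() & after.keys()
            if before[k] != after[k]
        },
    }


# Hypothetical dumps for illustration only, not taken from the real system.
without_mpi = "PATH=/usr/bin\nFC=ifort\n"
with_mpi = "PATH=/opt/mpi/bin:/usr/bin\nFC=ifort\nEXAMPLE_MPI_HOME=/opt/mpi\n"

report = diff_env(parse_env(without_mpi), parse_env(with_mpi))
for kind, entries in report.items():
    for name in sorted(entries):
        print(kind, name, entries[name])
```

Any variable reported as added or changed is a candidate to unset or restore one at a time before recompiling, to see which one affects the build.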

What do we want to do?

This is annoying as it may result in false negative results from benchcab.

Do we want to investigate further to identify where the difference in the environment actually is? Is that useful?

Do we want to fix that in benchcab? Would that mean only loading the necessary modules at compilation time or is there another solution?
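As a rough sketch of what "only loading the necessary modules at compilation time" could look like, the snippet below composes a shell command that purges all loaded modules, loads an explicit list, and then runs the build script. This is not how benchcab currently works. The function names are hypothetical, the module names are placeholders, and it assumes the Environment Modules `module` command is available in a bash login shell.

```python
import shlex
import subprocess


def isolated_build_command(modules: list[str], build_script: str) -> str:
    """Compose a command that purges modules, loads only those listed, then builds."""
    load = " ".join(shlex.quote(m) for m in modules)
    return f"module purge && module load {load} && {shlex.quote(build_script)}"


def run_isolated_build(modules: list[str], build_script: str) -> int:
    """Run the build in a fresh login shell (assumes `module` is defined there)."""
    cmd = isolated_build_command(modules, build_script)
    return subprocess.run(["bash", "-lc", cmd], check=False).returncode


# Placeholder module names, not the versions used in this issue.
print(isolated_build_command(["intel-compiler", "netcdf"], "./build.bash"))
```

Running the build in its own shell keeps the openmpi module out of the serial compilation while leaving the caller's environment untouched.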

@SeanBryan51 @bschroeter @abhaasgoyal @Whyborn mentioning you since I'd appreciate some discussion here.

SeanBryan51 commented 3 months ago

> Running one of the tasks using a serial compilation of main and #335 branch done outside benchcab returned no differences between the outputs. These tests were done using the build.bash script from the CABLE repository and ensuring we loaded the same versions of netcdf and intel compiler modules.

I wasn't able to reproduce this - I found differences in the output when building and running CABLE outside of benchcab.

I narrowed down the differences in output to the following commit: https://github.com/CABLE-LSM/CABLE/pull/346/commits/0a69346fc8ed8cda575ce612481bf3929ddea2a3.

See here for the commit history of 335-facilitate-output-of-potential-evaporation-directly-from-the-offline-code-base.

ccarouge commented 3 months ago

This does not make any sense. The commit you highlighted changes the calculation of canopy%epot. This variable is only used for canopy%wetfac_cs which is not used in standalone, only in the coupled model. So changing the equation to calculate canopy%epot should not change the results at all in standalone!

And the potential evaporation (epot) was not part of the outputs before that branch so the only change in the output we should see is an additional variable in the file. All other variables should be the same.

SeanBryan51 commented 3 months ago

I tried running a debugger on the AU-Tum fluxsite configuration and I now seem to be getting a floating point overflow error:

forrtl: error (72): floating overflow
Image              PC                Routine            Line        Source             
cable              0000000000B357D4  Unknown               Unknown  Unknown
libpthread-2.28.s  00007FFFEC28DD20  Unknown               Unknown  Unknown
cable              0000000000717C6D  cable_canopy_modu         497  cable_canopy.F90
cable              00000000006BC2FE  cable_cbm_module_         169  cbl_model_driver_offline.F90
cable              000000000041864D  MAIN__                    798  cable_driver.F90
cable              000000000040CBA2  Unknown               Unknown  Unknown
libc-2.28.so       00007FFFEBEDF7E5  __libc_start_main     Unknown  Unknown
cable              000000000040CAAE  Unknown               Unknown  Unknown

Still not sure if this is related to the original problem, investigating further.

SeanBryan51 commented 3 months ago

Investigating the above error further, I found that the debug build and release build in the main branch are not bit reproducible in model output for fluxsite tests (tested with commit https://github.com/CABLE-LSM/CABLE/commit/860094b).

The floating point overflow errors were due to two uninitialised variables: (1) canopy%DvLitt and (2) sum_rad_gradis. Fixing (1) does not change results. Fixing (2) does change results (see https://github.com/CABLE-LSM/CABLE/issues/351).

Fixing the floating point errors restores bit reproducibility between release and debug builds. Applying the fix to commits https://github.com/CABLE-LSM/CABLE/commit/860094b and https://github.com/CABLE-LSM/CABLE/commit/0a69346fc8ed8cda575ce612481bf3929ddea2a3 and comparing them shows that the two commits now differ in model output only in the PotEvap variable, which is expected.
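The kind of check described here (all shared variables bit identical, with only an expected extra variable in the new output) can be sketched as below. This is an illustration only, not the comparison benchcab performs. The outputs are plain dicts of made-up values rather than netCDF files, and the function names and the `Qle` and `Qh` entries are placeholders for the example.

```python
import struct


def same_bits(a: list[float], b: list[float]) -> bool:
    """True if both sequences hold bit-identical float64 values (NaN-safe)."""
    if len(a) != len(b):
        return False
    return struct.pack(f"<{len(a)}d", *a) == struct.pack(f"<{len(b)}d", *b)


def compare_outputs(ref: dict, new: dict, expected_new: set[str]) -> dict:
    """Classify differences between two outputs keyed by variable name."""
    return {
        "unexpected_new": sorted(new.keys() - ref.keys() - expected_new),
        "missing": sorted(ref.keys() - new.keys()),
        "differing": sorted(
            name
            for name in ref.keys() & new.keys()
            if not same_bits(ref[name], new[name])
        ),
    }


# Made-up values standing in for model output from the two commits.
ref = {"Qle": [101.5, 98.25], "Qh": [40.0, 42.5]}
new = {"Qle": [101.5, 98.25], "Qh": [40.0, 42.5], "PotEvap": [3.2, 3.4]}

result = compare_outputs(ref, new, expected_new={"PotEvap"})
print(result)
```

Comparing the packed bytes rather than using a tolerance means that even a last-bit difference is reported, which matches the bit reproducibility requirement discussed above.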