CABLE-LSM / CABLE

Home to the CABLE land surface model and its documentation
https://cable.readthedocs.io/en/latest/

Floating-point divide by zero exception in `ssnow%smp_hys` computation #396

Open SeanBryan51 opened 2 months ago

SeanBryan51 commented 2 months ago

Applying a temporary fix for https://github.com/CABLE-LSM/CABLE/issues/395 and running CABLE-MPI offline (main branch, commit 95b9b5e915581b0d0ed0ed407573cb448770c7b4) with the crujra_accessN96_1h configuration results in the following divide-by-zero exception:

[gadi-cpu-clx-2663:671237:0:671237] Caught signal 8 (Floating point exception: floating-point divide by zero)
==== backtrace (tid: 671237) ====
 0 0x0000000000012d20 __funlockfile()  :0
 1 0x00000000006c2338 cable_param_module_mp_derived_parameters_()  /home/189/sb8430/cable/src/offline/cable_parameters.F90:2320
 2 0x00000000006228a2 cable_input_module_mp_load_parameters_()  /home/189/sb8430/cable/src/offline/cable_input.F90:2829
 3 0x0000000000416cd2 cable_mpimaster_mp_mpidrv_master_()  /home/189/sb8430/cable/src/offline/cable_mpimaster.F90:601
 4 0x000000000040e5fc MAIN__()  /home/189/sb8430/cable/src/offline/cable_mpidrv.F90:54
 5 0x000000000040da22 main()  ???:0
 6 0x000000000003a7e5 __libc_start_main()  ???:0
 7 0x000000000040d92e _start()  ???:0
=================================
forrtl: error (75): floating point exception
Image              PC                Routine            Line        Source             
cable-mpi          0000000000C7E474  Unknown               Unknown  Unknown
libpthread-2.28.s  0000155548397D20  Unknown               Unknown  Unknown
cable-mpi          00000000006C2338  cable_param_modul        2320  cable_parameters.F90
cable-mpi          00000000006228A2  cable_input_modul        2829  cable_input.F90
cable-mpi          0000000000416CD2  cable_mpimaster_m         601  cable_mpimaster.F90
cable-mpi          000000000040E5FC  MAIN__                     54  cable_mpidrv.F90
cable-mpi          000000000040DA22  Unknown               Unknown  Unknown
libc-2.28.so       0000155547C677E5  __libc_start_main     Unknown  Unknown
cable-mpi          000000000040D92E  Unknown               Unknown  Unknown

The exception occurs on this line of the code:

https://github.com/CABLE-LSM/CABLE/blob/95b9b5e915581b0d0ed0ed407573cb448770c7b4/src/offline/cable_parameters.F90#L2320

It looks like ssnow%ssat_hys(i,k) and ssnow%watr_hys(i,k) are both uninitialised and contain the same garbage value, so their difference is zero and the division by that difference raises the exception.
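
The failure mode can be sketched in isolation. The snippet below is a minimal illustration, not the code at line 2320: the scalar names mirror the fields named in this issue, but the normalisation expression, the placeholder values, and the zero check are all assumptions made for the example.

```fortran
program hys_divide_sketch
  implicit none
  ! Scalars stand in for ssnow%ssat_hys(i,k) and ssnow%watr_hys(i,k).
  real :: ssat_hys, watr_hys, wb, denom, frac

  ! Assumption: both fields hold the same leftover value, as observed
  ! in this issue. The value itself is arbitrary.
  ssat_hys = 1.2345e-3
  watr_hys = ssat_hys
  wb = 0.3   ! hypothetical soil moisture value

  denom = ssat_hys - watr_hys   ! exactly zero when the two fields match

  if (denom == 0.0) then
     ! With floating-point trapping enabled, dividing here would raise
     ! SIGFPE, so report the problem instead of dividing.
     print *, 'ssat_hys equals watr_hys: denominator is zero'
  else
     frac = (wb - watr_hys) / denom   ! hypothetical normalisation
     print *, 'normalised value = ', frac
  end if
end program hys_divide_sketch
```

A guard like this only hides the symptom; the underlying problem remains that the two fields are read before anything has written to them.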

Steps to reproduce (Gadi)

Apply the following patch to fix the error described in https://github.com/CABLE-LSM/CABLE/issues/395 (WARNING - this patch is untested and should not be used for work other than reproducing this issue):

diff --git a/src/offline/cable_parameters.F90 b/src/offline/cable_parameters.F90
index b6133f6..c741eaf 100644
--- a/src/offline/cable_parameters.F90
+++ b/src/offline/cable_parameters.F90
@@ -3340,11 +3340,11 @@ CONTAINS
     totdepth = 0.0
     DO is = 1, ms-1
        totdepth = totdepth + soil_zse(is) * 100.0  ! unit in centimetres
-       veg%froot(:, is) = MIN( 1.0, 1.0-veg%rootbeta(:)**totdepth )
+       veg%froot(ifmp:fmp, is) = MIN( 1.0, 1.0-veg%rootbeta(ifmp:fmp)**totdepth )
     END DO
-    veg%froot(:, ms) = 1.0 - veg%froot(:, ms-1)
+    veg%froot(ifmp:fmp, ms) = 1.0 - veg%froot(ifmp:fmp, ms-1)
     DO is = ms-1, 2, -1
-       veg%froot(:, is) = veg%froot(:, is)-veg%froot(:,is-1)
+    veg%froot(ifmp:fmp, is) = veg%froot(ifmp:fmp, is)-veg%froot(ifmp:fmp,is-1)
     END DO

   END SUBROUTINE init_veg_from_vegin

The steps to reproduce the error are the same as those described in https://github.com/CABLE-LSM/CABLE/issues/395.

SeanBryan51 commented 6 days ago

@rkutteh @ccarouge FYI this issue looks like it is related to the GW work.

Currently all ssnow%*_hys variables are uninitialised, which causes the exception. It looks like some ssnow%*_hys variables are initialised in the subroutine GWspatialParameters here:

https://github.com/CABLE-LSM/CABLE/blob/ca4c13e59aacbbe8a5fc2f09b97a04270f6a82f8/src/offline/cable_parameters.F90#L3563-L3566

Note: GWspatialParameters does not seem to initialise the ssnow%sucs_hys or ssnow%wb_hys variables.

For the next GW changes, are there plans to remove the problematic code, i.e.

https://github.com/CABLE-LSM/CABLE/blob/ca4c13e59aacbbe8a5fc2f09b97a04270f6a82f8/src/offline/cable_parameters.F90#L2296-L2312

or ensure all ssnow%*_hys variables are initialised?
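
One way to make "all ssnow%*_hys variables are initialised" checkable is to fill every field with a sentinel at allocation and test for leftovers before first use. The sketch below is only an illustration under stated assumptions: the derived type is a hypothetical stand-in for the real ssnow type, the array sizes and assigned values are made up, and the NaN sentinel is a suggestion rather than anything CABLE currently does.

```fortran
program hys_init_sketch
  use, intrinsic :: ieee_arithmetic, only: ieee_value, ieee_quiet_nan, &
                                           ieee_is_nan
  implicit none

  ! Hypothetical stand-in for the hysteresis fields of the ssnow type.
  type :: hys_fields
     real, allocatable :: smp_hys(:,:), sucs_hys(:,:), wb_hys(:,:)
     real, allocatable :: ssat_hys(:,:), watr_hys(:,:)
  end type hys_fields

  integer, parameter :: mp = 4, ms = 6   ! hypothetical patch/layer counts
  type(hys_fields) :: ssnow
  real :: nan

  nan = ieee_value(0.0, ieee_quiet_nan)

  ! Step 1: give every *_hys field a sentinel value at allocation.
  allocate(ssnow%smp_hys(mp, ms),  source=nan)
  allocate(ssnow%sucs_hys(mp, ms), source=nan)
  allocate(ssnow%wb_hys(mp, ms),   source=nan)
  allocate(ssnow%ssat_hys(mp, ms), source=nan)
  allocate(ssnow%watr_hys(mp, ms), source=nan)

  ! Step 2: stand-in for a setup routine that assigns only some fields.
  ! The numbers are placeholders, not physical defaults.
  ssnow%smp_hys  = -100.0
  ssnow%ssat_hys = 0.45
  ssnow%watr_hys = 0.05

  ! Step 3: report any field that still holds the sentinel.
  call report('smp_hys',  ssnow%smp_hys)
  call report('sucs_hys', ssnow%sucs_hys)
  call report('wb_hys',   ssnow%wb_hys)
  call report('ssat_hys', ssnow%ssat_hys)
  call report('watr_hys', ssnow%watr_hys)

contains

  subroutine report(name, field)
    character(len=*), intent(in) :: name
    real, intent(in) :: field(:,:)
    if (any(ieee_is_nan(field))) print *, name, ' is still unset'
  end subroutine report

end program hys_init_sketch
```

In this sketch only sucs_hys and wb_hys are reported, because they are the two fields the stand-in setup step never assigns.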

rkutteh commented 6 days ago

@SeanBryan51 @ccarouge As Claire already knows, I have fixed all these bugs in my GW branch that is now in the process of making its way into the trunk. My own view is to wait a bit until this process is finished (this month I think) so as to avoid reinventing the wheel. Just for the record, I had compiled my GW branch with "check all" and fixed every bug it flagged.

SeanBryan51 commented 6 days ago

@rkutteh -check and -ftrapuv are not 100% reliable in finding uninitialised vars (see this talk for more info). Runtime memory checking tools are more robust. I have been using ddt with memory debug settings enabled, which I recommend. It is easy to run CABLE with ddt on Gadi using offline debugging:

module load linaro-forge/24.0.2
ddt --offline --mem-debug=balanced mpiexec -n <NCPUS> ./cable-mpi

Happy to share more details if you are interested.