HPSCTerrSys / eCLM

Fork of Community Land Model v5.0
https://hpscterrsys.github.io/eCLM/
Other
8 stars 13 forks source link

eCLM-PDAF with DEBUG: Index error in array `lnd_resume` in `lnd_comp_mct` #41

Closed jjokella closed 2 months ago

jjokella commented 2 months ago

Compiling eCLM-PDAF in TSMP2 with the option -DCMAKE_BUILD_TYPE="DEBUG" generates the following error (in at least two testcases for eCLM-PDAF):

forrtl: severe (408): fort: (3): Subscript #1 of the array LND_RESUME has value 0 which is less than the lower bound of 1

Image              PC                Routine            Line        Source
tsmp-pdaf          0000000003DF751F  Unknown               Unknown  Unknown
tsmp-pdaf          0000000000918699  lnd_comp_mct_mp_l         723  lnd_comp_mct.F90
tsmp-pdaf          0000000000911B0E  lnd_comp_mct_mp_l         275  lnd_comp_mct.F90
tsmp-pdaf          00000000005BBEFB  component_mod_mp_         267  component_mod.F90
tsmp-pdaf          000000000057F6CA  cime_comp_mod_mp_        1285  cime_comp_mod.F90
tsmp-pdaf          000000000056894B  Unknown               Unknown  Unknown
tsmp-pdaf          000000000056865B  Unknown               Unknown  Unknown
tsmp-pdaf          000000000041F024  Unknown               Unknown  Unknown
tsmp-pdaf          0000000000419B62  Unknown               Unknown  Unknown
libc-2.28.so       0000153CF6C8E7E5  __libc_start_main     Unknown  Unknown
tsmp-pdaf          0000000000419A6E  Unknown               Unknown  Unknown

Here is the code:

https://github.com/HPSCTerrSys/eCLM/blob/b88252465bd67f1392cd5c8adf97fa91a803c405/src/clm5/cpl/lnd_comp_mct.F90#L701-L734

Current idea: lnd_resume is an array that should start with index 1. However, inst_index starts at 0. Therefore, the debug-compiled version throws an error.

In the RELEASE version, the code runs through. However, it leads to strange behavior: resume_from_DA T will be output at every time step for inst_index = 0.

Adding some debug output printed the following in lnd_0000.log:

 eCLM(lnd_comp_mct): iam, lnd_resume(min(num_inst_lnd,inst_index))            0
 8^@^@^@^@^@^@^@^@^A^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@8^@^@^@^@^@^@^@8^@^@^@^@^@^@^@^@P\372\255\375^?^@^@`\246A^A^@^@^@^@^A^@^@^@^@^@^@^@^@^@^@^@\375^?^@
 ^@^@^@^@^@^@^@^@^@\332&4^A^@^@^@^@0Q\372\255\375^?^@^@`\233T^H^@^@^@^@^@S\372\255\375^?^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
 ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@\205\203 ^A^@^@@YCG^ND\300a ^A^@^@\375^?^@^@^@^@^@^@^@^@^@^@PM\372\255\375^?^@^@\200\270\356\201\220^T^@^@\310\260a^C^@^@^@^@a^@^@^@^@
 ^@^@^@@T\372\255\375^?^@^@O[c^@^@^@^@^@
eCLM(lnd_comp_mct): iam, len(lnd_resume(min(num_inst_lnd,inst_index)))          0       256
eCLM(lnd_comp_mct): iam, len_trim(lnd_resume(min(num_inst_lnd,inst_index)))          0       256
 resume_from_DA  T

This indicates to my knowledge that lnd_resume(0) is interpreted as a dummy string and therefore every time step is falsely set as an "after-DA" time step.

@kvrigor I wanted to ask your help here. I am thinking about introducing a quick fix by using inst_index + 1. Do you think this is something to take to CLM5 even? In newer versions, lnd_resume does not appear anymore, but in release_clm5, this "error" exists.

kvrigor commented 2 months ago

@jjokella: I think the issue here is that min(num_inst_lnd,inst_index) returns 0 which shouldn't be possible. Let's look closer into these two variables:

I suspect seq_comm_inst(LNDID) returns 0. Based from the code above, this happens when LNDID is outside of the range [1, ncomps]. This could be a good place to probe further. Also I'd suggest to verify if our assumptions above match with a running model.

jjokella commented 2 months ago

Hey @kvrigor , thanks for looking into the issue!

Also I'd suggest to verify if our assumptions above match with a running model.

The debug output above in lnd_0000.log is coming from a running eCLM-PDAF model compiled as RELEASE. Only when it is compiled as DEBUG, will the error show.

Regarding LNDID, I think this should be the suffix of the log and our suffixes start at 0. They are also given out in the atmospheric log atm_0000.log:

(datm_comp_init)  decomp   = 1d
(datm_comp_init)  iradsw   =        1
(datm_comp_init)  factorFn = null
(datm_comp_init)  restfilm = undefined
(datm_comp_init)  restfils = undefined
(datm_comp_init)  presaero =  T
(datm_comp_init)  force_prognostic_true =  F
(datm_comp_init)  wiso_datm   =  F
(datm_comp_init) inst_index  =         0
(datm_comp_init) inst_name   =  ATM0000
(datm_comp_init) inst_suffix =  _0000

Maybe it is just not correct to start the instances with 0? But it is such a small place, where this leads to an error, everything else runs fine with inst_index = 0.

I am checking now debug output of LNDID and inst_index around the line that sets the inst_index https://github.com/HPSCTerrSys/eCLM/blob/b88252465bd67f1392cd5c8adf97fa91a803c405/src/clm5/cpl/lnd_comp_mct.F90#L142

kvrigor commented 2 months ago

Based from old logs (1, 2), it seems to be that inst_index=1:

(datm_comp_init)  decomp   = 1d
(datm_comp_init)  iradsw   =        1
(datm_comp_init)  factorFn = null
(datm_comp_init)  restfilm = undefined
(datm_comp_init)  restfils = undefined
(datm_comp_init)  presaero =  T
(datm_comp_init)  force_prognostic_true =  F
(datm_comp_init)  wiso_datm   =  F
(datm_comp_init) inst_index  =         1
(datm_comp_init) inst_name   =  ATM
(datm_comp_init) inst_suffix =  

Could inst_index be possibly inherited from pdaf_id here?

https://github.com/HPSCTerrSys/eCLM/blob/370e56f28b111699693bed5abcc5e580fca978d0/src/csm_share/mct/seq_comm_mct.F90#L616-L620

If so, one fix could be to re-index pdaf_id starting from 1 to conform to Fortran 1-based arrays. Another approach is to increment inst_index only within the context of USE_PDAF; however this approach might be prone to other side effects.

jjokella commented 2 months ago

Ah yes, this is it, thanks @kvrigor !!

Now, I have to think about what to do. I think this should be fixed, such that pdaf_id starts at 1 as you suggest.

Only downside is that this is not backward-compatible with existing eCLM-PDAF inputs (because of the change of suffixes).

jjokella commented 2 months ago

Closing this issue here, as it should be fixed in the PDAF-interface of eCLM-PDAF.

I will currently not implement a hard-coded way of forcing pdaf_id to start at 1, as this breaks existing eCLM-PDAF setups by indexing. This issue becomes even bigger, when looking at CLM3.5-PDAF and coupled ParFlow-CLM-PDAF. In the future, the indexing may be refactored to start at 1 by default for all of these cases.

Currently, my approach is to add documentation and suggest users to let pdaf_id start at 1 using the existing input DA:startreal in enkpf.par (https://hpscterrsys.github.io/TSMP/content/setup_tsmp/input_enkfpf.html#da-startreal).