`LakeHydrologyMod` can produce unrealistic `snow_depth` values

quantheory commented 1 year ago

I'm posting this bug report because of an issue in ELM that likely also affects CTSM, concerning these lines of code:

https://github.com/ESCOMP/CTSM/blob/abc3c27e0710cd5c0fc3548c3ce174889fc6dd70/src/biogeophys/LakeHydrologyMod.F90#L333-L358

In the ELM version of this code, there are two problems that led to infrequent crashes of the model. First, due to an incorrect ordering of the limiters, it is possible for very small negative values of snow_depth to be produced. I believe that this does not affect CTSM due to this commit by @billsacks: https://github.com/ESCOMP/CTSM/commit/3b01f70c6775b2f22b4cae6153b79c8bf0bed9d1

However, the second issue is that if the h2osno value (renamed h2osno_no_layers in CTSM) is very small at the beginning of this section of code, it can produce unrealistically large values of snow_depth, mainly due to division by a near-zero value (which can greatly amplify the impact of small floating-point errors). This occurs very rarely in practice, but that makes it extremely annoying to debug because the error may only show up several decades into a production run. I have verified that this occurs in ELM, and it can probably also happen in the CTSM version.

Xubin Zeng has suggested a fix for this, which is to replace this line:

https://github.com/ESCOMP/CTSM/blob/abc3c27e0710cd5c0fc3548c3ce174889fc6dd70/src/biogeophys/LakeHydrologyMod.F90#L354

with

snow_depth(c) = h2osno_no_layers(c) * min (snow_depth(c)/h2osno_temp, 1._r8/50._r8)

The physical justification here is that h2osno_temp/snow_depth(c) is being treated as a snow density here, and 50 kg/m^2 is an appropriate lower limit for snow density. (In principle, a higher or lower limiting value could be used, or another method of avoiding the divide-by-near-zero. The important thing is not to use extremely small values of h2osno_temp when estimating snow density.)

wwieder commented 1 year ago

Thanks for making us aware of this issue. We'll talk it over at our weekly meeting.

billsacks commented 1 year ago

Yes, thank you @quantheory ! I know it takes extra effort to cross-check and communicate with both modeling groups, and I appreciate you making the effort to do so!

daniele-peano commented 1 year ago

We had a similar issue in the same routine, but in the previous lines:

https://github.com/ESCOMP/CTSM/blob/abc3c27e0710cd5c0fc3548c3ce174889fc6dd70/src/biogeophys/LakeHydrologyMod.F90#L290-L315

When we have a small amount of initial liquid water in soil, the evaporation become negative. We tried this fix it in this way:

      if (qflx_evap_soi(p) >= 0._r8) then
         ! for evaporation partitioning between liquid evap and ice sublimation, 
         ! use the ratio of liquid to (liquid+ice) in the top layer to determine split
         ! Since we're not limiting evap over lakes, but still can't remove more from top
         ! snow layer than there is there, create temp. limited evap_soi.
         qflx_evap_soi_lim = min(qflx_evap_soi(p), (h2osoi_liq(c,j)+h2osoi_ice(c,j))/dtime)
         qflx_ev_snow(p) = qflx_evap_soi_lim
         if ((h2osoi_liq(c,j)+h2osoi_ice(c,j)) > 0._r8) then
            ! case with both liquid and ice with very small values
            if(h2osoi_ice(c,j)<1.e-12_r8 .and. h2osoi_liq(c,j)<1.e-12_r8) then
              qflx_liqevap_from_top_layer(p) = h2osoi_liq(c,j)/dtime
              qflx_solidevap_from_top_layer(p) = h2osoi_ice(c,j)/dtime
            ! case with only ice with very small values
            else if(h2osoi_ice(c,j)<1.e-12_r8) then
              qflx_solidevap_from_top_layer(p) = h2osoi_ice(c,j)/dtime
              qflx_liqevap_from_top_layer(p) = max(qflx_evap_soi_lim - qflx_solidevap_from_top_layer(p), 0._r8)
            ! case with only liquid with very small values
            else if (h2osoi_liq(c,j)<1.e-12_r8) then
              qflx_liqevap_from_top_layer(p) = h2osoi_liq(c,j)/dtime
              qflx_solidevap_from_top_layer(p) = max(qflx_evap_soi_lim - qflx_liqevap_from_top_layer(p), 0._r8)
            ! original case
            else
              qflx_liqevap_from_top_layer(p) = max(qflx_evap_soi_lim*(h2osoi_liq(c,j)/ &
                 (h2osoi_liq(c,j)+h2osoi_ice(c,j))), 0._r8)
              qflx_solidevap_from_top_layer(p) = max(qflx_evap_soi_lim - qflx_liqevap_from_top_layer(p), 0._r8)
            end if
         else
            qflx_liqevap_from_top_layer(p) = 0._r8
            qflx_solidevap_from_top_layer(p) = 0._r8
         end if
         resid_ice = h2osoi_ice(c,j)-qflx_solidevap_from_top_layer(p)*dtime
         resid_liq = h2osoi_liq(c,j)-qflx_liqevap_from_top_layer(p)*dtime
         if (resid_ice < 0._r8 .and. resid_ice > -1.e-12_r8)then
            if(resid_liq > 1.e-12_r8) then
               qflx_solidevap_from_top_layer(p) = qflx_solidevap_from_top_layer(p) - abs(resid_ice)
               qflx_liqevap_from_top_layer(p) = qflx_liqevap_from_top_layer(p) + abs(resid_ice)
            else
               qflx_solidevap_from_top_layer(p) = qflx_solidevap_from_top_layer(p) - abs(resid_ice)
            end if
         end if
         if (resid_liq < 0._r8 .and. resid_liq > -1.e-12_r8)then
            if(resid_ice > 1.e-12_r8) then
               qflx_liqevap_from_top_layer(p) = qflx_liqevap_from_top_layer(p) - abs(resid_liq)
               qflx_solidevap_from_top_layer(p) = qflx_solidevap_from_top_layer(p) + abs(resid_liq)
            else
               qflx_liqevap_from_top_layer(p) = qflx_liqevap_from_top_layer(p) - abs(resid_liq)
            end if
         end if
      else
         ! if (t_grnd(c) < tfrz) then
         ! Causes rare blowup when thin snow layer should completely melt and has a high temp after thermal physics,
         ! but then is not eliminated in SnowHydrology because of this added frost. Also see below removal of
         ! completely melted single snow layer.
         if (t_grnd(c) < tfrz .and. t_soisno(c,j) < tfrz) then
            qflx_soliddew_to_top_layer(p) = abs(qflx_evap_soi(p))
            ! If top layer is only snow layer, SnowHydrology won't eliminate it if dew is added.
         else if (j < 0 .or. (t_grnd(c) == tfrz .and. t_soisno(c,j) == tfrz)) then
            qflx_liqdew_to_top_layer(p) = abs(qflx_evap_soi(p))
         end if
      end if

Where the variables resid_ice and resid_liq are defined as real after line 127:

real(r8) :: resid_ice, resid_liq

quantheory commented 1 year ago

@daniele-peano Thanks for the information. I will see if the ELM developers are amenable to implementing similar logic in the E3SM version.

wwieder commented 1 year ago

@olyson hasn't seen this issue in any of our simulations. Are there certain cases when you've seen this issue crop up? At the SE meeting this morning we discussed making a note in the source code noting this potential bug (to be done by @samsrabin ). But given that we have not come across this issue we're somewhat inclined to not fix this issue. @billsacks do you think this is an appropriate decision?

billsacks commented 1 year ago

That decision makes sense to me. (It's also possible that some of the snow-related rework that @swensosc and I have done over the last few years made this issue less likely to occur in CTSM, though I haven't thought carefully about that, so maybe that's just wishful thinking.)

quantheory commented 1 year ago

In ELM, this bug is only confirmed to have caused an issue for users twice ever. The only reproducer I have access to where it crashed the model, is a case where it affected only a single grid cell once, 37 years into a run. (However, that one case, combined with issues on perlmutter, held up some release testing for several weeks, so it was quite annoying when it did happen.) As a result, I can't really say much statistically about what kind of cases it affects. It is possible that some other difference in the models (e.g. different fluxes due to the higher boundary layer resolution and gustiness in EAM) increases the chance of a crash in E3SM, and in CESM such a crash would be even more unlikely. That said, it's also possible that this issue silently causes non-fatal inaccuracies more often, and so we simply haven't noticed very infrequent but large overestimates of snow depth.

I think the main argument for putting in a fix here is that while the chance of this bug causing a crash is quite low, the chance of the proposed fix itself causing problems is also quite low. But I'm not that invested in any particular outcome here; I mostly just wanted to send a heads up in case this was an issue that could affect CESM.

billsacks commented 1 year ago

That all makes a lot of sense, @quantheory . Regardless of the decision here, it's very helpful to have this issue to look back at in case it crops up as a possible issue.

ESCOMP / CTSM

`LakeHydrologyMod` can produce unrealistic `snow_depth` values #2227