Closed: jinyun1tang closed this issue 6 years ago.
Hi @susburrows, I am not sure whether anyone has encountered this in coupled simulations, but I do think it is a serious problem.
@jinyun1tang: the model was extremely stable when we ran the Water Cycle DECK simulations (1700+ years without a crash). But I have heard anecdotal evidence that it may no longer be as stable, and there has been some speculation that this could be due to the post-DECK bug fixes.
Do you know how long this bug has been in the code? Is it in any way related to https://github.com/E3SM-Project/E3SM/pull/2204/?
@golaz, it could be related to radiation, although the code looks physically reasonable on my reading. I hit it after a 25-year, 3-month simulation using the coupler bypass option. The first trigger was the first hour in which incoming solar radiation was nonzero and the lake absorbed 100% of the incident solar radiation, because the lake surface was frozen and covered by some residual snow. That said, the undershoot and overshoot of the Newton iteration is something I experienced with CLM4.5 several years ago, and it was never fixed.
@jinyun1tang thanks for alerting me to this. Does the model crash (or exit with an error) when this occurs in your simulations?
@susburrows, yes, it crashed with a long-wave radiation energy balance error. Beyond that, my concern is that spurious lake surface temperatures may go unnoticed in simulations that do not crash.
Thanks, @jinyun1tang. @maltrud, I wonder whether this might be the same problem you encountered in your most recent simulation(s)?
@jinyun1tang are you able to identify which versions/configurations of the model are impacted by this bug?
@susburrows, I found this issue by accident. It appeared after I rebased my code onto master, which seems to include the DECK updates to the radiation code. I then ran ICLM45BGC with the coupler bypass setup, and the code crashed. I first thought this was related to my BGC code; however, it turned out that the lake model was the trigger. I tracked it down and found a numerical instability in the Newton-Raphson iteration. I have not tried to reproduce the error with any other code base. Since my lake model code is identical to that in master, I believe this is a real bug that could show up in master at any time. Nonetheless, the problem can be fixed straightforwardly by confining each Newton step to a bounded range.
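To make the proposed fix concrete, here is a minimal sketch of a Newton-Raphson solve for a surface temperature in which each update is clipped to a fixed range. The energy balance below (absorbed solar plus downwelling long-wave, minus emitted long-wave and a linear sensible heat term) is a hypothetical toy, not the LakeFluxesMod.F90 formulation, and the names net_flux, d_net_flux, solve_t_grnd, and the forcing values are illustrative assumptions. The 20 K cap follows the value mentioned later in this thread.

```python
"""Sketch: Newton-Raphson for a surface temperature with a clamped step.

Hypothetical toy energy balance, NOT the LakeFluxesMod.F90 formulation:
    G(T) = S_abs + L_down - eps*sigma*T**4 - h*(T - T_air)
"""
SIGMA = 5.67e-8  # Stefan-Boltzmann constant [W m-2 K-4]


def net_flux(t, s_abs, l_down, t_air, eps=0.97, h=10.0):
    """Net surface energy flux [W m-2]; positive values warm the surface."""
    return s_abs + l_down - eps * SIGMA * t**4 - h * (t - t_air)


def d_net_flux(t, eps=0.97, h=10.0):
    """Analytic derivative dG/dT [W m-2 K-1]; always negative for T > 0."""
    return -4.0 * eps * SIGMA * t**3 - h


def solve_t_grnd(t0, s_abs, l_down, t_air, max_step=20.0, tol=1e-8, max_iter=50):
    """Return (temperature, steps). Each Newton update is clipped to
    [-max_step, +max_step] kelvin (hypothetical cap, 20 K by default)."""
    t = t0
    steps = []
    for _ in range(max_iter):
        g = net_flux(t, s_abs, l_down, t_air)
        step = -g / d_net_flux(t)  # plain Newton update
        # Confine the step so a single iteration cannot jump by tens or
        # hundreds of kelvin when the first guess is far from the root.
        step = max(-max_step, min(max_step, step))
        t += step
        steps.append(step)
        if abs(step) < tol:
            break
    return t, steps


if __name__ == "__main__":
    # Toy forcing: strong absorbed solar on a cold first guess.
    t, steps = solve_t_grnd(250.0, s_abs=1000.0, l_down=300.0, t_air=270.0)
    print(f"T = {t:.3f} K after {len(steps)} iterations")
```

With this toy forcing the unclipped first Newton step is roughly +96 K, while the clipped version walks toward the same root in 20 K increments. Clipping costs a few extra iterations when the first guess is far from the root, in exchange for a guarantee that no single iteration produces a wildly unphysical temperature.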
@jinyun1tang, Can you share information on how to reproduce the error? Which compset+res combination were you using, along with any additional namelist changes you made? Also, how long into the simulation you encountered this error?
@bishtgautam, I would suggest using -compset ICB1850CNPECACNTBC and -res f19_g16, and running the simulation for 25+ years. Since this is a numerical bug, which may or may not manifest as a significant numerical error, I would suggest tracking the numerical details by adding a check in LakeFluxesMod.F90 where t_grnd(c) = ax/bx is updated (line 411): test whether abs(t_grnd(c)-tgbef(c)) > 50 and write out an error message when it does. I have confined this difference to 20 K, which fixed the issue; before that, my simulation failed with a difference greater than 150 K.
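The diagnostic described above can be sketched as follows. This is a Python illustration of the logic only; the real check would be a few Fortran lines in LakeFluxesMod.F90. The function name check_t_grnd_jump and the exception class LakeTempJumpError are hypothetical, and t_grnd and tgbef are treated as plain per-column sequences in kelvin.

```python
"""Hypothetical sketch of the suggested diagnostic: after each update of
t_grnd, compare it with the previous iterate tgbef for every lake column."""

JUMP_THRESHOLD_K = 50.0  # threshold suggested in this thread


class LakeTempJumpError(RuntimeError):
    """Raised when a column's surface temperature jumps too far in one update."""


def check_t_grnd_jump(t_grnd, tgbef, threshold=JUMP_THRESHOLD_K):
    """Raise LakeTempJumpError if any column c has
    abs(t_grnd[c] - tgbef[c]) > threshold; otherwise return True."""
    bad = [
        (c, tg, tb)
        for c, (tg, tb) in enumerate(zip(t_grnd, tgbef))
        if abs(tg - tb) > threshold
    ]
    if bad:
        details = "; ".join(f"col {c}: {tb:.2f} K -> {tg:.2f} K" for c, tg, tb in bad)
        raise LakeTempJumpError(f"t_grnd jump > {threshold} K: {details}")
    return True
```

Reporting every offending column, rather than stopping at the first, makes it easier to see whether the jumps are isolated or widespread in a given time step.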
@jinyun1tang, is there any chance this also affects the water budget? The long DECK run showed lnd consistently gaining water throughout the simulation.
@jonbob, I am not entirely sure, but I tend to think not, because this concerns only lake temperature and the energy balance.
When computing the lake surface temperature in LakeFluxesMod.F90, the Newton-Raphson iteration occasionally overshoots or undershoots, producing temperatures below 230 K or above 400 K. Under particular conditions this shows up as a long-wave energy balance failure, but most of the time it goes unnoticed in the simulation. The bug was identified and fixed in my branch jinyuntang/lnd/betr_rebase, which uses identical lake model code, so the same fix should be applied to master.