E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM
Other
348 stars 358 forks source link

Non-bfb restarts bug due to use of cflx(:,1) (or qflx) in clubb_intr.F90 #1040

Closed singhbalwinder closed 8 years ago

singhbalwinder commented 8 years ago

We recently found out that using cflx (:,1) (i.e. qflx) is causing the model to produce non-BFB restarts. I looked into the code and found out that following is the reason for this behavior:

We recently introduced the following "if block" for fixing the last non-BFB restart bug issue #906:

                ! initialize constituent surface fluxes to zero
      cam_in(c)%cflx(:,:) = 0._r8

      if (overwrite_flds) then
           ! Prior to this change, "overwrite_flds" was always .true. therefore wsx and wsy were always updated.
           ! Now, overwrite_flds is .false. for the first time step of the restart run. Move wsx and wsy out of
           ! this if-condition so that they are still updated everytime irrespective of the value of overwrite_flds.

           ! Move lhf to this if-block so that it is not overwritten to ensure BFB restarts when qneg4 correction
           ! occurs at the restart time step
           ! Modified by Wuyin Lin
           cam_in(c)%shf(i)    = -x2a(index_x2a_Faxx_sen, ig)
           cam_in(c)%cflx(i,1) = -x2a(index_x2a_Faxx_evap,ig)
           cam_in(c)%lhf(i)    = -x2a(index_x2a_Faxx_lat, ig)
       endif

Value of “overwrite_flds” is .false. for the first restart time step so the physics sees cflx(:,1) as zero for the first restart time step. The values of cflx(i,1) are non-zero (expected values) on the subsequent steps in the restart run. This causes results to differ in the restart run.

As expected, if we remove cflx from the above if condition, the restarts are again non-BFB due to the qneg4 update issue (last non-BFB restart bug- issue #906). I verified this by commenting out qneg4 call AND removing “cam_in(c)%cflx(i,1) = -x2a(index_x2a_Faxx_evap,ig)” line from the if block and adding it after the “endif”. Model has BFB restarts if I do that.

There are several ways to fix this but I am not very familiar with this part of the codes so I am not sure if those alternatives can break anything else in the code. Several of those alternatives are discussed in issue #906. Fixing this is critical and we would like to fix this bug ASAP. Tagging @mt5555 @wlin7 @worleyph @rljacob

singhbalwinder commented 8 years ago

@philrasch suggested me to test @kaizhangpnl branch (PR #1041 ) if that fixes this restart issue and also solves the previous restart issue (#906 ). I will give that a try.

mt5555 commented 8 years ago

Is it correct that Kai's conservation fix for QNEG3 and QNEG4 dont break BFB restarts, but it's only the conservation fixes in CLUBB that break BFB restart?

And, (I'll try and check the code to determine this), is the conservation fix in CLUBB only applied to cflx(:,1)? If it is applied to other fluxes, like cflx(:,2), then they will have to be added to the restart file and also moved to the if (overwrite_flds) loop above.

mt5555 commented 8 years ago

It lools like the new routine, qqflx_fixer(), will modify cflx(:,i_wv) (edit: i_wv = 1, so removing my text about i_wv > 1 restart bug)

But I'm confused by the clfx(:,:)=0 line. On a non-restart run, that's ok because cflx() will be set from the coupler below. But on a restart run, we want to use cflx from the restart file, and setting it to zero would overwrite the values of cflx() read in from the restart file.

kaizhangpnl commented 8 years ago

Hi Mark, qqflx_fixer will only modify cflx (qflx) when the water vapor in the whole column is not sufficient to compensate the negative qflx. This never happens in my simulations. Balwinder suggested to add a call endrun instead of correcting qflx like qneg4 and letting the model go.

singhbalwinder commented 8 years ago

Is it correct that Kai's conservation fix for QNEG3 and QNEG4 dont break BFB restarts, but it's only the conservation fixes in CLUBB that break BFB restart?

  • I just found out that Kai's fixes also break the restart test. The reason is the same. His fixes use qflx on the restart time step, which is zero. If I remove "cam_in(c)%cflx(i,1) = -x2a(index_x2a_Faxx_evap,ig)" from the "if(overwrite_flds)" block and add it after "endif", Kai fixes restart BFB.

And, (I'll try and check the code to determine this), is the conservation fix in CLUBB only applied to cflx(:,1)? If it is applied to other fluxes, like cflx(:,2), then they will have to be added to the restart file and >also moved to the if (overwrite_flds) loop above.

-cflx(:,1) is only used in clubb_intr.F90, it is not modified there. None of the other cflx members (clfx(:,2) etc.) are being used of modified in clubb_intr.F90 for the qflx bug fix ( issue #1032).

So the issue is: cflx(:,1) is zero on the first restart time step and if atm codes try to use clfx(:,1), we get non-BFB restarts.

mt5555 commented 8 years ago

Can you try changing the cflx line to:

cam_in(c)%cflx(:,2:pcnst) = 0._r8

For cflx(:,1), it should be read from the restart file, or set via the coupler (x2a code). So I dont understand why it is being iniitalized to zero.

singhbalwinder commented 8 years ago

But I'm confused by the clfx(:,:)=0 line. On a non-restart run, that's ok because cflx() will be set from the coupler below. But on a restart run, we want to use cflx from the restart file, and setting it to zero would overwrite the values of cflx() read in from the restart file.

This line was present in the original code we started with (cesm1_3_beta10) and we didn't remove it when we added code to fix issue #906 . Do you think it should be removed?

singhbalwinder commented 8 years ago

Can you try changing the cflx line to:

cam_in(c)%cflx(:,2:pcnst) = 0._r8

For cflx(:,1), it should be read from the restart file, or set via the coupler (x2a code). So I dont understand why it is being iniitalized to zero.

I will give that a try.

mt5555 commented 8 years ago

I'm not sure, but I suspect it is a bug. In the original code, we always set cflx(:,1) = x2a..., so the bug wasn't triggered.

But there's a 50% chance I just dont understand the code :-)

Also, perhaps @wlin7 could confirm that cflx(:,1) is actually written and read from the restart file? I'm not familiar with the restart code.

mt5555 commented 8 years ago

@kaizhangpnl : is it possible that cflx(:,1) is usually zero when calling tphysbc?

because it seems like when we restart, (before all the recent restart bugs), the code we are discussing here will initialized cflx(:,1) to zero, and then call cam_run1, including tphsybc(), and then enter the usual CAM timstepping loop.

wlin7 commented 8 years ago

Hi @singhbalwinder , @mt5555 , @kaizhangpnl ,

Sorry joining in late. I believe setting "cam_in(c)%cflx(:,2:pcnst) = 0._r8" as suggested by Mark will fix it all. Moving the cflx(i,1)= -x2a(index_x2a_Faxx_evap,ig) outside of the if-block would compromise the original fix for qneg4. Explanation below:

  1. The original qneg4-restart fix was purposely not to override the fluxes during the first step of restart run, which makes a difference if qneg4 occurs during the restart step. I think in your current test, it appeared to be BFB-restart likely because QNEG4 (or qqflx_fixer) did not occur (please check the cesm.log in your run). I remember while we were dealing with the qneg4-restart issue on AV1C-03, you found the then new compset AV1C-04 did not have the problem. We still need this fix in place because QNEG4 (or qqflx_fixer) might still occur at a restart step.
  2. The cam_in(c)%cflx(:,:) = 0._r8 meant to have zero surface fluxes for all tracers unless there are fluxes from the coupler for certain constituents.
  3. With the new change to only set clfx(:,2:pcnst), cflx(:,1) will keep the values from restart file during restart_init (and not to be overrided in the if-block); and take the values from the coupler for all other situations.
  4. Your are correct cflx(:,1) was not used during the first step of restart run before the qlfx fix in clubb_intr.F90. lhf and shf were involved and that is why the original qneg4-restart fix has been working. Since cflx is now involved, zeroing the whole cam_in(c)%clfx(:,:) becomes a problem.
mt5555 commented 8 years ago

@wlin7 : thanks, that's great!

wlin7 commented 8 years ago

@mt5555 , cflx(:,1) is rouhgly lhf/L all the time. Except for cold spinup, they are typically non-zeroes.

singhbalwinder commented 8 years ago

Thanks @mt5555 and @wlin7 ! I tested Mark's fix and it works! I will create a PR to fix this bug.

mt5555 commented 8 years ago

@wlin7 : could it be that at the start of tphysbc(), cflx(:,1)=0, but then inside tphysbc() is when cflx(:,1) is computed and becomes roughly lhf/L ?

singhbalwinder commented 8 years ago

I don't think cflx(:,1) is ever modified in physpkg. except for qneg4 but I am not 100% sure.

wlin7 commented 8 years ago

I am on the same page with @singhbalwinder . With the design of the coupler, the surface fluxes are set to be handled by the other components then exchanged thru the coupler. In terms of the codes, the whole cam_in (inc. cflx) is passed into tphysbc with intent(IN).

philrasch commented 8 years ago

I dont understand why we are ever zeroing out cflx for a restart, unless it is to prevent use of an uninitialized variable. cflx should contain the surface fluxes calculated from either the atmosphere, or from another model component, e.g. provided by the coupler. Typically I have viewed cflx as an "input" variable to the atmosphere, but there were times where we have read information directly into that field within the atmospheric model, rather than providing it through the coupler.

Anyway, the confusion sets in when QNEG4 started modifying cflx, because under those circumstances, some of the information comes from the coupler, and others come from the atmospheric model (e.g. QNEG4). Under these circumstances, it becomes critical to make sure that the atmospheric model components that use cflx see the value that has been modified by both components. Otherwise, we just have to make sure that the something appropriately initializes cflx at some point before it is used. I would prefer it if we initialized cflx to a NaN and trap for it, so if the right model component does not set it, we know about it.

The same sort of comments ought to apply to other fields in the "cam_in" variable (data type cam_in_t)

kaizhangpnl commented 8 years ago

I see that cflx is read from the restart file in restart_physics.F90

712 do m = 1, pcnst 713 714 write(num,'(i4.4)') m 715 716 !!XXgoldyXX: This hack should be replaced with the PIO interface 717 !err_handling = File%iosystem%error_handling !! Hack 718 call pio_seterrorhandling(File, PIO_BCAST_ERROR, err_handling) 719 ierr = pio_inq_varid(File, 'CFLX'//num, vardesc) 720 call pio_seterrorhandling(File, err_handling) 721 722 if (ierr == PIO_NOERR) then ! CFLX variable found on restart file 723 call pio_read_darray(File, vardesc, iodesc, tmpfield2, ierr) 724 do c= begchunk, endchunk 725 do i = 1, pcols 726 cam_in(c)%cflx(i,m) = tmpfield2(i, c) 727 end do 728 end do 729 end if 730 731 end do 732

mt5555 commented 8 years ago

Thanks @kaizhangpnl for verifying that.

I agree with Phil that is strange that CFLX is set to zero on a restart run - I cant understand how the model was doing BFB exact restarts the past few months (since we fixed the QNEG3 restart bug), but we know it was. Is it possible that CFLX is not used by tphysbc(), but only tphysac()? (on a restart run, CFLX will be retrieved from the coupler before calling tphysac())

kaizhangpnl commented 8 years ago

@kaizhangpnl : is it possible that cflx(:,1) is usually zero when calling tphysbc?

because it seems like when we restart, (before all the recent restart bugs), the code we are discussing here will initialized cflx(:,1) to zero, and then call cam_run1, including tphsybc(), and then enter the usual CAM timstepping loop.

As Wuyin mentioned already, cflx is not used in tphysbc previously, but cflx(:,1) values are not always zero.

In my mods, I moved the qqflx_fixer to the point before CLUBB is called, so that the fixer is called after each sub-cycle when the Q profile is updated in CLUBB and MG2. I also added diagnostic qflx output in tphysbc.

philrasch commented 8 years ago

cflx was not used by the tphysbc routines until CLUBB moved the turbulent transport from tphysac to tphysbc. Then we replaced the LHFLX/latvap calculation with QFLX=cflx(:1).

I have an older version of restart_physics.F90 with CLUBB that does not have cflx set to zero. It would be interesting to know when that line was introduced into the code