ESCOMP / CTSM

Community Terrestrial Systems Model (includes the Community Land Model of CESM)
http://www.cesm.ucar.edu/models/cesm2.0/land/

Enable restarts in WRF-LILAC-CTSM coupling #876

Closed billsacks closed 4 years ago

billsacks commented 4 years ago

For now I'm not handling restarts in the WRF-LILAC-CTSM coupling. We'll need to return to this, allowing:

billsacks commented 4 years ago

This issue encompasses any changes needed to force LILAC and CTSM to start up from restart files – i.e., doing a continue run rather than a startup run. (In contrast, #909 is about writing restart files at the appropriate time.)

Following the guidance of @mvertens's comment in #863, I have demonstrated to myself that restarts work with the demo lilac atm driver. All that's needed there is setting atm_starttype to 'continue' rather than 'startup'. This flag is passed to lilac_init2 via the starttype_in argument.

Thus, I believe that all WRF needs to do is to set the starttype_in argument appropriately. Then, as long as the necessary rpointer and restart files are present in the run directory from an earlier run, LILAC and CTSM should start up from those files.

Thanks to @mvertens for putting in place this restart capability in LILAC a few months ago!

billsacks commented 4 years ago

In case it's helpful, the process for doing a restart run with LILAC's demo atm driver is:

For the first run, set the following in atm_driver_in

 atm_start_day         = 1
 atm_stop_day          = 2
 atm_starttype         = 'startup'

For the restart run, change atm_driver_in to contain:

 atm_start_day         = 1
 atm_stop_day          = 3
 atm_starttype         = 'continue'

This can be compared against a straight-through run, in which atm_driver_in contains:

 atm_start_day         = 1
 atm_stop_day          = 3
 atm_starttype         = 'startup'

Tracing the relevant variables through the atm_driver code, though, I believe that the only important thing from LILAC-CTSM's perspective is the setting of the starttype_in argument to lilac_init2 (as mentioned above). Everything else done with these namelist items seems to be specific to the demo atm driver and shouldn't be needed in a real atmosphere; or rather, a real atmosphere should already be doing everything else it needs if it already supports restarts.
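
As a concrete illustration, here is a minimal, self-contained Fortran sketch (not actual WRF or LILAC code; the program and the is_restart_run flag are hypothetical) of how a host atmosphere might pick the value it passes as the starttype_in argument to lilac_init2:

  program choose_starttype
    ! Hypothetical sketch: map the host model's restart flag onto the string
    ! that would be passed as the starttype_in argument of lilac_init2.
    implicit none
    logical :: is_restart_run          ! stand-in for WRF's own restart detection
    character(len=16) :: starttype

    is_restart_run = .true.            ! set from the host model's namelist/flags

    if (is_restart_run) then
       starttype = 'continue'          ! continue run: LILAC/CTSM read rpointer + restart files
    else
       starttype = 'startup'           ! cold start
    end if

    ! In the real coupling this value is what would be handed to lilac_init2 as
    ! starttype_in; the demo atm driver gets it from the atm_starttype namelist item.
    write(*,*) 'starttype_in = ', trim(starttype)
  end program choose_starttype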

billsacks commented 4 years ago

At one point I thought it might also be necessary to set start_type = continue in ctsm.cfg (this gets passed to CTSM's build-namelist). This doesn't appear to be necessary from my testing, but it may be that it should still be done.

@ekluzek do you know what the -clm_start_type argument to CLMBuildNamelist accomplishes?

billsacks commented 4 years ago

Also note: when run via the demo atm driver, it appears that the various atm_start_* arguments to lilac_init2 give the start time of the very beginning of the run, NOT the start time of this restart segment. However, I'm not sure if it matters. @mvertens do you know if these atm_start_* arguments are important in a restart run? If they aren't, we should add a comment saying that they are ignored when starttype_in is 'continue'. Speaking of which, we should also document the allowed values for starttype_in (I'm not sure of these myself... are they just 'startup' and 'continue'?)

billsacks commented 4 years ago

I traced the logic of clm_start_type and it doesn't look like it needs to be set differently for restarts in the lilac context. There's some old logic that sets a bit of the driver namelist based on this, which to the best of my understanding could be removed (maybe a carryover from when CLM could be run in standalone mode outside of CESM?). Other than that, it seems to be used to determine whether a default value is needed for finidat, but it seems safe to just let it stay at its default even in a restart run. For what it's worth, there were no differences in the lnd_in file generated by lilac_config/buildnml when I set start_type to default, startup, or continue.

negin513 commented 4 years ago

> Thus, I believe that all WRF needs to do is to set the starttype_in argument appropriately

I've implemented a capability in WRF to pass starttype_in as either "continue" or "restart". It seems that WRF is sending the correct value to LILAC. When running from a restart file, I receive the following error on the LILAC side:

20200701 103542.160 ERROR            PET000 lnd_comp_esmf:[lnd_run]  CTSM clock not in sync with lilac clock

@billsacks do you have any insight into what might cause such an issue? I have not followed the logic on the CTSM side.

billsacks commented 4 years ago

I'm not sure off-hand. It looks like the code should print the ctsm and lilac times just before the error message you pasted in:

    ! Note that the driver clock has not been updated yet - so at this point
    ! CTSM is actually 1 coupling intervals ahead of the driver clock

    if ( (ymd /= ymd_lilac) .or. (tod /= tod_lilac) ) then
       write(iulog,*)'ctsm  ymd=',ymd      ,' ctsm  tod= ',tod
       write(iulog,*)'lilac ymd=',ymd_lilac,' lilac tod= ',tod_lilac
       call ESMF_LogWrite(subname//" CTSM clock not in sync with lilac clock",ESMF_LOGMSG_ERROR)
       rc = ESMF_FAILURE
       return
    end if

What do you see for that output?

negin513 commented 4 years ago

Thanks for looking into this. This is copied from the log file.

 clm: completed timestep           31   
 lilac ymd=    20130401  lilac tod=         2700 
 ctsm  ymd=    20130401  ctsm  tod=         2700 
 lilac ymd=    20130401  lilac tod=         3600 
 ERROR: lilac error in running ctsm        

I am restarting from minute 15 in WRF, which appears to be translated correctly, because this shows up in the log:

CTSM start time: 2013  4  1   900

But then I have the following in the log file:

------------------------------------------------------------
 Successfully read restart data for restart run

 Successfully initialized the land model
 begin continuation run at:
    nstep=           31  year=         2013  month=            4  day=            1  seconds=         2790

In case you would like to take a look at the log, it is in: /glade/scratch/negins/wrf_testing/test/em_real_ctsm/rsl.out.0000

billsacks commented 4 years ago

Could this be the cause of the problem? It looks like the lilac and ctsm restart files are being taken from different times, according to the rpointer files:

--- rpointer.lilac ---
ctsm_lilac.lilac.r.2013-04-01-03600.nc
--- rpointer.lnd ---
./ctsm_lilac.clm2.r.2013-04-01-02700.nc

I'm pretty sure that both ctsm and lilac read whatever restart file is given in that rpointer file. I'm not sure off-hand what would cause them to be out of sync like this, but does that information give you anything more to go on?

negin513 commented 4 years ago

Thanks for the response. When I start from minute 15 in WRF, shouldn't they be starting from ctsm_lilac.clm2.r.2013-04-01-00900.nc and ctsm_lilac.clm2.rh0.2013-04-01-00900.nc? I have to dig into the CTSM/LILAC side to see what causes this issue. My guess is that, for some reason, when starttype_in is 'continue', CTSM does not start from the correct restart file.

> Thus, I believe that all WRF needs to do is to set the starttype_in argument appropriately.

Based on this, I am only setting starttype_in to 'continue' for restart runs, but should we have any additional code to change atm_start_day, atm_start_hour, etc.?

billsacks commented 4 years ago

I'm not sure if I ever looked closely, but I'm pretty sure that the restart logic in LILAC/CTSM is set up similarly to what we do in CESM: The code looks for an rpointer file and reads whatever file name is listed there. It's possible we'll want to change this logic at some point, but for now I think that's what's done. So, for now, if you want it to read something different from what's listed in the rpointer file, the simplest thing to do is probably just to hand-edit the rpointer file to point to the appropriate restart files for ctsm & lilac.
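
For example (exact file names not verified; this assumes a lilac restart file was also written at the 00900 time mentioned above), both hand-edited rpointer files would need to name restart files from the same time, something like:

--- rpointer.lilac ---
ctsm_lilac.lilac.r.2013-04-01-00900.nc
--- rpointer.lnd ---
./ctsm_lilac.clm2.r.2013-04-01-00900.nc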

billsacks commented 4 years ago

By the way: I guess my earlier comments in this issue were assuming that the rpointer file was written to point to the latest restart file, and that that is the restart file you want to start from. Sorry I didn't make that explicit.

negin513 commented 4 years ago

It seems like manually changing the land and lilac rpointer files to point to the WRF restart start time fixes this issue.

I would recommend changing this in the future, but as @billsacks suggested, that requires a higher-level discussion of how rpointer files work in CESM. I am still not sure why CESM and CTSM need rpointer files pointing to the last restart file at all.

negin513 commented 4 years ago

On the other hand, WRF-CTSM runs seem to crash after some timesteps (writing both output and restart files successfully) with the following message:

Obtained 10 stack frames.
./wrf.exe() [0x39bcb53]
./wrf.exe() [0x39eb7b8]
./wrf.exe() [0x39f2191]
./wrf.exe() [0x39af725]
./wrf.exe() [0x393cdd3]
./wrf.exe() [0x2fa0b86]
./wrf.exe() [0x2f17d84]
./wrf.exe() [0x2ea35ed]
/glade/work/himanshu/PROGS/esmf/8.1.0b14/mpt/2.19/intel/19.0.5/lib/libO/Linux.intel.64.mpt.default/libesmf.so(_ZN5ESMCI6FTable12callVFuncPtrEPKcPNS_2VMEPi+0x44e) [0x2b4b8bc3048e]
/glade/work/himanshu/PROGS/esmf/8.1.0b14/mpt/2.19/intel/19.0.5/lib/libO/Linux.intel.64.mpt.default/libesmf.so(ESMCI_FTableCallEntryPointVMHop+0x21b) [0x2b4b8bc3404b]

These are the last lines of the rsl.out files:

 hist_htapes_wrapup : Writing current time sample to local history file 
 ./ctsm_lilac.clm2.h2.2013-04-01-00000.nc at nstep =           40 
  for history time interval beginning at   0.000000000000000E+000  and ending at
    4.166666666666666E-002

 hist_htapes_wrapup : history tape            1 : no open file to close
 hist_htapes_wrapup : history tape            2 : no open file to close

 hist_htapes_wrapup : Closing local history file 
 ./ctsm_lilac.clm2.h2.2013-04-01-00000.nc at nstep =           40

Any comments on this would be appreciated.

The path to this simulation is here: /glade/scratch/negins/wrf_testing/test/em_real_ctsm

billsacks commented 4 years ago

From using addr2line on the traceback, it indeed looks like this is dying when trying to close the file. It seems like the line numbers of the traceback are a bit off, maybe because this isn't built in debug mode, but I see:

$ addr2line -e wrf.exe 0x2fa0b86
/glade/scratch/negins/wrf_testing/CTSM/src/main/histFileMod.F90:3564
$ addr2line -e wrf.exe 0x393cdd3
/glade/scratch/negins/wrf_testing/CTSM/cime/src/externals/pio2/src/flib/piolib_mod.F90:1518
$ addr2line -e wrf.exe 0x39af725
/glade/scratch/negins/wrf_testing/CTSM/cime/src/externals/pio2/src/clib/pio_file.c:421

I realize that isn't much help....

billsacks commented 4 years ago

I'm wondering if this will work with pio1 rather than pio2. Let me get back to you on whether this is worth trying.

negin513 commented 4 years ago

Just noticed that I actually tested changing restart_interval and the same error happened after some timesteps.

negin513 commented 4 years ago

Switching from PIO2 to PIO1 caused the following error during the runtime:

 paramMod.F90::readParameters :: reading CLM  parameters 

 (GETFIL): attempting to find local file clm5_params.c200519.nc

 (GETFIL): using  /glade/scratch/slevis/ctsm_build_dir/inputdata/lnd/clm2/paramdata/clm5_params.c200519.nc

 ncd_inqvid: variable theta_cj is not on dataset

 ENDRUN:
 ERROR: 
 -Error reading in parameters file:theta_cjERROR in PhotosynthesisMod.F90 at line 685

negin513 commented 4 years ago

This issue was solved by regenerating the lnd_in file using the make_runtime_inputs script.

negin513 commented 4 years ago

Using PIO1 instead of PIO2 takes care of the issue we were experiencing. Thanks to @billsacks for suggesting this. The restart capability of WRF-CTSM is now working with PIO1. This capability was validated by using cprnc to confirm BFB (bit-for-bit) agreement between the outputs of the straight-through run and the restart run.

billsacks commented 4 years ago

Great, that's fantastic to hear!

To clarify for future reference:

> Switching from PIO2 to PIO1 caused the following error during the runtime

That error was not caused by the switch to PIO1, but rather by the update to the CTSM version, which required an update to the lnd_in file.

Thank you for your work on this @negin513 ! Feel free to close this issue if/when you feel it is fully resolved.

billsacks commented 4 years ago

@negin513 now that we are using pio2 again, can you please rerun some testing of the latest ctsm in wrf to see if you run into problems like https://github.com/ESCOMP/CTSM/issues/876#issuecomment-653189406? I think the fixes we recently put in place for pio2 will most likely solve that problem too, but it would be good to confirm before we advertise this.

negin513 commented 4 years ago

@billsacks: I used the latest version of ctsm and ran a month of wrf-ctsm with pio2 without any problems. :-) So that is good, and I don't think we have the same issue anymore! To double-check, I am running another simulation starting from a restart to see whether this problem shows up. I will let you know if there is any problem with the restart run.

billsacks commented 4 years ago

Thanks @negin513 !