billsacks closed this issue 4 years ago.
This issue encompasses any changes needed to force LILAC and CTSM to start up from restart files – i.e., doing a continue run rather than a startup run. (In contrast, #909 is about writing restart files at the appropriate time.)
Following the guidance of @mvertens's comment in #863, I have demonstrated to myself that restarts work from the demo lilac atm driver. All that's needed there is setting `atm_starttype` to `'continue'` rather than `'startup'`. This flag is passed to `lilac_init2` via the `starttype_in` argument.
Thus, I believe that all WRF needs to do is to set the `starttype_in` argument appropriately. Then, as long as the necessary rpointer and restart files are present in the run directory from an earlier run, LILAC and CTSM should start up from those files.
Thanks to @mvertens for putting in place this restart capability in LILAC a few months ago!
In case it's helpful, the process for doing a restart run with LILAC's demo atm driver is:
For the first run, set the following in `atm_driver_in`:

```
atm_start_day = 1
atm_stop_day = 2
atm_starttype = 'startup'
```

For the restart run, change `atm_driver_in` to contain:

```
atm_start_day = 1
atm_stop_day = 3
atm_starttype = 'continue'
```

This can be compared against a straight-through run, in which `atm_driver_in` contains:

```
atm_start_day = 1
atm_stop_day = 3
atm_starttype = 'startup'
```
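As a side-by-side sketch (file names here are purely illustrative), the only items that change between the first segment and the restart segment are `atm_stop_day` and `atm_starttype`:

```shell
# Write the two demo-driver namelists and diff them; only the stop day
# and the start type differ between the two run segments.
cat > atm_driver_in.first <<'EOF'
atm_start_day = 1
atm_stop_day = 2
atm_starttype = 'startup'
EOF

cat > atm_driver_in.restart <<'EOF'
atm_start_day = 1
atm_stop_day = 3
atm_starttype = 'continue'
EOF

# diff exits nonzero when the files differ; '|| true' keeps the sketch going.
diff atm_driver_in.first atm_driver_in.restart || true
```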
Tracing the relevant variables through the `atm_driver` code, though, I believe that the only important thing from LILAC-CTSM's perspective is the setting of the `starttype_in` argument to `lilac_init2` (as mentioned above): everything else done with these namelist items seems to be specific to the demo atm driver, and shouldn't be needed in a real atmosphere (or rather, a real atmosphere should already be doing everything else it needs, if it already supports restarts).
At one point I thought it might also be necessary to set `start_type = continue` in `ctsm.cfg` (this gets passed to CTSM's build-namelist). My testing suggests this isn't necessary, but it may be that it should still be done.
@ekluzek do you know what the `-clm_start_type` argument to CLMBuildNamelist accomplishes?
Also note: when run via the demo atm driver, it appears that the various `atm_start_*` arguments to `lilac_init2` give the start time of the very beginning of the run, NOT the start time of this restart run segment. However, I'm not sure if it matters. @mvertens do you know if these `atm_start_*` arguments are important in a restart run? If they aren't, we should add a comment saying that they are ignored when `starttype_in` is `'continue'`. Speaking of which, we should also document the allowed values for `starttype_in` (I'm not sure of this myself... is it just `'startup'` and `'continue'`?).
I traced the logic of `clm_start_type` and it doesn't look like it needs to be set differently for restarts in the lilac context. There's some old logic that sets a bit of the driver namelist based on this, which to the best of my understanding could be removed (maybe a carryover from when CLM could be run in standalone mode outside of CESM?). Other than that, it seems to be used to determine whether a default value is needed for `finidat`, but it seems safe to just leave it set to `default` even in a restart run. For what it's worth, there were no differences in the `lnd_in` file generated by `lilac_config/buildnml` when I set `start_type` to `default`, `startup`, or `continue`.
> Thus, I believe that all WRF needs to do is to set the `starttype_in` argument appropriately
I've implemented a capability in WRF to pass in `starttype_in` as either `'startup'` or `'continue'`. It seems that WRF is sending the correct value to LILAC. When running from a restart file, I receive the following error on the LILAC side:

```
20200701 103542.160 ERROR PET000 lnd_comp_esmf:[lnd_run] CTSM clock not in sync with lilac clock
```
@billsacks do you have any insight on what might cause such an issue? I have not followed the logic in CTSM side.
I'm not sure off-hand. It looks like that error message should print the ctsm and lilac times just before the error message you pasted in:

```fortran
! Note that the driver clock has not been updated yet - so at this point
! CTSM is actually 1 coupling intervals ahead of the driver clock
if ( (ymd /= ymd_lilac) .or. (tod /= tod_lilac) ) then
   write(iulog,*)'ctsm ymd=',ymd ,' ctsm tod= ',tod
   write(iulog,*)'lilac ymd=',ymd_lilac,' lilac tod= ',tod_lilac
   call ESMF_LogWrite(subname//" CTSM clock not in sync with lilac clock",ESMF_LOGMSG_ERROR)
   rc = ESMF_FAILURE
   return
end if
```
What do you see for that output?
Thanks for looking into this. This is copied from the log file:

```
clm: completed timestep 31
lilac ymd= 20130401 lilac tod= 2700
ctsm ymd= 20130401 ctsm tod= 2700
lilac ymd= 20130401 lilac tod= 3600
ERROR: lilac error in running ctsm
```
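For what it's worth, plugging the logged values into the same comparison that the Fortran check performs shows the failure directly; this is just a shell restatement, with the numbers copied from the log above:

```shell
# Values copied from the second (failing) comparison in the log:
ctsm_ymd=20130401;  ctsm_tod=2700
lilac_ymd=20130401; lilac_tod=3600

# Same test as the Fortran 'if' in lnd_comp_esmf:
if [ "$ctsm_ymd" -ne "$lilac_ymd" ] || [ "$ctsm_tod" -ne "$lilac_tod" ]; then
  echo "out of sync: lilac is $((lilac_tod - ctsm_tod)) s ahead of ctsm"
fi
# prints: out of sync: lilac is 900 s ahead of ctsm
```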
I am restarting from minute 15 in WRF, which translates correctly, because this shows up in the log:

```
CTSM start time: 2013 4 1 900
```
But then I have the following in the log file:

```
------------------------------------------------------------
Successfully read restart data for restart run
Successfully initialized the land model
begin continuation run at:
nstep= 31 year= 2013 month= 4 day= 1 seconds= 2790
```
In case you would like to take a look at the log, it is in /glade/scratch/negins/wrf_testing/test/em_real_ctsm/rsl.out.0000
Could this be the cause of the problem? It looks like the lilac and ctsm restart files are being taken from different times, according to the rpointer files:

```
--- rpointer.lilac ---
ctsm_lilac.lilac.r.2013-04-01-03600.nc
--- rpointer.lnd ---
./ctsm_lilac.clm2.r.2013-04-01-02700.nc
```
I'm pretty sure that both ctsm and lilac read whatever restart file is given in that rpointer file. I'm not sure off-hand what would cause them to be out of sync like this, but does that information give you anything more to go on?
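One quick way to see the mismatch (purely illustrative; the `stamp` helper and its sed pattern are my own, not part of LILAC) is to pull the date stamp out of each file name listed in the rpointer files:

```shell
# Restart file names as listed in the two rpointer files:
lilac_file="ctsm_lilac.lilac.r.2013-04-01-03600.nc"
ctsm_file="./ctsm_lilac.clm2.r.2013-04-01-02700.nc"

# Extract the YYYY-MM-DD-SSSSS stamp that follows the ".r." component.
stamp() { echo "$1" | sed -E 's/.*\.r\.([0-9-]+)\.nc/\1/'; }

echo "lilac restart stamp: $(stamp "$lilac_file")"   # 2013-04-01-03600
echo "ctsm  restart stamp: $(stamp "$ctsm_file")"    # 2013-04-01-02700
```

The two stamps differ by 900 s, so the two components are indeed reading restarts from different times.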
Thanks for the response.
When I start from minute 15 in WRF, shouldn't they be starting from `ctsm_lilac.clm2.r.2013-04-01-00900.nc` and `ctsm_lilac.clm2.rh0.2013-04-01-00900.nc`?
I have to dig into the CTSM/LILAC side to see what causes this issue. What I think is that, for some reason, when `starttype_in` is `continue`, CTSM does not start from the correct restart file.
> Thus, I believe that all WRF needs to do is to set the `starttype_in` argument appropriately.
Based on this I am only setting `starttype_in` to `continue` for restart runs, but should we have any additional code to change `atm_start_day`, `atm_start_hour`, etc.?
I'm not sure if I ever looked closely, but I'm pretty sure that the restart logic in LILAC/CTSM is set up similarly to what we do in CESM: The code looks for an rpointer file and reads whatever file name is listed there. It's possible we'll want to change this logic at some point, but for now I think that's what's done. So, for now, if you want it to read something different from what's listed in the rpointer file, the simplest thing to do is probably just to hand-edit the rpointer file to point to the appropriate restart files for ctsm & lilac.
By the way: I guess my earlier comments in this issue were assuming that the rpointer file was written to point to the latest restart file, and that that is the restart file you want to start from. Sorry I didn't make that explicit.
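As a concrete sketch of that hand-edit (the case name and the desired stamp here are illustrative, following the file-name pattern from this run): overwrite both rpointer files so lilac and ctsm read restarts from the same time:

```shell
# Desired restart time: 2013-04-01, 900 s into the day (WRF's minute 15).
stamp="2013-04-01-00900"

# Point both rpointer files at restart files with the same stamp.
echo "ctsm_lilac.lilac.r.${stamp}.nc"  > rpointer.lilac
echo "./ctsm_lilac.clm2.r.${stamp}.nc" > rpointer.lnd

cat rpointer.lilac rpointer.lnd
```

The corresponding restart files must of course exist in the run directory for this to work.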
It seems like manually changing the land and lilac rpointer files to point to the WRF restart start time fixes this issue.
I would recommend changing this in the future, but as @billsacks suggested, that requires a higher-level discussion of how rpointer files in CESM work. I am still not sure why we need rpointer files pointing to the last restart file at all in CESM and CTSM.
On the other hand, WRF-CTSM runs seem to crash after some timesteps (after writing both output and restart files successfully) with the following message:

```
Obtained 10 stack frames.
./wrf.exe() [0x39bcb53]
./wrf.exe() [0x39eb7b8]
./wrf.exe() [0x39f2191]
./wrf.exe() [0x39af725]
./wrf.exe() [0x393cdd3]
./wrf.exe() [0x2fa0b86]
./wrf.exe() [0x2f17d84]
./wrf.exe() [0x2ea35ed]
/glade/work/himanshu/PROGS/esmf/8.1.0b14/mpt/2.19/intel/19.0.5/lib/libO/Linux.intel.64.mpt.default/libesmf.so(_ZN5ESMCI6FTable12callVFuncPtrEPKcPNS_2VMEPi+0x44e) [0x2b4b8bc3048e]
/glade/work/himanshu/PROGS/esmf/8.1.0b14/mpt/2.19/intel/19.0.5/lib/libO/Linux.intel.64.mpt.default/libesmf.so(ESMCI_FTableCallEntryPointVMHop+0x21b) [0x2b4b8bc3404b]
```
These are the bottom lines of the rsl.out files:

```
hist_htapes_wrapup : Writing current time sample to local history file
./ctsm_lilac.clm2.h2.2013-04-01-00000.nc at nstep = 40
for history time interval beginning at 0.000000000000000E+000 and ending at
4.166666666666666E-002
hist_htapes_wrapup : history tape 1 : no open file to close
hist_htapes_wrapup : history tape 2 : no open file to close
hist_htapes_wrapup : Closing local history file
./ctsm_lilac.clm2.h2.2013-04-01-00000.nc at nstep = 40
```
Any comments on this would be appreciated.
The path to this simulation is: /glade/scratch/negins/wrf_testing/test/em_real_ctsm
From using addr2line on the traceback, it indeed looks like this is dying when trying to close the file. It seems like the line numbers in the traceback are a bit off, maybe because this isn't built in debug mode, but I see:

```
$ addr2line -e wrf.exe 0x2fa0b86
/glade/scratch/negins/wrf_testing/CTSM/src/main/histFileMod.F90:3564
$ addr2line -e wrf.exe 0x393cdd3
/glade/scratch/negins/wrf_testing/CTSM/cime/src/externals/pio2/src/flib/piolib_mod.F90:1518
$ addr2line -e wrf.exe 0x39af725
/glade/scratch/negins/wrf_testing/CTSM/cime/src/externals/pio2/src/clib/pio_file.c:421
```
I realize that isn't much help....
I'm wondering if this will work with pio1 rather than pio2. Let me get back to you on whether this is worth trying.
Just noticed: I also tested changing `restart_interval`, and the same error happened after some timesteps.
Switching from PIO2 to PIO1 caused the following error at runtime:

```
paramMod.F90::readParameters :: reading CLM parameters
(GETFIL): attempting to find local file clm5_params.c200519.nc
(GETFIL): using /glade/scratch/slevis/ctsm_build_dir/inputdata/lnd/clm2/paramdata/clm5_params.c200519.nc
ncd_inqvid: variable theta_cj is not on dataset
ENDRUN:
ERROR:
-Error reading in parameters file:theta_cjERROR in PhotosynthesisMod.F90 at line 685
```

This issue was solved by regenerating the `lnd_in` file using the `make_runtime_inputs` script.
Using PIO1 instead of PIO2 takes care of the issue that we were experiencing. Thanks to @billsacks for suggesting this. The restart capability of WRF-CTSM is now working using PIO1. This capability was validated by running cprnc on the outputs of a straight-through run and the corresponding restart run, confirming that they are bit-for-bit (BFB) identical.
Great, that's fantastic to hear!
To clarify for future reference:
> Switching from PIO2 to PIO1 caused the following error during the runtime
That error was not caused by the switch to PIO1, but rather by the update in CTSM version, which required an update to the lnd_in file.
Thank you for your work on this @negin513 ! Feel free to close this issue if/when you feel it is fully resolved.
@negin513 now that we are using pio2 again, can you please rerun some testing of the latest ctsm in wrf to see if you run into problems like https://github.com/ESCOMP/CTSM/issues/876#issuecomment-653189406 ? I think most likely the fixes we put in place for pio2 recently will solve that problem, too, but it would be good to confirm before we advertise this.
@billsacks: I used the latest version of ctsm and ran a month of WRF-CTSM with pio2 without any problems. :-) So that is good, and I don't think we have the same issue anymore! To double-check, I am running another simulation starting from a restart to see whether this problem shows up. I will let you know if there is any problem with the restart run.
Thanks @negin513 !
For now I'm not handling restarts in the WRF-LILAC-CTSM coupling. We'll need to return to this, allowing: