statefile checkpointing

When a VIC simulation is interrupted, an option to checkpoint based on the statefile would be helpful. That is to say, if I resubmit the interrupted simulation, instead of re-running the first active cell in the soil file (and all others that ran successfully), the model would first check the output statefile to see which cell was last completed. Then, rather than starting at the first active cell the model would skip to the next active cell after the last completed. Furthermore, the statefile would be appended to so that upon completion of the full domain the end state file would be fully populated for all active cells (those that initially succeeded and those that eventually succeeded after re-submission(s)).

A second (more intricate) level of functionality this feature might include would be checkpointing based on a combination of the statefile being successfully written and the simulation successfully reaching the end of the modeling window. This would serve to catch the case where the user wants state written out sometime before the end of the modeling window.

Thanks @mrstu. This is definitely a feature we want to add to all the drivers, including the RASM driver.

I'll give an example of how this sort of thing is done in CESM, and we can discuss whether that would meet your needs as well as those of the rest of VIC development. This will be pretty long, but I think it will help show how it could be done.

First, each run is configured with REST_OPTION, and REST_N or REST_DATE. These values set how frequently the model writes restart files. In the example below, I write a restart file every 4 months.

<!--"sets frequency of model restart writes (same options as STOP_OPTION) (must be nyear(s) for _GLC compsets) (char) " -->
<entry id="REST_OPTION"   value="nmonths"  />    

<!--"sets model restart writes with REST_OPTION and REST_DATE (char) " -->
<entry id="REST_N"   value="4"  />    

<!--"date in yyyymmdd format, sets model restart write date with REST_OPTION and REST_N (char) " -->
<entry id="REST_DATE"   value="$STOP_DATE"  />

Then, while the model is running, a function checks after each timestep whether it is time to write a state file. In VIC.5, this would be a driver-level function.

bool
write_state() {

   bool state_flag = false;

   // Some time/calendar computations
   ...
   if (time == some_state_time) {
       state_flag = true;
   }
   return state_flag;
}
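
To make the elided calendar check a bit more concrete, here is a minimal sketch for the REST_OPTION = nmonths case only. The `sim_date` struct and the `restart_due_nmonths` function are hypothetical names, not existing VIC or CESM code, and the sketch assumes the run starts at midnight on the first day of a month and that the driver can supply the date at the end of the current timestep.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical date type: the model time at the END of the current step. */
typedef struct {
    int year;
    int month;   /* 1-12 */
    int day;     /* 1-31 */
    int seconds; /* seconds since midnight */
} sim_date;

/* Return true when a restart should be written under the "nmonths" option:
 * the step ends exactly at midnight on the first of a month, and a whole,
 * non-zero multiple of rest_n months has elapsed since the run start. */
bool
restart_due_nmonths(sim_date start, sim_date now, int rest_n)
{
    int months_elapsed;

    assert(rest_n > 0);

    /* Only month boundaries qualify. */
    if (now.day != 1 || now.seconds != 0) {
        return false;
    }

    months_elapsed = (now.year - start.year) * 12 + (now.month - start.month);

    return months_elapsed > 0 && months_elapsed % rest_n == 0;
}
```

With REST_N = 4 and a run starting on 2008-01-01, this returns true at 2008-05-01, 2008-09-01, 2009-01-01, and so on. Other REST_OPTION values (ndays, nyears, a fixed REST_DATE) would each need their own branch.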

Then, in the run time portion of the code, the model just does a quick check to see if it is time to write the state file:

 // loop over all timesteps
 for (current = 0; current < global_param.nrecs; current++) {
     // read forcing data
     vic_force();

     // run vic over the domain
     vic_image_run();

     // if output:
     vic_write();

     // if save:
     if (write_state()) {
         vic_store();
     }
 }

One thing CESM does that really makes checkpointing easy is save what it calls rpointer files. This "restart pointer" file is a text file that simply contains the name and path of the most recent restart file. For example, an rpointer file written by VIC on 2008-01-01 would contain: RUNNAME.vic.r.2008-01-01-00000.nc

The last piece of the CESM approach is an option to differentiate between "startup" and "continue" runs. When a run is a "continue" run, the rpointer file is used to read in an initial state. When a run is a "startup" run, the model is either initialized with another initial state file or is started up "dry".
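
As a rough illustration of the rpointer idea, the sketch below builds a restart file name in the pattern shown above, writes it to a pointer file, and reads it back for a "continue" run. The function names (`restart_filename`, `write_rpointer`, `read_rpointer`) and the pointer file path are assumptions made for this example; none of them are existing VIC or CESM routines.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build a restart file name such as RUNNAME.vic.r.2008-01-01-00000.nc.
 * Returns 0 on success, -1 if the buffer is too small. */
int
restart_filename(char *buf, size_t len, const char *run_name,
                 int year, int month, int day, int seconds)
{
    int n = snprintf(buf, len, "%s.vic.r.%04d-%02d-%02d-%05d.nc",
                     run_name, year, month, day, seconds);

    return (n < 0 || (size_t) n >= len) ? -1 : 0;
}

/* Overwrite the rpointer file with the name of the newest restart file. */
int
write_rpointer(const char *rpointer_path, const char *restart_name)
{
    FILE *fp = fopen(rpointer_path, "w");

    if (fp == NULL) {
        return -1;
    }
    if (fprintf(fp, "%s\n", restart_name) < 0) {
        fclose(fp);
        return -1;
    }
    return fclose(fp) == 0 ? 0 : -1;
}

/* Read the restart file name back for a "continue" run.
 * Returns -1 if the rpointer file is missing or empty. */
int
read_rpointer(const char *rpointer_path, char *buf, size_t len)
{
    FILE *fp;

    assert(len > 0);

    fp = fopen(rpointer_path, "r");
    if (fp == NULL) {
        return -1;
    }
    if (fgets(buf, (int) len, fp) == NULL) {
        fclose(fp);
        return -1;
    }
    fclose(fp);
    buf[strcspn(buf, "\r\n")] = '\0'; /* strip the trailing newline */
    return 0;
}
```

In a scheme like this, it is safest to update the rpointer file only after the restart file itself has been completely written and closed, so that an interrupted run never leaves the pointer referring to a partial file.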

So basically, you could save state files without stopping your run, and then you could arbitrarily restart a run using an rpointer file. All this has to happen for the RASM driver that I'm working on except for the development of the write_state() function.

Are you mostly interested in seeing this happen in VIC.4 for your current project, or do you see this being a useful feature down the road? If it is for VIC.4, we probably won't have another minor release, so you would need to do this on your own personal branch. Otherwise, maybe we could work together to get this into VIC.5.

This functionality would depend a bit on the driver. As described here, the functionality is specific to the classic mode, since in that case the state file for a given date gets written incrementally (as each grid cell reaches that date and then runs past it). It makes sense not to rerun the model for cells that have completed.

This will be somewhat different in all the image modes, in which all grid cells reach the same time step at the same time (approximately) and each state gets written to the statefile as a spatial field (in a single nc_put call). In that case you want the functionality that @jhamman describes: Write frequent states during the model run and then restart as he describes.

@jhamman, this is definitely not critical for my current project (using VIC.4) but I hit the corner case and thought it would be a nice feature to consider.

@bartnijssen, yes, definitely a "classic" enhancement.

Thanks guys. Very interesting context.

For the classic driver, you may be able to find a way to save how many grid cells have been finished and then restart where you left off. So rather than having an rpointer file that contains a path, you would have one that contains a grid cell index. You would just need to make sure the statefile is properly closed or flushed after each grid cell completes.
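
A minimal sketch of that grid cell index idea follows. It assumes cells are processed in soil file order and that the statefile is an ordinary C stream; the function names and the checkpoint file are hypothetical and not part of the classic driver.

```c
#include <assert.h>
#include <stdio.h>

/* Return the index of the first grid cell that still needs to run:
 * 0 if no checkpoint file exists, (last completed index + 1) otherwise,
 * or -1 if the checkpoint file exists but cannot be parsed. */
long
next_cell_to_run(const char *checkpoint_path)
{
    long  last_done;
    FILE *fp = fopen(checkpoint_path, "r");

    if (fp == NULL) {
        return 0; /* no checkpoint: start from the first cell */
    }
    if (fscanf(fp, "%ld", &last_done) != 1 || last_done < 0) {
        fclose(fp);
        return -1; /* corrupt checkpoint: let the caller decide */
    }
    fclose(fp);
    return last_done + 1;
}

/* Record that cell_index has completed. The statefile is flushed first so
 * that the checkpoint never runs ahead of the state data on disk. */
int
record_cell_done(const char *checkpoint_path, FILE *statefile, long cell_index)
{
    FILE *fp;

    assert(cell_index >= 0);

    if (statefile != NULL && fflush(statefile) != 0) {
        return -1;
    }
    fp = fopen(checkpoint_path, "w");
    if (fp == NULL) {
        return -1;
    }
    if (fprintf(fp, "%ld\n", cell_index) < 0) {
        fclose(fp);
        return -1;
    }
    return fclose(fp) == 0 ? 0 : -1;
}
```

On resubmission, the cell loop would start at the value returned by `next_cell_to_run()` and, when that value is greater than zero, open the statefile in append mode rather than truncating it.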

That said, once we have the image driver running with MPI and full netCDF I/O, I don't see many cases where you would use the classic driver for large jobs.

UW-Hydro / VIC

statefile checkpointing #195