AureumChaos / LEAP

A general purpose Library for Evolutionary Algorithms in Python.
Academic Free License v3.0

Add support for checkpoints and restarts #155

Open markcoletti opened 3 years ago

markcoletti commented 3 years ago

A running EA should be able to checkpoint (save its state) and later restart from that checkpoint. This is particularly important in HPC contexts, where batch-job wall-time budgets can halt a running job before it finishes.

lukepmccombs commented 1 year ago

Creating a checkpoint-saving operator should be trivial. Pickling is already a necessity for supporting distributed workloads, so pickling populations shouldn't introduce any conflicts.
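Something like this rough sketch, maybe, as a pass-through pipeline operator; the population is assumed to be a plain list of picklable individuals, and checkpoint_dir, modulo, and the internal generation counter are made-up names, not existing LEAP parameters:

```python
import os
import pickle


def checkpoint_probe(checkpoint_dir="checkpoints", modulo=10):
    """Build a pass-through operator that pickles the population every
    `modulo` generations.  All names here are illustrative, not LEAP API."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    generation = 0

    def do_checkpoint(population):
        nonlocal generation
        generation += 1
        if generation % modulo == 0:
            path = os.path.join(checkpoint_dir, f"gen_{generation}.pkl")
            with open(path, "wb") as f:
                pickle.dump(population, f)
        return population  # hand the population along unchanged

    return do_checkpoint
```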

Loading may be a bit more difficult, but I'd think anywhere init_pop_size is used could take a secondary, mutually exclusive parameter that is a pre-constructed population, say init_pop. That would open it up to more use cases too, like transferring populations between algorithms.
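The loading side might then be as simple as unpickling and handing the result to that (currently hypothetical) init_pop parameter; generational_ea below just stands in for whichever top-level algorithm function is being called:

```python
import pickle


def load_checkpoint(path):
    """Unpickle a previously saved population."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Hypothetical usage if an init_pop parameter existed alongside init_pop_size:
# parents = load_checkpoint("checkpoints/gen_50.pkl")
# final_pop = generational_ea(..., init_pop=parents)  # not current LEAP API
```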

markcoletti commented 1 year ago

Yes, I think pickling and unpickling for checkpoints and restarts would be the easy part. The harder bit will be figuring out when to do checkpoints. (Knowing when to do restarts is obviously easy. ;) )

I can see a couple approaches to this that are not necessarily mutually exclusive.

There is some nuance to consider. For example, say you're doing hyperparameter optimization for a deep learner where training can take, oh, an hour. The system sends a signal that job death is imminent, so it's time to save that work and exit gracefully. Sure, pickling the current population of already-evaluated individuals is straightforward, but maybe you'd also like to later resume the in-progress training runs from their last epoch? Most deep-learning frameworks have a checkpoint/restart system, so presumably the different training runs can detect external UNIX signals, stop training, and save one last checkpoint. But since those individuals aren't in the current population, because they're literally in the middle of being evaluated, how do you know to resume training upon restart to complete their fitness evaluation? Clearly some bookkeeping needs to happen to track all that.
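For the signal side of that, something along these lines might work; SIGTERM/SIGUSR1 and the flag name are just illustrative, since which signal the scheduler sends (and how far ahead of the kill) is site-specific:

```python
import signal

shutdown_requested = False


def _request_shutdown(signum, frame):
    """Record that the scheduler has warned us; the main loop checks the flag."""
    global shutdown_requested
    shutdown_requested = True


# Many batch schedulers send SIGTERM (or a configurable signal such as
# SIGUSR1) some time before forcibly killing the job.
signal.signal(signal.SIGTERM, _request_shutdown)
signal.signal(signal.SIGUSR1, _request_shutdown)

# Inside the generation loop (sketch):
#   if shutdown_requested:
#       pickle the current population, note which individuals were still
#       mid-evaluation, and exit cleanly
```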

markcoletti commented 1 year ago

It may be worthwhile to see how DEAP and other EC frameworks handle checkpoints and restarts, though I imagine they took the path of least resistance and just worried about pickling/unpickling the one population.

lukepmccombs commented 1 year ago

Judging from their documentation, DEAP doesn't have an explicit checkpointing mechanism, just a suggestion to use pickling for it. Their example also just uses periodic checkpoints, nothing fancy with detecting signals.

markcoletti commented 1 year ago

> Judging from their documentation, DEAP doesn't have an explicit checkpointing mechanism, just a suggestion to use pickling for it. Their example also just uses periodic checkpoints, nothing fancy with detecting signals.

Wow! I'm kinda surprised. Well, this may be an opportunity to whip up something better than just pickling/unpickling populations, particularly with HPC contexts in mind.

lukepmccombs commented 1 year ago

Resuming deep learners is a bit finicky. The actual resuming within evaluation would, I think, be the jurisdiction of the problem class. But retaining the currently evaluating individuals brings some issues with it. I think it would be confusing to have both an already-evaluated initial pop and the evaluation queue.

For checkpointing in general, I'm realizing it may be better to create a custom Representation (or initializer?) that loads the checkpoint instead, feeding it out as the initial pop. It could hide away much of the complexity fairly easily. Although I'm still uncertain what the evaluation queue would look like in this context, and we'd need to consider whether/how to avoid re-evaluating the checkpointed individuals.
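As a strawman, such a thing might look like this; CheckpointRepresentation is not an existing LEAP class, it just mirrors the create_population() call used later in this thread:

```python
import pickle


class CheckpointRepresentation:
    """Hypothetical drop-in that serves a pickled population instead of
    creating a fresh random one.  Not an existing LEAP class."""

    def __init__(self, checkpoint_path):
        self.checkpoint_path = checkpoint_path

    def create_population(self, pop_size, problem):
        with open(self.checkpoint_path, "rb") as f:
            population = pickle.load(f)
        # Re-attach the problem in case it wasn't (or shouldn't be) pickled
        # with the individuals.
        for ind in population:
            ind.problem = problem
        return population[:pop_size]
```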

lukepmccombs commented 1 year ago

The usual process, thinking only of the generational case for now, is:

parents = representation.create_population(pop_size, problem=problem)
parents = init_evaluate(parents)

The evaluation is untied from the representation and supplied by the user. We could add an optional parameter to override it fairly easily (ok), have the user fill it in with a dummy (clunky), or create a new init-evaluate function that conditionally evaluates individuals based on whether they already have a fitness (potentially surprising). I'm leaning towards the last, so long as it is well named and documented for this particular use case.
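A sketch of that last option; it assumes each individual exposes a fitness attribute and an evaluate() method, and the name is only a placeholder:

```python
def evaluate_if_needed(population):
    """Only evaluate individuals that don't already carry a fitness,
    e.g. because they were loaded from a checkpoint."""
    for individual in population:
        if individual.fitness is None:
            individual.evaluate()
    return population
```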

markcoletti commented 1 year ago

Or, have a restart boolean that, if true, just skips those steps and falls right into the offspring creation cycle.
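Roughly, reusing the fragment quoted above and the load_checkpoint() helper sketched earlier (the restart flag and checkpoint_path are made up):

```python
def get_initial_parents(restart, checkpoint_path, representation, pop_size,
                        problem, init_evaluate):
    """Skip creation and initial evaluation when restarting from an
    already-evaluated, checkpointed population."""
    if restart:
        return load_checkpoint(checkpoint_path)
    parents = representation.create_population(pop_size, problem=problem)
    return init_evaluate(parents)
```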

Another thing I've done in the past is monitor evaluation times. So, if one knows the wall clock time at which a job ends, then given the average and standard deviation of evaluation times, it's possible to estimate whether there's enough time to evaluate a new individual in the remaining wall clock time; the evaluations then drain down until the job ends. If any are still cranking when the job terminates, oh well.
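That heuristic could be as simple as this; job_end_time and the k safety factor are illustrative, not LEAP API:

```python
import statistics
import time


def safe_to_launch(eval_durations, job_end_time, k=2.0):
    """Only launch another evaluation if the remaining wall-clock budget
    exceeds mean + k * stdev of the evaluation times observed so far.
    `job_end_time` is a UNIX timestamp for the end of the batch job."""
    if len(eval_durations) < 2:
        return True  # not enough data yet; be optimistic
    remaining = job_end_time - time.time()
    mean = statistics.mean(eval_durations)
    stdev = statistics.stdev(eval_durations)
    return remaining > mean + k * stdev
```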