PetaVision / OpenPV

PetaVision is a C++ library for designing and deploying large-scale neurally-inspired computational models.
http://petavision.github.io
Eclipse Public License 1.0

Checkpoint upon AWS spot instance termination notice #29

Open garkenyon opened 8 years ago

garkenyon commented 8 years ago

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-interruptions.html#spot-instance-termination-notices

The above link states that AWS provides a two-minute warning before it terminates a spot instance. Can we use this warning the same way we use

$: killall -SIGUSR1

to write a final checkpoint before termination? In fact, we almost don't even have to formally checkpoint with the above mechanism.

peteschultz commented 8 years ago

In HyPerCol::advanceTime(), in the same spot where we call sigpending to check for SIGUSR1, we could run the curl statement from the linked page (or maybe it would be wget), and if there is a termination warning we could set checkpointSignal to 2 (sending SIGUSR1 sets checkpointSignal to 1). I think we'd want to make sure we don't fetch the URL more often than the Amazon-recommended interval of 5 seconds, but it should be pretty straightforward to add.
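A minimal sketch of that check, not the actual HyPerCol code. The function name pollCheckpointSignal and the terminationNoticePosted callback are hypothetical; the callback stands in for whatever ends up fetching the URL (curl, wget, or a library call). It also assumes SIGUSR1 is blocked in the calling thread, since sigpending only reports blocked signals.

```cpp
#include <functional>
#include <signal.h>

// Hypothetical helper: returns the value to store in checkpointSignal.
//   0 = nothing pending
//   1 = SIGUSR1 is pending (the existing behavior described above)
//   2 = AWS has posted a spot termination notice (the proposed addition)
// terminationNoticePosted is an assumed callback that wraps the HTTP fetch.
int pollCheckpointSignal(const std::function<bool()> &terminationNoticePosted) {
    sigset_t pending;
    sigemptyset(&pending);
    sigpending(&pending);  // only reports signals that are currently blocked

    if (sigismember(&pending, SIGUSR1) == 1) {
        return 1;  // user-requested checkpoint takes priority
    }
    if (terminationNoticePosted()) {
        return 2;  // instance is about to be reclaimed
    }
    return 0;
}
```

Passing the fetch in as a callback keeps the signal logic testable without network access; the caller would still need to clear the pending SIGUSR1 however it does today.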

Alternatively, we could have PV_Init launch a simple script that runs the curl statement every 5 seconds and sends SIGUSR1 to the PetaVision process when necessary. One drawback is that we might want to be able to see in the log file whether the job terminated because Amazon killed the instance or because the user ran killall -SIGUSR1, and with this approach both cases arrive as the same signal.
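If the checkpointSignal values from the first approach are used (1 for SIGUSR1, 2 for a termination notice), the log line could be chosen from that value. This is only an illustration; checkpointReason and the message wording are hypothetical and not part of PetaVision.

```cpp
#include <string>

// Hypothetical helper: maps the checkpointSignal value discussed above
// to a message that can be written to the log before checkpointing.
std::string checkpointReason(int checkpointSignal) {
    switch (checkpointSignal) {
        case 1: return "Checkpointing: received SIGUSR1";
        case 2: return "Checkpointing: AWS spot termination notice";
        default: return "No checkpoint signal";
    }
}
```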

garkenyon commented 8 years ago

The first approach seems easier to implement. Maybe we could keep track of the time of the last wget/curl AWS termination check to make sure we don't check too often. 2 minutes is a long time. Just ask Peyton Manning! Since we would at most only be checking as often as we check for SIGUSR1, there's no reason to check the termination condition more often than that.
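Tracking the last check could look something like the sketch below. PollThrottle is a hypothetical name, and the 5-second interval is the figure mentioned earlier in this thread; the caller would ask due() each time it checks for SIGUSR1 and only fetch the URL when it returns true.

```cpp
#include <chrono>

// Hypothetical helper: remembers when the termination URL was last fetched
// and reports whether enough time has passed to fetch it again.
class PollThrottle {
  public:
    using Clock = std::chrono::steady_clock;

    explicit PollThrottle(std::chrono::seconds minInterval)
        : mMinInterval(minInterval) {}

    // Returns true if a check is due at `now` and records `now` as the
    // time of the last check; returns false if the last check was too recent.
    bool due(Clock::time_point now) {
        if (mHasChecked && now - mLastCheck < mMinInterval) {
            return false;
        }
        mLastCheck  = now;
        mHasChecked = true;
        return true;
    }

  private:
    std::chrono::seconds mMinInterval;
    Clock::time_point mLastCheck{};
    bool mHasChecked = false;  // the first call is always due
};
```

A steady clock is used so that adjustments to the system time cannot make the check fire early or stall.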