Open garkenyon opened 8 years ago
In HyPerCol::advanceTime(), in the same spot we call sigpending to check for SIGUSR1, we could do the curl statement (or maybe it would be wget), and if there is a termination warning we could set checkpointSignal to 2 (sending SIGUSR1 sets checkpointSignal to 1). I think we'd want to make sure we don't fetch the URL more often than the Amazon-recommended 5 seconds, but it should be pretty straightforward to add.
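A rough sketch of the rate limiting described above, in case it helps the discussion. Everything here is hypothetical: `TerminationNoticePoller` and `fetchNotice` are made-up names, not existing PetaVision code, and the actual curl/wget request is left as an injected callback so the throttling logic can be tried without network access. The return value of 2 mirrors the proposed `checkpointSignal` value; the 5-second default is the interval mentioned above.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <utility>

// Hypothetical helper (not part of PetaVision): polls for a termination
// notice, but never more often than a minimum interval.
class TerminationNoticePoller {
  public:
   using Clock = std::chrono::steady_clock;

   // fetchNotice returns true when a termination notice is present. In a
   // real build it would wrap the curl/wget request; here it is injected.
   explicit TerminationNoticePoller(
         std::function<bool()> fetchNotice,
         std::chrono::seconds minInterval = std::chrono::seconds(5))
         : mFetchNotice(std::move(fetchNotice)), mMinInterval(minInterval) {
      assert(mFetchNotice); // an empty callback would be a programming error
   }

   // Returns 2 (the proposed checkpointSignal value) if a notice was seen
   // on this call, and 0 if there was no notice or the call was throttled.
   int poll(Clock::time_point now = Clock::now()) {
      if (mHasPolled && now - mLastPoll < mMinInterval) {
         return 0; // too soon since the last fetch; skip the request
      }
      mHasPolled = true;
      mLastPoll  = now;
      return mFetchNotice() ? 2 : 0;
   }

  private:
   std::function<bool()> mFetchNotice;
   std::chrono::seconds mMinInterval;
   Clock::time_point mLastPoll{};
   bool mHasPolled = false;
};
```

The idea would be to call `poll()` next to the existing SIGUSR1 check and assign a nonzero result to `checkpointSignal`. Passing the time in as a parameter is only there to make the throttle easy to test.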
Alternatively, we could have PV_Init launch a simple script that runs the curl statement every 5 seconds, and sends SIGUSR1 to the PetaVision process when necessary. One thing about that is we might want to be able to see in the log file whether the job terminated from Amazon killing the instance or from the user running killall -SIGUSR1.
The first approach seems easier to implement. Maybe we could keep track of the time of the last wget/curl AWS termination check to make sure we don't check too often. 2 minutes is a long time. Just ask Peyton Manning! Since we would at most only be checking as often as we check SIGUSR1, there's no reason to check the termination condition more often than that.
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-interruptions.html#spot-instance-termination-notices
The above link states that AWS provides a 2-minute warning before termination. Can we use this warning the same way we use
$: killall -SIGUSR1
to write a final checkpoint before termination? In fact, we almost don't even have to formally checkpoint with the above mechanism.
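For reference, here is a minimal sketch of how a pending SIGUSR1 can be detected with POSIX calls, which is the mechanism a termination warning would piggyback on. It assumes the signal is blocked so that it stays pending instead of killing the process; the function names are hypothetical and this is not a copy of what HyPerCol does.

```cpp
#include <csignal>
#include <signal.h>

// Block signum so that delivery is deferred and it shows up as pending.
inline bool blockSignal(int signum) {
   sigset_t set;
   sigemptyset(&set);
   sigaddset(&set, signum);
   return sigprocmask(SIG_BLOCK, &set, nullptr) == 0;
}

// Returns true if signum has been sent and is waiting to be delivered.
inline bool isSignalPending(int signum) {
   sigset_t pending;
   sigemptyset(&pending);
   if (sigpending(&pending) != 0) {
      return false; // could not query the pending set
   }
   return sigismember(&pending, signum) == 1;
}

// Removes a pending signum from the pending set so the same request is not
// reported twice. Returns true if a signal was consumed.
inline bool consumePendingSignal(int signum) {
   if (!isSignalPending(signum)) {
      return false;
   }
   sigset_t set;
   sigemptyset(&set);
   sigaddset(&set, signum);
   int received = 0;
   // sigwait returns immediately here because the signal is already pending.
   return sigwait(&set, &received) == 0 && received == signum;
}
```

With something like this, a time step could call `consumePendingSignal(SIGUSR1)` and, separately, check for the AWS warning, then write the final checkpoint if either one fires.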