PetaVision / OpenPV

PetaVision is a C++ library for designing and deploying large-scale neurally-inspired computational models.
http://petavision.github.io
Eclipse Public License 1.0

Interrupt handling #295

Closed peteschultz closed 5 years ago

peteschultz commented 5 years ago

This pull request provides a mechanism to send signals for creating an immediate checkpoint and exiting cleanly. The difference between this request and https://github.com/PetaVision/OpenPV/pull/294 is that this pull request goes into the develop branch.

If SIGINT, SIGTERM, or SIGUSR2 is sent to the global root process, the root broadcasts the signal to the rest of the processes, an immediate checkpoint is performed, and all processes then call MPI_Finalize() and exit().

Sending SIGUSR1 to the global root process still causes an immediate checkpoint but does not cause the job to exit.

Nonroot processes do not respond to these signals directly; instead they receive a broadcast message from the root process.
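
For readers who want a concrete picture of the signal-to-action mapping, here is a minimal standalone sketch. It is not the code in this pull request: `SignalAction`, `recordSignal`, `installHandlers`, `classify`, and `takePendingSignal` are hypothetical names chosen for illustration, and the MPI broadcast is only indicated in a comment. It assumes a POSIX system and treats SIGUSR1 as checkpoint-only.

```cpp
#include <cassert>
#include <initializer_list>
#include <signal.h>

// Hypothetical names for illustration; not OpenPV's actual API.
enum class SignalAction { None, CheckpointOnly, CheckpointAndExit };

// Written by the handler, polled by the run loop on the root process.
static volatile sig_atomic_t gPendingSignal = 0;

// The handler only records the signal number; all real work happens later,
// outside of signal context.
extern "C" void recordSignal(int sig) { gPendingSignal = sig; }

// Install the handler for the four signals named in the description.
void installHandlers() {
    struct sigaction sa {};
    sa.sa_handler = recordSignal;
    sigemptyset(&sa.sa_mask);
    for (int sig : {SIGINT, SIGTERM, SIGUSR1, SIGUSR2}) {
        int rc = sigaction(sig, &sa, nullptr);
        assert(rc == 0);
        (void)rc; // silence unused-variable warnings when NDEBUG is defined
    }
}

// Map a signal number to the behavior described above.
SignalAction classify(int sig) {
    switch (sig) {
        case SIGUSR1: return SignalAction::CheckpointOnly;
        case SIGINT:
        case SIGTERM:
        case SIGUSR2: return SignalAction::CheckpointAndExit;
        default: return SignalAction::None;
    }
}

// Fetch and clear the pending signal (0 if none). In an MPI job, the root
// would broadcast this value (for example with MPI_Bcast) so that every
// process takes the same action on the same timestep.
// Simplification: a signal arriving between the read and the clear is lost.
int takePendingSignal() {
    int sig = gPendingSignal;
    gPendingSignal = 0;
    return sig;
}
```

The handler does nothing but store the signal number, since very little is safe to do inside a signal handler. Polling the flag once per timestep keeps the checkpoint on a timestep boundary.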

If checkpointWrite is false, the checkpoint is written to the path specified in lastCheckpointDir, as if stopTime had been set to the timestep when the signal was sent.