PetaVision / OpenPV

PetaVision is a C++ library for designing and deploying large-scale neurally-inspired computational models.
http://petavision.github.io
Eclipse Public License 1.0

Checkpointing changes #305

Closed peteschultz closed 4 years ago

peteschultz commented 4 years ago

This pull request incorporates some changes to checkpointing.

First, idle counts are no longer printed to the log file for every phase in every timestep, which led to excessively large log files. Instead, cumulative idle counts are checkpointed using the file names (HyPerCol_name)_IdleCounts.bin and (HyPerCol_name)_IdleCounts.txt.

Next, SIGTERM and SIGINT are no longer intercepted by the signal handler; instead they revert to their default behavior, which normally terminates the process immediately. If one of these signals arrives while a checkpoint is being written or deleted, the checkpoint may be left in an incomplete state. SIGUSR1 continues to mean write a checkpoint immediately and continue, and SIGUSR2 continues to mean continue to the next checkpoint, write it, and then exit.

Finally, when reading from a checkpoint, whether with initializeFromCheckpointDir, with checkpointReadDirectory, or with the Restart flag, the Checkpointer object checks for the presence of the timeinfo.bin file in the checkpoint directory. If using InitializeFromCheckpointDir or CheckpointReadDirectory, it is a fatal error if timeinfo.bin is not found. Since timeinfo.bin is the first file deleted when removing a checkpoint, and the last written when writing a checkpoint, this should catch attempts to read from an incomplete checkpoint. If using the Restart flag, the Checkpointer looks for the highest-index Checkpoint[nnnnn] directory that contains a timeinfo.bin file.

peteschultz commented 4 years ago

Fixing an error in the comment above: SIGUSR2 causes an immediate checkpoint and then an exit. (SIGUSR1 causes an immediate checkpoint and then continues, as stated.)
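
As a rough illustration of the signal behavior as corrected in this comment, a POSIX handler setup could look something like the sketch below. It is not the code from this pull request: the namespace, the flag names, and the installHandlers and serviceSignals functions are all hypothetical, and the sketch assumes the run loop polls the flags once per timestep.

```cpp
#include <csignal>
#include <signal.h> // POSIX sigaction

namespace sketch {

// Flags set by the handler and polled by the run loop (hypothetical names).
volatile std::sig_atomic_t checkpointRequested = 0;
volatile std::sig_atomic_t exitAfterCheckpoint = 0;

// The handler only sets flags; the checkpoint itself is written later,
// outside of signal context.
void onUserSignal(int sig) {
   checkpointRequested = 1;
   if (sig == SIGUSR2) {
      exitAfterCheckpoint = 1;
   }
}

// Installs handlers for SIGUSR1 and SIGUSR2 only. SIGINT and SIGTERM are
// deliberately left at their default disposition.
bool installHandlers() {
   struct sigaction action {};
   action.sa_handler = onUserSignal;
   sigemptyset(&action.sa_mask);
   action.sa_flags = SA_RESTART;
   bool ok = sigaction(SIGUSR1, &action, nullptr) == 0;
   ok      = sigaction(SIGUSR2, &action, nullptr) == 0 && ok;
   return ok;
}

// Intended to be called once per timestep. Writes a checkpoint if one was
// requested and returns true if the run should stop afterwards (SIGUSR2).
template <typename WriteCheckpoint>
bool serviceSignals(WriteCheckpoint writeCheckpoint) {
   if (!checkpointRequested) {
      return false;
   }
   checkpointRequested = 0;
   writeCheckpoint();
   return exitAfterCheckpoint != 0;
}

} // namespace sketch
```

In a run loop this would be used along the lines of `if (sketch::serviceSignals(writeFn)) { break; }` after each timestep. Since SIGINT and SIGTERM are not handled, they can still interrupt a checkpoint that is partly written, which is the case the timeinfo.bin check is meant to catch.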