SCR caches checkpoint data in storage on the compute nodes of a Linux cluster to provide a fast, scalable checkpoint / restart capability for MPI codes.
One can delete checkpoint files from the parallel file system either by calling SCR_Delete() or by setting SCR_PREFIX_SIZE=N, in which case SCR maintains a sliding window of the N most recent checkpoints. In either case, SCR reports the following error when deleting a checkpoint:
Deleting dataset 2 `cycle=200' from `/p/lustre1/user123/problem'
SCR v3.0.0 ERROR: rank 0 on tioga23: Error deleting directory: /p/lustre1/user123/problem/subdir (rmdir returned -1 Directory not empty) @ /scr-v3.0.1/scr/src/scr_io.c:870
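The "Directory not empty" text in the message above is the standard POSIX rmdir() failure (errno ENOTEMPTY, or EEXIST on some systems) that occurs when the target directory still contains entries. The following is a minimal sketch of that behavior only, not SCR's code: the function names try_rmdir and rmdir_demo, the /tmp scratch path, and the file name stray.txt are all hypothetical. One possible explanation, which is an assumption here and not confirmed by the log, is that the subdirectory still holds something that was not removed before rmdir() was called.

```c
#define _XOPEN_SOURCE 700   /* for mkdtemp() under strict compiler modes */
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Try to remove a directory with rmdir().
 * Returns 0 on success, otherwise the errno value reported by rmdir(). */
int try_rmdir(const char *path)
{
    assert(path != NULL);
    if (rmdir(path) == 0) {
        return 0;
    }
    int err = errno;
    fprintf(stderr, "rmdir(%s) failed: %s\n", path, strerror(err));
    return err;
}

/* Hypothetical demo: create a scratch directory holding one stray file,
 * attempt rmdir() (expected to fail), then unlink the file and retry.
 * The two results are stored in *first and *second.
 * Returns 0 if the scratch setup worked, -1 otherwise. */
int rmdir_demo(int *first, int *second)
{
    char dir[] = "/tmp/rmdir_demo_XXXXXX";
    if (mkdtemp(dir) == NULL) {
        return -1;
    }

    char file[sizeof(dir) + 16];
    snprintf(file, sizeof(file), "%s/stray.txt", dir);
    FILE *fp = fopen(file, "w");
    if (fp == NULL) {
        rmdir(dir);
        return -1;
    }
    fclose(fp);

    *first = try_rmdir(dir);    /* directory still holds stray.txt */
    unlink(file);               /* remove the leftover entry */
    *second = try_rmdir(dir);   /* directory is now empty */
    return 0;
}
```

Calling rmdir_demo(&first, &second) from a small main() should leave a nonzero errno value in first and 0 in second. When diagnosing the SCR error, listing the contents of the directory named in the message shows which entries remained at the time of the failed rmdir().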