LLNL / scr

SCR caches checkpoint data in storage on the compute nodes of a Linux cluster to provide a fast, scalable checkpoint / restart capability for MPI codes.
http://computing.llnl.gov/projects/scalable-checkpoint-restart-for-mpi

SW4: error when deleting a directory from the parallel file system #542

Open adammoody opened 1 year ago


Files can be deleted from the parallel file system either by calling SCR_Delete() or by setting SCR_PREFIX_SIZE=N, in which case SCR maintains a sliding window of the N most recent checkpoints. In either case, SCR throws the following error when deleting a checkpoint:

Deleting dataset 2 `cycle=200' from `/p/lustre1/user123/problem'
SCR v3.0.0 ERROR: rank 0 on tioga23: Error deleting directory: /p/lustre1/user123/problem/subdir (rmdir returned -1 Directory not empty) @ /scr-v3.0.1/scr/src/scr_io.c:870