SCR caches checkpoint data in storage on the compute nodes of a Linux cluster to provide a fast, scalable checkpoint / restart capability for MPI codes.
Adam wrote:
While running some test_api tests with a fair number of ranks (32 or 128 on 4 nodes), I'm noticing that it's taking a long time to return from a sync flush, and the time is also highly variable.
He found that the fsync was taking between 1 and 5 minutes.
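To check whether slow and variable fsync times come from the file system rather than from SCR itself, the call can be timed in isolation. The following is only a minimal sketch of such a measurement, not code from SCR or test_api: the helper name `timed_fsync`, the file names, and the 1 MiB payload are all assumptions chosen for illustration.

```python
import os
import tempfile
import time


def timed_fsync(path, payload):
    """Write payload to path, fsync it, and return the fsync time in seconds.

    Hypothetical helper for illustration; not part of SCR.
    """
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()  # push Python's userspace buffer to the OS first
        start = time.perf_counter()
        os.fsync(f.fileno())  # block until the OS reports the data as written
        return time.perf_counter() - start


if __name__ == "__main__":
    # Assumed workload: five 1 MiB files in a temporary directory.
    with tempfile.TemporaryDirectory() as d:
        durations = [
            timed_fsync(os.path.join(d, f"ckpt_{i}.dat"), b"\0" * (1 << 20))
            for i in range(5)
        ]
    print(f"fsync min={min(durations):.4f}s max={max(durations):.4f}s")
```

Pointing the directory at the same storage the flush writes to, and running one copy per rank, would give a rough picture of how much of the delay and variability is attributable to fsync alone.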
See https://github.com/LLNL/scr/issues/449