SCR experiencing slow fsync to lustre1 from quartz

LLNL / scr

SCR caches checkpoint data in storage on the compute nodes of a Linux cluster to provide a fast, scalable checkpoint / restart capability for MPI codes.

http://computing.llnl.gov/projects/scalable-checkpoint-restart-for-mpi



SCR experiencing slow fsync to lustre1 from quartz #450

Closed: ofaaland closed this issue 3 years ago

ofaaland commented 3 years ago

Adam wrote: While running some test_api tests with a fair number of ranks (32 or 128 on 4 nodes), I'm noticing that it's taking a long time to return from a sync flush, and the time is also highly variable.

He found that the fsync was taking between 1 and 5 minutes.
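One rough way to check whether the delay is in fsync itself, independent of SCR, is to time the call directly on a file in the target directory. The sketch below is only an illustration under stated assumptions: it is not SCR's flush path, the helper names `time_fsync` and `fsync_samples` are hypothetical, and the directory passed in is assumed to be on the file system under test.

```python
import os
import tempfile
import time


def time_fsync(directory, size_bytes=1 << 20):
    """Write size_bytes of zeros to a new file in directory and
    return the seconds spent inside os.fsync (hypothetical helper)."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        chunk = b"\0" * min(size_bytes, 1 << 20)
        remaining = size_bytes
        while remaining > 0:
            # os.write may write fewer bytes than requested, so loop.
            remaining -= os.write(fd, chunk[:remaining])
        start = time.perf_counter()
        os.fsync(fd)  # the call whose latency we want to observe
        return time.perf_counter() - start
    finally:
        os.close(fd)
        os.remove(path)


def fsync_samples(directory, trials=5, size_bytes=1 << 20):
    """Repeat the measurement so that variability between runs is visible."""
    return [time_fsync(directory, size_bytes) for _ in range(trials)]
```

Calling `fsync_samples` on a directory in the affected file system and printing the minimum and maximum of the returned list gives a quick view of both the latency and its spread. A single process on one node will not reproduce the load of 32 or 128 ranks, so this only separates "fsync is slow here" from "fsync is slow under SCR's access pattern".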

See https://github.com/LLNL/scr/issues/449

ofaaland commented 3 years ago

wrong ticket system!