LLNL / scr

SCR caches checkpoint data in storage on the compute nodes of a Linux cluster to provide a fast, scalable checkpoint / restart capability for MPI codes.
http://computing.llnl.gov/projects/scalable-checkpoint-restart-for-mpi

SCR for NVRAM #168

Open harellevin opened 4 years ago

harellevin commented 4 years ago

From some digging around on the internet, I saw that SCR can utilize NVRAM devices through external libraries such as CRUISE and PERM. Does SCR provide an internal mechanism to flush checkpoints directly to NVRAM (for example, using `mmap` or PMDK)?

adammoody commented 4 years ago

@harellevin, I think the answer to your question is probably "no", but I'll elaborate a bit.

Both SCR and the application that uses SCR use standard libc/POSIX calls to open/read/write/close files. SCR can thus work with any NVRAM that exposes a file system interface.
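To make the file-interface point concrete, here is a minimal sketch in Python of writing a checkpoint with plain POSIX-style calls. It is not SCR's API. The function names and the file name are hypothetical, and a temporary directory stands in for an NVRAM-backed mount point.

```python
import os
import tempfile


def write_checkpoint(directory, name, payload):
    """Write payload to directory/name with open/write/fsync/close."""
    path = os.path.join(directory, name)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        view = memoryview(payload)
        while view:
            # os.write may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)  # ask the OS to push the data to the device
    finally:
        os.close(fd)
    return path


def read_checkpoint(path):
    """Read the checkpoint back through the same file interface."""
    with open(path, "rb") as f:
        return f.read()


# Assumption: a temporary directory stands in for an NVRAM-backed mount.
nvram_dir = tempfile.mkdtemp()
ckpt_path = write_checkpoint(nvram_dir, "rank_0.ckpt", b"step=42")
restored = read_checkpoint(ckpt_path)
```

Nothing in this code knows what kind of device sits behind the directory, which is why any NVRAM mounted as a file system works without changes.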

CRUISE is a research prototype in which we intercepted libc/POSIX read/write calls from an application and turned them into memcpy calls behind the scenes. This layers a file system interface on top of backing storage that is exposed as memory loads/stores. An application using SCR can then transparently access data that is actually stored in memory, but the approach makes a number of assumptions that might be specific to checkpoint/restart workloads.
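As a rough illustration of that idea (not CRUISE's actual code, which intercepts libc calls), the toy class below offers file-style read/write/seek calls that are served by copying into and out of a preallocated buffer. The class name and capacity are hypothetical.

```python
class MemoryBackedFile:
    """Toy file-like object whose reads and writes become memory copies."""

    def __init__(self, capacity):
        self.buffer = bytearray(capacity)  # stands in for memory-exposed storage
        self.size = 0                      # logical file length
        self.offset = 0                    # current file position

    def write(self, data):
        end = self.offset + len(data)
        if end > len(self.buffer):
            raise OSError("no space left in backing memory")
        self.buffer[self.offset:end] = data  # the "memcpy" into storage
        self.offset = end
        self.size = max(self.size, end)
        return len(data)

    def seek(self, offset):
        if not 0 <= offset <= self.size:
            raise ValueError("offset outside of file")
        self.offset = offset

    def read(self, count):
        end = min(self.offset + count, self.size)
        data = bytes(self.buffer[self.offset:end])  # the "memcpy" out of storage
        self.offset = end
        return data


f = MemoryBackedFile(capacity=64)
f.write(b"checkpoint")
f.seek(0)
roundtrip = f.read(10)
```

The fixed capacity is one example of the kind of simplifying assumption that suits checkpoint workloads, where file sizes are largely known in advance.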

PERM allows one to mark memory regions of an application that get persisted to a memory-mapped file with a flush call. The integration work with SCR was then to apply the cross-node resiliency to those files, but SCR was still accessing that data through file interfaces.
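The general pattern of "update memory, then flush to a memory-mapped file" can be sketched with Python's standard `mmap` module. This is not PERM's interface. The file name and region size are assumptions for illustration.

```python
import mmap
import os
import tempfile

REGION_SIZE = mmap.PAGESIZE  # assumption: a single page is enough here

path = os.path.join(tempfile.mkdtemp(), "region.bin")

# Create the backing file at its full size before mapping it.
with open(path, "wb") as f:
    f.truncate(REGION_SIZE)

state = b"iteration=7"
with open(path, "r+b") as f:
    region = mmap.mmap(f.fileno(), REGION_SIZE)
    region[:len(state)] = state  # ordinary memory store into the mapping
    region.flush()               # write the mapped pages back to the file
    region.close()

# The data is now visible through the regular file interface.
with open(path, "rb") as f:
    persisted = f.read(len(state))
```

After the flush, the file can be handled like any other checkpoint file, which matches how SCR still saw the data through file interfaces.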

So in short, SCR can use NVRAM, but only NVRAM that exposes its data storage through file system calls. More work is required to support native memory loads and stores.

A project related to SCR, called VeloC, provides an interface for an application to register memory regions. The VeloC library then persists those regions to a file, which is in turn protected with cross-node resiliency.
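The register-then-persist pattern might look roughly like the sketch below. It does not use VeloC's API: the class, method names, and file layout are hypothetical, and the cross-node resiliency step is left out entirely.

```python
import os
import struct
import tempfile


class RegionCheckpointer:
    """Toy registry that persists registered buffers to one file."""

    def __init__(self):
        self.regions = {}  # region id -> bytearray owned by the application

    def protect(self, region_id, buf):
        """Register a buffer to be saved on each checkpoint."""
        self.regions[region_id] = buf

    def checkpoint(self, path):
        """Write every region as (id, length, bytes) and fsync the file."""
        with open(path, "wb") as f:
            for region_id, buf in sorted(self.regions.items()):
                f.write(struct.pack("<II", region_id, len(buf)))
                f.write(buf)
            f.flush()
            os.fsync(f.fileno())

    def restart(self, path):
        """Copy saved bytes back into the registered buffers in place."""
        with open(path, "rb") as f:
            while True:
                header = f.read(8)
                if not header:
                    break
                region_id, length = struct.unpack("<II", header)
                self.regions[region_id][:] = f.read(length)


ckpt = RegionCheckpointer()
temperature = bytearray(b"\x01\x02\x03\x04")
ckpt.protect(0, temperature)

ckpt_file = os.path.join(tempfile.mkdtemp(), "regions.ckpt")
ckpt.checkpoint(ckpt_file)

temperature[:] = b"\x00\x00\x00\x00"  # simulate losing the in-memory state
ckpt.restart(ckpt_file)
```

The application only names its memory regions. The library decides when and how they reach a file, and that file is what a resiliency scheme would then protect.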

@kathrynmohror might have more to add.

Does that help?