ECP-VeloC / VELOC

Very-Low Overhead Checkpointing System
http://veloc.rtfd.io
MIT License

VeloC and MPI IO #27

Closed. denisbertini closed this issue 4 years ago

denisbertini commented 4 years ago

Hi, just a beginner question: is it possible to adapt a code that uses MPI collective I/O for its checkpoint files to VeloC? Thanks, Denis

gonsie commented 4 years ago

Hey @denisbertini

You might be able to use the file-based mode that VeloC offers. You call VELOC_Route_file with the file path that you would normally write to, then write to the file path it passes back instead. VeloC then handles the files, moving them around and providing fault tolerance.
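A rough sketch of that file-based flow, with MPI-IO doing the write instead of fwrite, could look like the following. It is untested against any particular VeloC release and rests on several assumptions: the two-argument form of `VELOC_Route_file` (original path in, routed path out; some releases take only the output argument, so check your `veloc.h`), one checkpoint file per rank opened on `MPI_COMM_SELF` because the routed path may point to node-local storage, and made-up names (`plasma`, `veloc.cfg`, the helper `ckpt_file_name`, and the `USE_VELOC` guard, which only exists so the naming helper can be compiled without MPI and VeloC installed).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: build the per-rank file name the application would
 * normally write to. Returns 0 on success, -1 if the buffer is too small. */
int ckpt_file_name(char *buf, size_t len, const char *prefix,
                   int version, int rank)
{
    assert(buf != NULL && len > 0);
    int n = snprintf(buf, len, "%s-%d-%d.dat", prefix, version, rank);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

#ifdef USE_VELOC /* assumption: defined only when building against MPI + VeloC */
#include <mpi.h>
#include <veloc.h>

/* Write `count` doubles from `data` as checkpoint `version` of this rank. */
static int write_checkpoint(const double *data, int count, int version, int rank)
{
    char original[VELOC_MAX_NAME], routed[VELOC_MAX_NAME];
    if (ckpt_file_name(original, sizeof original, "plasma", version, rank) != 0)
        return -1;

    /* Wait for any earlier flush to finish, then open a new checkpoint. */
    if (VELOC_Checkpoint_wait() != VELOC_SUCCESS) return -1;
    if (VELOC_Checkpoint_begin("plasma", version) != VELOC_SUCCESS) return -1;

    /* Ask VeloC which path to write to instead of the original one. */
    int ok = VELOC_Route_file(original, routed) == VELOC_SUCCESS;
    if (ok) {
        MPI_File fh;
        /* Per-rank file, so the communicator is MPI_COMM_SELF. */
        ok = MPI_File_open(MPI_COMM_SELF, routed,
                           MPI_MODE_CREATE | MPI_MODE_WRONLY,
                           MPI_INFO_NULL, &fh) == MPI_SUCCESS;
        if (ok) {
            ok = MPI_File_write(fh, data, count, MPI_DOUBLE,
                                MPI_STATUS_IGNORE) == MPI_SUCCESS;
            MPI_File_close(&fh);
        }
    }
    /* Tell VeloC whether the local write succeeded. */
    return (VELOC_Checkpoint_end(ok) == VELOC_SUCCESS && ok) ? 0 : -1;
}

int main(int argc, char **argv)
{
    int rank;
    double state[4] = {0.0, 1.0, 2.0, 3.0}; /* stand-in for simulation data */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (VELOC_Init(MPI_COMM_WORLD, "veloc.cfg") != VELOC_SUCCESS)
        MPI_Abort(MPI_COMM_WORLD, 1);

    int rc = write_checkpoint(state, 4, 1, rank);

    VELOC_Finalize(0); /* flag meaning: check the API docs for your release */
    MPI_Finalize();
    return rc == 0 ? 0 : 1;
}
#endif
```

Note that a single shared file written collectively by all ranks does not map onto this pattern directly; the sketch assumes the checkpoint can be split into one file per rank.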

denisbertini commented 4 years ago

Hi @gonsie, thanks a lot. I will try to use VeloC for the checkpointing I/O of our plasma physics simulation. It uses MPI-IO extensively, and I hope to get it running. Do you know of any examples that I can look at? Thanks, Denis

gonsie commented 4 years ago

The only examples I know of are the ones provided in the documentation (at the bottom of the page).

denisbertini commented 4 years ago

These examples use fwrite/fread, but I suppose one can substitute the corresponding MPI-IO calls (MPI_File_write/MPI_File_read) in my case ...

bnicolae commented 4 years ago

Is there a particular reason why you need MPI-IO? It should be easier for you to protect the memory regions directly instead of writing them into a file using MPI-IO.

denisbertini commented 4 years ago

Well, the program that I use already relies on MPI-IO for checkpointing and for dumping the data files. Is there some limitation in this case, I mean in combining MPI-IO and VeloC?

denisbertini commented 4 years ago

When you say "protecting the memory regions directly", do you mean what is done in the heatdis_mem example?

bnicolae commented 4 years ago

Yes, this is what I mean. If you are using MPI-IO, then you are already writing to a parallel file system so there is no point in using VELOC. The idea of using VELOC is to checkpoint asynchronously and avoid paying for expensive I/O (which you do if you wait for MPI-IO to finish).
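For reference, a minimal sketch of the memory-based approach, loosely modelled on the pattern of the heatdis_mem example, might look like this. The names `plasma`, `veloc.cfg`, `field`, `should_checkpoint` and the `USE_VELOC` guard are all illustrative assumptions, the computation is a placeholder, and whether the flush actually runs asynchronously depends on how VeloC is configured.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: checkpoint every `interval` iterations,
 * skipping iteration 0. Returns 1 if a checkpoint is due, else 0. */
int should_checkpoint(int iter, int interval)
{
    return interval > 0 && iter > 0 && iter % interval == 0;
}

#ifdef USE_VELOC /* assumption: defined only when building against MPI + VeloC */
#include <mpi.h>
#include <veloc.h>

#define N     1024 /* size of the protected array (illustrative) */
#define STEPS 100  /* number of iterations (illustrative) */

int main(int argc, char **argv)
{
    static double field[N]; /* stand-in for the simulation state */
    int iter = 0;

    MPI_Init(&argc, &argv);
    if (VELOC_Init(MPI_COMM_WORLD, "veloc.cfg") != VELOC_SUCCESS)
        MPI_Abort(MPI_COMM_WORLD, 1);

    /* Register the memory regions that make up a checkpoint. */
    VELOC_Mem_protect(0, &iter, 1, sizeof(int));
    VELOC_Mem_protect(1, field, N, sizeof(double));

    /* If an earlier checkpoint exists, restore iter and field from it. */
    int v = VELOC_Restart_test("plasma", 0);
    if (v > 0 && VELOC_Restart("plasma", v) != VELOC_SUCCESS)
        MPI_Abort(MPI_COMM_WORLD, 1);

    for (; iter < STEPS; iter++) {
        /* Save the state as it is before iteration `iter`; do not
         * re-save the version we have just restarted from. */
        if (should_checkpoint(iter, 10) && iter != v)
            if (VELOC_Checkpoint("plasma", iter) != VELOC_SUCCESS)
                MPI_Abort(MPI_COMM_WORLD, 1);

        for (size_t k = 0; k < N; k++)
            field[k] += 1.0; /* placeholder for the real computation */
    }

    VELOC_Finalize(0); /* flag meaning: check the API docs for your release */
    MPI_Finalize();
    return 0;
}
#endif
```

Here the application never opens a checkpoint file itself: it registers pointers once and calls `VELOC_Checkpoint` at the chosen cadence, which is why no MPI-IO calls appear.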

bnicolae commented 4 years ago

Also, please note that we have a mailing list you can subscribe to: veloc-users@lists.mcs.anl.gov. This is the right place to discuss such considerations. We use GitHub issues primarily for bug reports.