SPECFEM / specfem3d_globe

SPECFEM3D_GLOBE simulates global and regional (continental-scale) seismic wave propagation.
GNU General Public License v3.0

Add fault tolerance support, improve checkpointing/restarting, and also make sure it is not limited to NUMBER_OF_RUNS > 1 #25

Open komatits opened 10 years ago

komatits commented 10 years ago

Add a checkpointing/restarting system with checksum verification, as in SPECFEM3D_GLOBE/tags/v4.1.0_beta_merged_mesher_solver_non_blocking_MPI/src, or use the fault tolerance library developed by Leonardo Bautista and Franck Cappello at INRIA (see the contents of the directory "utils/fault_tolerance").

This should probably be based on ADIOS.

From Craig P Steffen at NCSA:

Make checkpointing/restarting easier to use and not limited to NUMBER_OF_RUNS > 1; i.e., we should be able to use the existing routines to checkpoint even when NUMBER_OF_RUNS == 1, in case the run fails and needs to be restarted. This will be useful on very big machines, in particular for SPECFEM3D_GLOBE.
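
To make the request concrete, here is a minimal sketch of checksum-verified checkpointing that works inside a single run (the NUMBER_OF_RUNS == 1 case). It is not the SPECFEM3D_GLOBE implementation, it does not use FTI or ADIOS, and it is written in Python purely for illustration. The names `save_checkpoint`, `load_checkpoint`, `run`, `checkpoint_every` and `fail_at` are hypothetical, and the "solver state" is a single float standing in for the real wavefield arrays.

```python
import hashlib
import os
import pickle


def save_checkpoint(path, step, state):
    """Write the state atomically, prefixed by a SHA-256 digest of the payload."""
    payload = pickle.dumps({"step": step, "state": state})
    digest = hashlib.sha256(payload).digest()  # always 32 bytes
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(digest + payload)
        f.flush()
        os.fsync(f.fileno())
    # Rename only after the data is on disk, so a crash during the write
    # never destroys the previous valid checkpoint.
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return (step, state), or None if the file is missing or fails the checksum."""
    try:
        with open(path, "rb") as f:
            blob = f.read()
    except FileNotFoundError:
        return None
    digest, payload = blob[:32], blob[32:]
    if hashlib.sha256(payload).digest() != digest:
        return None  # corrupted or truncated checkpoint: ignore it
    data = pickle.loads(payload)
    return data["step"], data["state"]


def run(path, total_steps, checkpoint_every, fail_at=None):
    """Toy time loop that resumes from a valid checkpoint if one exists."""
    restored = load_checkpoint(path)
    step, state = restored if restored is not None else (0, 0.0)
    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated node failure")  # hypothetical failure
        state += 1.0  # stand-in for one solver time step
        step += 1
        if step % checkpoint_every == 0:
            save_checkpoint(path, step, state)
    return state
```

In this sketch, a job that dies mid-run is simply resubmitted with the same arguments: `run` picks up from the last checkpoint that passes the checksum and otherwise starts from step 0. In a real MPI code each rank would write its own file, and the interval would be tuned to the machine's failure rate and I/O cost.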

komatits commented 10 years ago

See also Leonardo Bautista-Gomez, Dimitri Komatitsch, Naoya Maruyama, Seiji Tsuboi, Franck Cappello, Satoshi Matsuoka and Takeshi Nakamura, FTI: high performance Fault Tolerance Interface for hybrid systems, Proceedings of the ACM / IEEE Supercomputing SC'2011 conference, article #32, p. 32:1-32:12, doi: 10.1145/2063384.2063427 (2011).

http://komatitsch.free.fr/preprints/sc2011_Leonardo_Bautista_Gomez.pdf

komatits commented 9 years ago

I will soon implement a basic fault-tolerance (more precisely: fail-safe) mechanism based on an idea from Daniel Peter (@danielpeter).