Open komatits opened 10 years ago
See also Leonardo Bautista-Gomez, Dimitri Komatitsch, Naoya Maruyama, Seiji Tsuboi, Franck Cappello, Satoshi Matsuoka and Takeshi Nakamura, FTI: high performance Fault Tolerance Interface for hybrid systems, Proceedings of the ACM / IEEE Supercomputing SC'2011 conference, article #32, p. 32:1-32:12, doi: 10.1145/2063384.2063427 (2011).
http://komatitsch.free.fr/preprints/sc2011_Leonardo_Bautista_Gomez.pdf
I will soon implement a basic fault-tolerance (more precisely: fail-safe) mechanism based on an idea from Daniel Peter (@danielpeter).
Add a checkpointing/restarting system with checksums, as in SPECFEM3D_GLOBE/tags/v4.1.0_beta_merged_mesher_solver_non_blocking_MPI/src, or use the fault tolerance library developed by Leonardo Bautista and Franck Cappello at INRIA (see the contents of the directory "utils/fault_tolerance").
This should probably be built on top of ADIOS.
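To make the checkpoint-with-checksum idea concrete, here is a minimal sketch in Python. It is not the code from the tag mentioned above, nor the FTI or ADIOS API; the function names `write_checkpoint` and `read_checkpoint`, the file layout (hex digest, newline, payload) and the choice of SHA-256 are all assumptions made for illustration. The point is only that a restart should refuse a checkpoint whose checksum does not match, and that writing to a temporary file and renaming it avoids leaving a half-written checkpoint behind.

```python
import hashlib
import os
import pickle


def write_checkpoint(path, state):
    """Write `state` to `path`, prefixed by a SHA-256 checksum of the payload.

    Hypothetical layout: <64 hex chars> + b"\n" + <pickled state>.
    """
    payload = pickle.dumps(state)
    digest = hashlib.sha256(payload).hexdigest().encode("ascii")
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(digest + b"\n" + payload)
        f.flush()
        os.fsync(f.fileno())  # make sure the data reached the disk
    # Atomic rename: readers see either the old file or the complete new one.
    os.replace(tmp, path)


def read_checkpoint(path):
    """Return the saved state, or None if the file is missing or corrupted."""
    try:
        with open(path, "rb") as f:
            blob = f.read()
    except FileNotFoundError:
        return None
    # The hex digest contains no newline, so the first newline is the separator.
    digest, sep, payload = blob.partition(b"\n")
    if not sep:
        return None
    if hashlib.sha256(payload).hexdigest().encode("ascii") != digest:
        return None  # checksum mismatch: do not restart from this file
    return pickle.loads(payload)
```

In a real MPI run each rank would presumably write its own file (or the I/O layer would handle this), and a restart would fall back to an older checkpoint when `read_checkpoint` returns None.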
From Craig P Steffen at NCSA:
Make checkpointing/restarting easier to use and not limited to NUMBER_OF_RUNS > 1; i.e., on big machines we should be able to use the existing routines to checkpoint even when NUMBER_OF_RUNS == 1, so that a run that fails can be restarted. This will be useful on very big machines, especially for SPECFEM3D_GLOBE.
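A rough sketch of what this request amounts to, again in Python and not taken from the SPECFEM code: a single run (the NUMBER_OF_RUNS == 1 case) that saves its state every few time steps and, if it is relaunched after a failure, resumes from the last saved step instead of step 0. The names `run`, `advance`, `checkpoint_every` and `fail_at` are hypothetical, and `advance` is just a stand-in for one solver time step.

```python
import os
import pickle


def advance(field):
    """Stand-in for one solver time step (hypothetical update rule)."""
    return [0.5 * x + 1.0 for x in field]


def run(nsteps, ckpt_path, checkpoint_every, fail_at=None):
    """Run `nsteps` steps in a single run, checkpointing periodically.

    If `ckpt_path` exists, resume from it. `fail_at` simulates a crash
    just before the given step is computed.
    """
    step, field = 0, [0.0, 1.0, 2.0]
    if os.path.exists(ckpt_path):
        with open(ckpt_path, "rb") as f:
            step, field = pickle.load(f)

    while step < nsteps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated node failure")
        field = advance(field)
        step += 1
        if step % checkpoint_every == 0:
            # Write to a temporary file, then rename atomically.
            tmp = ckpt_path + ".tmp"
            with open(tmp, "wb") as f:
                pickle.dump((step, field), f)
            os.replace(tmp, ckpt_path)
    return field
```

With `checkpoint_every=5` and a failure at step 12, relaunching the same command picks up from step 10, so only two steps of work are lost rather than the whole run.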