geodynamics / aspect

A parallel, extensible finite element code to simulate convection in both 2D and 3D models.
https://aspect.geodynamics.org/

Is checkpointing a significant time sink? #3921

Open bangerth opened 3 years ago

bangerth commented 3 years ago

I'm listening to the annual reviews of the Exascale Computing Project and learned about the VeloC project (https://www.anl.gov/mcs/veloc-very-low-overhead-transparent-multilevel-checkpointrestart) that provides checkpointing services that, for example, do the actual I/O in the background.
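
To make "doing the actual I/O in the background" concrete, here is a minimal sketch in plain C++. It does not use VeloC or any ASPECT code: `write_snapshot` and `checkpoint_in_background` are hypothetical names, and the state is assumed to be a flat vector of doubles on a single process. The idea is only that the solver pays for an in-memory copy while the file write happens on another thread.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <future>
#include <string>
#include <vector>

// Hypothetical helper (not part of ASPECT or VeloC): write a snapshot to
// disk as raw bytes and return the number of bytes written (0 on failure).
// The arguments are taken by value so the writer owns its own copies.
std::size_t write_snapshot(const std::vector<double> snapshot,
                           const std::string         path)
{
  assert(!path.empty());
  std::ofstream out(path, std::ios::binary);
  out.write(reinterpret_cast<const char *>(snapshot.data()),
            static_cast<std::streamsize>(snapshot.size() * sizeof(double)));
  out.flush();
  return out ? snapshot.size() * sizeof(double) : 0;
}

// Hypothetical helper: copy the current state and hand the copy to a
// background thread. The caller can keep modifying `state` right away and
// only needs to call get() on the returned future before the next checkpoint.
std::future<std::size_t>
checkpoint_in_background(const std::vector<double> &state,
                         const std::string         &path)
{
  // std::async copies its arguments, so the snapshot is taken here,
  // before this function returns.
  return std::async(std::launch::async, write_snapshot, state, path);
}
```

A time-stepping loop would call `checkpoint_in_background(state, "some_file")`, continue computing, and call `get()` on the future later. Whether this pays off depends on whether the write or the copy dominates, which is exactly the question below.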

For those of you who have run computations on 10,000 or more processors, is checkpointing a concern on large machines? The amounts of data that need to be written are certainly huge, but I don't know whether it is something that needs to be addressed.

bangerth commented 3 years ago

Follow-up: The good people from VeloC have spent a good amount of time measuring the efficiency of various serialization libraries. As these things go, the one we're using (BOOST) came out at the bottom: it's about 10x slower than the best libraries.

If it turns out that serialization is ever a bottleneck, that's where we ought to look.
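
One way to find out whether serialization is a bottleneck is to time it separately from the disk write. The following is a rough sketch under stated assumptions, not a benchmark of BOOST or of any library in the VeloC comparison: `serialize_as_text` and `serialize_as_bytes` are hypothetical stand-ins for a slow and a fast serializer, and `seconds_to_serialize` is a hypothetical timing helper.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a slow serializer: format every value as text.
std::string serialize_as_text(const std::vector<double> &data)
{
  std::ostringstream out;
  for (const double v : data)
    out << v << ' ';
  return out.str();
}

// Hypothetical stand-in for a fast serializer: one bulk copy of raw bytes.
std::string serialize_as_bytes(const std::vector<double> &data)
{
  std::string buffer(data.size() * sizeof(double), '\0');
  if (!data.empty())
    std::memcpy(&buffer[0], data.data(), buffer.size());
  return buffer;
}

// Wall-clock seconds spent in a single call of `serializer` on `data`.
template <typename Serializer>
double seconds_to_serialize(const Serializer          &serializer,
                            const std::vector<double> &data)
{
  const auto        start  = std::chrono::steady_clock::now();
  const std::string result = serializer(data);
  const auto        stop   = std::chrono::steady_clock::now();
  (void)result; // only the elapsed time is of interest here
  assert(stop >= start); // steady_clock does not run backwards
  return std::chrono::duration<double>(stop - start).count();
}
```

Calling `seconds_to_serialize(serialize_as_text, data)` and `seconds_to_serialize(serialize_as_bytes, data)` on a vector of realistic size, and comparing both against the time of the actual file write, would show which part dominates. The same harness could wrap the real serialization call instead of these stand-ins.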

bangerth commented 3 years ago

See also the table here: https://github.com/fraillt/bitsery