DLR-AMR / t8code

Parallel algorithms and data structures for tree-based adaptive mesh refinement (AMR) with arbitrary element shapes.
https://dlr-amr.github.io/t8code/
GNU General Public License v2.0

Feature request: partition-independent checkpoint/restart #935

Open dutkalex opened 7 months ago

dutkalex commented 7 months ago

Hi!

Another question about t8code's capabilities: would it be possible to write a file to disk containing the current state of the mesh adaptation, so that mesh adaptation can be resumed later?

Such a feature would be very useful IMO, since it would broaden the range of t8code use cases and enable the development of efficient AMR-aware tools for solvers that already use t8code. For example:

It seems to me that such a feature could boil down to dumping all the Morton keys of the refined mesh's elements into a single file, in the order defined by the space-filling curve, since all the information needed to rebuild the mesh could be retrieved quite easily by combining the original coarse mesh file with the "SFC file". What are your thoughts about this?
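To make the "SFC file" idea concrete, here is a minimal C sketch under simplifying assumptions: a 2D quadtree forest, each leaf identified by (coarse tree index, level, Morton key of its anchor corner), anchor coordinates given in units of the finest level, and host byte order. All type and function names (`sfc_record_t`, `morton_encode_2d`, `sfc_pack`, ...) are hypothetical and not part of t8code's API; since t8code supports several element shapes, a real implementation would rely on its own element encodings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record for one leaf of a 2D quadtree forest (illustration
 * only, not t8code's element layout). */
typedef struct {
    int32_t  tree;   /* index of the coarse-mesh tree containing the leaf */
    int8_t   level;  /* refinement level of the leaf */
    uint64_t morton; /* interleaved anchor coordinates, finest-level units */
} sfc_record_t;

/* Packed size of one record; fields are copied one by one so that struct
 * padding never ends up in the byte stream. */
enum { SFC_RECORD_BYTES = sizeof(int32_t) + sizeof(int8_t) + sizeof(uint64_t) };

/* Interleave x and y: bit i of x goes to bit 2i, bit i of y to bit 2i+1. */
uint64_t morton_encode_2d(uint32_t x, uint32_t y) {
    uint64_t key = 0;
    for (int i = 0; i < 32; ++i) {
        key |= (uint64_t)((x >> i) & 1u) << (2 * i);
        key |= (uint64_t)((y >> i) & 1u) << (2 * i + 1);
    }
    return key;
}

/* Inverse of morton_encode_2d: recover the anchor coordinates. */
void morton_decode_2d(uint64_t key, uint32_t *x, uint32_t *y) {
    *x = 0;
    *y = 0;
    for (int i = 0; i < 32; ++i) {
        *x |= (uint32_t)((key >> (2 * i)) & 1u) << i;
        *y |= (uint32_t)((key >> (2 * i + 1)) & 1u) << i;
    }
}

/* Returns 1 if the leaves are in SFC order: by tree, then by Morton key.
 * Non-overlapping leaves never share an anchor, so keys are strictly
 * increasing within a tree. */
int sfc_is_sorted(const sfc_record_t *recs, size_t n) {
    for (size_t i = 1; i < n; ++i) {
        const sfc_record_t *a = &recs[i - 1], *b = &recs[i];
        if (a->tree > b->tree || (a->tree == b->tree && a->morton >= b->morton))
            return 0;
    }
    return 1;
}

/* Serialize n records into buf (n * SFC_RECORD_BYTES bytes, host byte
 * order); returns the number of bytes written. */
size_t sfc_pack(const sfc_record_t *recs, size_t n, unsigned char *buf) {
    unsigned char *p = buf;
    for (size_t i = 0; i < n; ++i) {
        assert(recs[i].level >= 0); /* refinement levels are non-negative */
        memcpy(p, &recs[i].tree, sizeof recs[i].tree);
        p += sizeof recs[i].tree;
        memcpy(p, &recs[i].level, sizeof recs[i].level);
        p += sizeof recs[i].level;
        memcpy(p, &recs[i].morton, sizeof recs[i].morton);
        p += sizeof recs[i].morton;
    }
    return (size_t)(p - buf);
}

/* Deserialize n records previously written by sfc_pack. */
void sfc_unpack(const unsigned char *buf, size_t n, sfc_record_t *recs) {
    const unsigned char *p = buf;
    for (size_t i = 0; i < n; ++i) {
        memcpy(&recs[i].tree, p, sizeof recs[i].tree);
        p += sizeof recs[i].tree;
        memcpy(&recs[i].level, p, sizeof recs[i].level);
        p += sizeof recs[i].level;
        memcpy(&recs[i].morton, p, sizeof recs[i].morton);
        p += sizeof recs[i].morton;
    }
}
```

Packing the leaves in SFC order and writing the buffer with a single `fwrite` would produce the file; reading it back and decoding each key recovers the anchor coordinates, which together with the level and the coarse mesh identify every leaf. A portable format would also fix the byte order instead of using the host's.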

Thanks for the great work you guys do with t8code!
Best regards,
Alex

tim-griesbach commented 7 months ago

Lately, we have been thinking about partition-independent I/O of meshes and mesh-related data on the p4est side, and we would be happy to share upcoming functionality.

One possible way to provide the capabilities described by @dutkalex in the future is scda, a serial-equivalent format for parallel I/O, which will be available in libsc, a submodule of t8code. A first specification of the file format is described in this preprint. A more recent version of the scda API can be found in libsc; we will soon update the preprint to match the adjusted API.

The file format is partition-independent and is also designed for scalable parallel I/O operations using MPI I/O.

> It seems to me like such a feature could boil down to dumping into a single file all the Morton keys of the elements of the refined mesh, in the order defined by the space-filling curve, as all the needed information to rebuild the mesh could be retrieved quite easily with the combination of the original coarse mesh file and the "SFC file". What are your thoughts about this?

All mesh elements can be written to disk in parallel with sc_scda_fwrite_varray, which can use a given contiguous partition of the mesh elements (such as the one induced by a space-filling curve) and supports a variable data size per mesh element. Variable per-element sizes are important for supporting hybrid meshes, and they also enable per-element data compression (section 3 in the preprint).
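Why a contiguous partition is enough for a partition-independent file can be illustrated with a prefix sum: if elements are stored back to back in SFC order, each element's byte offset depends only on the sizes of the elements before it, not on which process owns it. The following C sketch simulates this serially; the function names are hypothetical, and it is a generic illustration of the idea rather than a description of how sc_scda_fwrite_varray is implemented.

```c
#include <assert.h>
#include <stddef.h>

/* Global byte offset of each element in a serial-equivalent file: the
 * exclusive prefix sum of per-element sizes, taken in SFC order. */
void element_offsets(const size_t *sizes, size_t n, size_t *offsets) {
    size_t acc = 0;
    for (size_t i = 0; i < n; ++i) {
        offsets[i] = acc;
        acc += sizes[i];
    }
}

/* Byte offset at which each rank starts writing, for a contiguous partition
 * in which rank p owns counts[p] consecutive elements. Simulated serially
 * here; in an MPI program each rank would sum its local sizes and obtain its
 * start with MPI_Exscan (rank 0 uses 0, since MPI_Exscan leaves its result
 * undefined there), then write collectively, e.g. with MPI_File_write_at_all. */
void rank_offsets(const size_t *sizes, size_t n, const size_t *counts,
                  int nranks, size_t *rank_start) {
    size_t elem = 0, acc = 0;
    for (int p = 0; p < nranks; ++p) {
        rank_start[p] = acc;
        for (size_t k = 0; k < counts[p]; ++k) {
            assert(elem < n);
            acc += sizes[elem++];
        }
    }
    assert(elem == n); /* the partition must cover every element exactly once */
}
```

For example, with element sizes {8, 3, 5, 0, 12}, splitting the elements 2+3 across two ranks or 1+1+3 across three ranks yields rank start offsets {0, 11} and {0, 8, 11}, which coincide with the per-element offsets {0, 8, 11, 16, 16}. Every element therefore lands at the same byte position whatever the partition, which is what makes the file serial-equivalent.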

The scda file format allows a file to consist of an arbitrary number of so-called file sections of different types (section 2 in the preprint). In particular, one could write mesh-related application data to the same file. Moreover, one can write global data to a file (cf. section 2.4 in the preprint and sc_scda_fwrite_block). Analogous functions exist for reading the respective file sections, also using parallel I/O where applicable.
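A file made of several typed sections can be navigated by skipping from one section header to the next. The C sketch below uses a toy layout (a 1-byte type tag, an 8-byte length, then the payload) purely for illustration; it is not the scda layout described in the preprint, and the function names and the tags 'G' (global data) and 'E' (element data) are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy section layout, NOT the scda specification: a 1-byte type tag, an
 * 8-byte payload length in host byte order, then the payload itself. */
enum { SECTION_HEADER_BYTES = 1 + sizeof(uint64_t) };

/* Writes one section at buf + pos and returns the next write position.
 * The caller must make sure buf is large enough. */
size_t section_append(unsigned char *buf, size_t pos, char type,
                      const void *payload, uint64_t len) {
    buf[pos] = (unsigned char)type;
    memcpy(buf + pos + 1, &len, sizeof len);
    memcpy(buf + pos + SECTION_HEADER_BYTES, payload, (size_t)len);
    return pos + SECTION_HEADER_BYTES + (size_t)len;
}

/* Returns the payload of the k-th section (counting from 0) with the given
 * type tag and stores its length in *len, or NULL if there is no such
 * section or the buffer is truncated. Other sections are skipped unread. */
const unsigned char *section_find(const unsigned char *buf, size_t size,
                                  char type, int k, uint64_t *len) {
    assert(k >= 0);
    size_t pos = 0;
    while (pos + SECTION_HEADER_BYTES <= size) {
        uint64_t l;
        memcpy(&l, buf + pos + 1, sizeof l);
        if (l > size - pos - SECTION_HEADER_BYTES)
            return NULL; /* length field points past the end: truncated */
        if ((char)buf[pos] == type && k-- == 0) {
            *len = l;
            return buf + pos + SECTION_HEADER_BYTES;
        }
        pos += SECTION_HEADER_BYTES + (size_t)l;
    }
    return NULL;
}
```

With this kind of layout, a reader that only needs the element data can skip a global-data section without parsing its contents, and application data can be appended as further sections of the same file.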

All in all, we think that scda will enable checkpoint/restart functionality in t8code without requiring many additions to t8code itself. We welcome feedback, discussion, and suggestions for further specifying scda where required!

dutkalex commented 7 months ago

@tim-griesbach this looks very promising! This is completely outside my area of expertise, but from an outside perspective I think it would be great if such support were provided in libsc, because it would benefit more than just t8code's users.