HDF5 checkpoint format - Githubissues

Helmholtz-AI-Energy / propulate

Propulate is an asynchronous population-based optimization algorithm and software package for global optimization and hyperparameter search on high-performance computers.

https://doi.org/10.1007/978-3-031-32041-5_6

BSD 3-Clause "New" or "Revised" License

32 stars 7 forks source link

HDF5 checkpoint format #8

Open mcw92 opened 1 year ago

mcw92 commented 1 year ago

Implement HDF5 checkpointing instead of pickles for better interoperability.

Markus-Goetz commented 1 year ago

This is needed for checkpoints for long running individuals (exceeding a singular job run)

oskar-taubert commented 1 year ago

Add run config to the checkpoint and validate it on resume. For now a run with inconsistent migration topology or number/distribution of workers should probably warn and crash. In the future it might set up a fresh checkpoint as to preserve and separate the different runs.