FLAMEGPU / FLAMEGPU2

FLAME GPU 2 is a GPU-accelerated agent-based modelling framework for CUDA C++ and Python
https://flamegpu.com
MIT License

Ensemble Checkpointing #807

Open ptheywood opened 2 years ago

ptheywood commented 2 years ago

For large ensembles, resuming a partially complete ensemble could be very useful.

I.e. to support pre-emptible jobs on HPC, to run ensembles which would otherwise hit an HPC runtime limit, or to finish a mostly complete ensemble which was interrupted for some reason, such as power loss.

For ensembles this is simpler than for individual simulations, which may include non-determinism (see #806): the ensemble can simply re-start any simulations which did not finish, while skipping those which have already completed.

I.e. if enabled, save the RunPlanVec state to disk, and resume from it if it is found on disk?

There are probably a number of edge cases to worry about (ensemble logging etc.), but this shouldn't be too difficult for ensembles.

This should probably be an opt-in feature, as it may not be feasible in all cases.

This should be much less costly than simulation checkpointing, as there should be no GPU-CPU copies required, just periodic saving of ensemble state.

Maybe a new CLI flag to enable the use of this feature, for both outputting and resuming.
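The resume mechanism described above could be sketched roughly as follows. This is a minimal, hypothetical sketch in plain Python: the progress-file format and the function names (`load_completed`, `pending_runs`, `mark_completed`) are assumptions for illustration, not part of the FLAMEGPU API.

```python
import os

def load_completed(progress_path):
    """Read the set of completed run indices left by a prior, interrupted run.

    Returns an empty set if no progress file exists (i.e. a fresh start).
    """
    if not os.path.exists(progress_path):
        return set()
    with open(progress_path) as f:
        return {int(line) for line in f if line.strip()}

def pending_runs(n_runs, progress_path):
    """Indices into the RunPlanVec that still need to be simulated."""
    done = load_completed(progress_path)
    return [i for i in range(n_runs) if i not in done]

def mark_completed(progress_path, run_index):
    """Append a finished run's index; called as each simulation completes."""
    with open(progress_path, "a") as f:
        f.write(f"{run_index}\n")
```

On a fully successful ensemble the progress file would be deleted, so the next invocation starts from scratch; appending one line per completed run keeps the per-simulation bookkeeping cost negligible, consistent with the "no GPU-CPU copies" point above.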

Robadob commented 2 years ago

This shouldn't be too difficult. Check the status of the logged outputs, and then erase the start of your RunPlanVec. Not particularly automated, but hardly complicated either. Adding a CUDAEnsemble flag which skips the first N elements of the RunPlanVec should be trivial to do.

In terms of an output to automate this, just create a basic plaintext file which records each completed RunPlan ID on its own line at completion. This could then be automatically deleted on success. It could be useful to keep this file regardless, though, as we currently don't log model crashes anywhere but the console. So an ensemble log file which includes crash details would be useful.
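The per-run log with crash details could look something like the following sketch. Again this is purely illustrative: the tab-separated line format, status strings, and helper names are assumptions, not an agreed design.

```python
import os

def record_run(log_path, run_id, status, detail=""):
    # One line per finished (or crashed) run: "<id>\t<status>\t<detail>"
    with open(log_path, "a") as f:
        f.write(f"{run_id}\t{status}\t{detail}\n")

def summarise(log_path):
    """Split the log into completed run IDs and crashed runs with details."""
    completed, crashed = set(), {}
    with open(log_path) as f:
        for line in f:
            run_id, status, detail = line.rstrip("\n").split("\t", 2)
            if status == "OK":
                completed.add(int(run_id))
            else:
                crashed[int(run_id)] = detail
    return completed, crashed

def finalise(log_path, n_runs):
    # Delete the log only on a fully successful ensemble; keep it
    # otherwise, so crash details survive for inspection and resume.
    completed, crashed = summarise(log_path)
    if len(completed) == n_runs and not crashed:
        os.remove(log_path)
```

Keeping the file whenever any run failed gives exactly the persistent crash record mentioned above, while a clean ensemble leaves nothing behind.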

Related: https://github.com/FLAMEGPU/FLAMEGPU2/issues/440