Open MSiggel opened 6 years ago
For runs that take longer than the set job runtime, it would be nice to support restarting jobs (similar to the restart mechanism of MD engines). How often a restart file is written should be determined by the user.
Ideally, the command to start a simulation is the same for the first run and a restart run. This makes it easy to resubmit jobs. I suggest a flag like --with-restart: if capriqorn is started with that flag, it looks for a restart file and tries to continue the run. If no file is found, the run is started from the beginning and restart files are generated.
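To make the proposed behaviour concrete, here is a minimal sketch of the look-for-a-restart-file-or-start-fresh logic. It is not capriqorn code: the function names, the JSON checkpoint format, and the `checkpoint_every` and `max_frames` parameters are all assumptions for illustration, and `max_frames` only emulates a job being killed at the walltime limit.

```python
import json
import os


def write_checkpoint(path, state):
    """Write the state to a temporary file, then atomically replace the old one."""
    tmp = path + ".tmp"
    with open(tmp, "w") as fh:
        json.dump(state, fh)
    os.replace(tmp, path)  # an interrupted write never corrupts the old checkpoint


def run(frames, checkpoint_path, with_restart=False, checkpoint_every=100,
        max_frames=None):
    """Sum the values in `frames`, resuming from a checkpoint if one exists.

    Returns the total on completion, or None if the run was cut short.
    """
    state = {"next_frame": 0, "total": 0}
    if with_restart and os.path.exists(checkpoint_path):
        with open(checkpoint_path) as fh:
            state = json.load(fh)  # continue where the previous run stopped

    done_this_run = 0
    for i in range(state["next_frame"], len(frames)):
        if max_frames is not None and done_this_run >= max_frames:
            return None  # emulate a killed job: no final checkpoint is written
        state["total"] += frames[i]  # stand-in for the real per-frame work
        state["next_frame"] = i + 1
        done_this_run += 1
        if with_restart and state["next_frame"] % checkpoint_every == 0:
            write_checkpoint(checkpoint_path, state)

    if with_restart:
        write_checkpoint(checkpoint_path, state)
    return state["total"]
```

The same call can be submitted repeatedly: the first invocation finds no checkpoint and starts at frame 0, while later ones pick up at the last checkpointed frame. Work done after the last checkpoint is repeated, so the interval trades I/O cost against lost computation.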
I agree that this feature would be nice to have.
However, recently implemented features should already make life a lot easier when working with batch systems that impose time limits: the runtime estimation written by the sum worker, and the flush feature, which safely writes any data computed so far to HDF5.
Restart should be implemented such that the output HDF5 file from the first run is scanned, and the run is then continued at the right frame, writing into the same HDF5 output file. No separate restart files seem necessary.
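A rough sketch of that idea, using h5py and NumPy: the output file itself records how far the run got, so a second run can continue from there. The dataset name `histogram`, the `next_frame` attribute, `N_BINS`, and the `flush_every` and `max_frames` parameters are assumptions made up for this example and do not reflect capriqorn's actual file layout.

```python
import h5py
import numpy as np

N_BINS = 10  # hypothetical number of histogram bins


def accumulate(frames, h5_path, flush_every=100, max_frames=None):
    """Accumulate per-frame histograms into one HDF5 file, resuming if it exists.

    Returns the index of the next frame that still needs processing.
    """
    done = 0
    with h5py.File(h5_path, "a") as f:  # "a": open if present, else create
        if "histogram" not in f:
            f.create_dataset("histogram", data=np.zeros(N_BINS, dtype=np.int64))
            f.attrs["next_frame"] = 0
        dset = f["histogram"]
        hist = dset[...]                      # in-memory copy of saved counts
        start = int(f.attrs["next_frame"])    # scan the file for the resume point
        for i in range(start, len(frames)):
            if max_frames is not None and done >= max_frames:
                break  # emulate hitting the walltime limit
            counts, _ = np.histogram(frames[i], bins=N_BINS, range=(0.0, 1.0))
            hist += counts
            done += 1
            if (i + 1) % flush_every == 0 or i + 1 == len(frames):
                dset[...] = hist
                f.attrs["next_frame"] = i + 1
                f.flush()
        return int(f.attrs["next_frame"])
```

Note that writing the dataset and the attribute is not atomic here, so a crash between the two could leave them inconsistent. A real implementation would need to guard against that, for example in the way the existing flush feature writes data safely.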
Currently, as far as I am aware, there is no way to restart histogram calculations that haven't completed. With large trajectories and system sizes, it has often happened to me that the walltime was insufficient to complete my runs, and then nothing was written to the h5 file. It would be useful if one could set a step size after which the histogram.h5 file is written, similar to restarts in MD simulations.