TUM-DAML / seml

SEML: Slurm Experiment Management Library

Feature request: restaging killed experiments without deleting info #101

Closed AnnevanGils closed 10 months ago

AnnevanGils commented 1 year ago

I run experiments on a cluster where every job is killed automatically after 24h, so I'm looking for a way to restage killed experiments and continue training without having fields such as info and captured_out deleted from MongoDB. Is there any functionality in SEML at the moment that I can look into for this purpose? Any directions would be greatly appreciated.

n-gao commented 1 year ago

To reschedule killed experiments you could first reset and then start the experiments again:

seml <collection> reset start

However, this will overwrite several fields such as captured_out. At the moment, the keys that are preserved are hardcoded in https://github.com/TUM-DAML/seml/blob/e340dc4de0c472839fdda811e8b90d69fd6e54e0/seml/manage.py#L248. I am not sure how fields like captured_out behave on the sacred side of things. Which properties do you need to be preserved?
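To illustrate what a whitelist-based reset means for a stored experiment, here is a minimal sketch that works on a plain dictionary rather than on the database. The key list `KEPT_KEYS`, the function `reset_document`, and the example document are hypothetical and chosen only for illustration; the actual list of preserved keys is the one hardcoded in the linked `manage.py`.

```python
# Hypothetical whitelist for illustration only; the real list is hardcoded
# in seml/manage.py (see the link above) and may differ.
KEPT_KEYS = ("_id", "batch_id", "config", "seml")


def reset_document(doc, kept_keys=KEPT_KEYS):
    """Return a new dict holding only the whitelisted keys, marked as STAGED."""
    fresh = {key: doc[key] for key in kept_keys if key in doc}
    fresh["status"] = "STAGED"
    return fresh


# Made-up example of a killed experiment entry.
killed = {
    "_id": 7,
    "batch_id": 1,
    "config": {"lr": 0.01},
    "seml": {"executable": "train.py"},
    "status": "KILLED",
    "captured_out": "epoch 12 ...",
    "info": {"checkpoint": "ckpt_12.pt"},
}

staged = reset_document(killed)
print(sorted(staged))
```

Anything outside the whitelist, including `captured_out` and `info` in this example, is dropped, which is why a plain reset loses the information asked about in this issue.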

AnnevanGils commented 1 year ago

Thanks a lot for your reply. Besides captured_out, I would like to preserve any properties that the code adds during the run (in my case this property is named info) and any information about previous runs, such as start_time (and stop_time if it exists). The latter would need to be structured differently so that it can hold multiple start times, one per run.

I can imagine the main difference lies in the use case for the reset command. When resetting an experiment, you generally do not want to preserve info from previous (failed) runs, as that would be undesirable. That's why I was looking for a restage or continue-like command that resets only those properties that absolutely must be reset for seml or sacred to function, and keeps or appends to all others. I have noticed that only changing the status from KILLED to STAGED is not sufficient for seml, so I can imagine this would be somewhat complicated to implement.

In any case thanks for the reply, I will try to figure out a different way to preserve info.
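The restage behaviour described in this comment could be sketched roughly as follows, again on a plain dictionary. This is not part of seml: `restage_document`, `RUN_FIELDS`, and the `previous_runs` field are hypothetical names, and the sketch only shows one possible data layout. As noted above, changing the status alone is not sufficient for seml, so this does not by itself make an experiment runnable again.

```python
import copy

# Fields assumed to describe a single run. Only start_time, stop_time and
# captured_out are mentioned in the thread; the grouping is an assumption.
RUN_FIELDS = ("start_time", "stop_time", "status", "captured_out")


def restage_document(doc, run_fields=RUN_FIELDS):
    """Return a copy of doc set to STAGED, with per-run fields archived.

    Per-run fields are moved into a new entry of the hypothetical
    `previous_runs` list; everything else (for example `info`) is kept.
    """
    restaged = copy.deepcopy(doc)
    archived = {key: restaged.pop(key) for key in run_fields if key in restaged}
    restaged.setdefault("previous_runs", []).append(archived)
    restaged["status"] = "STAGED"
    return restaged


# Made-up example of an experiment killed by the 24h limit.
killed = {
    "_id": 7,
    "config": {"lr": 0.01},
    "status": "KILLED",
    "start_time": "2023-01-01T00:00:00",
    "captured_out": "epoch 12 ...",
    "info": {"checkpoint": "ckpt_12.pt"},
}

restaged = restage_document(killed)

# Simulate a second run that is killed again, then restage once more.
second = dict(restaged, status="KILLED", start_time="2023-01-02T00:00:00")
restaged_twice = restage_document(second)
print(len(restaged_twice["previous_runs"]))
```

Keeping one archived entry per run avoids overloading a single start_time field and leaves `info` in place so that training code can pick up its own checkpoint reference.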

n-gao commented 10 months ago

I think this is out of scope and belongs to sacred. If you'd like to implement such a feature, please reopen this issue and open a PR.