TUM-DAML / seml

SEML: Slurm Experiment Management Library

Feature request: Automatic hyperparameter optimization #123

Open heborras opened 8 months ago

heborras commented 8 months ago

Hi all,

a use case for which I've been using seml a lot is grid searches for finding good hyperparameters for neural nets. It would be great if one could automate this process somewhat, since at the moment one has to go from a coarse-grained search to a fine-grained one in multiple dimensions at the same time, which usually involves lots of looking at plots and running analysis scripts. It would be nice if I could instead basically tell seml: please run a thousand experiments and find a good hyperparameter config within these parameter ranges. With a good hyperparameter searcher, the result would likely be better than what I could have come up with. I know seml is somewhat agnostic to what kind of program one runs and is not limited to neural nets, but many published hyperparameter searchers are agnostic in the same way (though many do focus on NNs), so I think fitting algorithms can be found in the literature.

So I was wondering: Do you think there is a way to extend seml in such a way that it could support such hyperparameter search algorithms? Or rather: Do you see this as a possible extension in the future? Or is this out of scope for seml?

Some general thoughts of mine on this: one would obviously need to move away from the somewhat static principle of returning results only at the end of a run, and one would probably also need to introduce a way for seml to dynamically create/fetch hyperparameters as slots in Slurm become available. These are likely somewhat drastic changes to the codebase, but some of the fundamentals already exist in sacred (even though, if I understood the general sentiment correctly, seml wants to move away from sacred in the long run).

A little bit more context: In the past I've been a big fan of what determined.ai does with their Adaptive ASHA searcher: https://docs.determined.ai/latest/model-dev-guide/hyperparameter/search-methods/hp-adaptive-asha.html But their system doesn't play nicely with Slurm on a conceptual level. What Weights and Biases does is probably conceptually more in line with how seml works: https://docs.wandb.ai/tutorials/sweeps#-pick-a-method In general, however, the field of automatic hyperparameter optimization has been a very active one, and I think one of the most feature-complete suites of searchers is Syne Tune, which is, however, tightly coupled to AWS: https://github.com/awslabs/syne-tune Still, maybe one can use one of these as an initial springboard to get started.

I'd be happy to put some dev time into this in the next year, since I am seeing many more hyperparameter searches with seml in my near future. That is, however, dependent on whether there is enough interest in such a feature and whether there is willingness to maintain it afterwards as well. So if this were to happen, I think some good coordination would be required so that everyone is on the same page.

n-gao commented 8 months ago

Hi @heborras, Thanks a lot for your great interest in seml! :) I am not very familiar with automatic hyperparameter searches, but the current infrastructure of seml does not fully allow for this. That doesn't mean it cannot be accomplished, but it will take some effort. This is linked to #32: while #32 is about acyclic graphs, we need a more general implementation that enables arbitrary conditions, including stopping conditions. For me, it makes sense to kill two birds with one stone and implement #32 while we're at it.

To store the state of the HP search, I could imagine implementing this with a "special" collection that manages the state of the searches. Then, at the end of each experiment, a small script would have to be executed to check what the most sensible next jobs would be. Obviously, one would have to check for concurrency issues etc. if multiple jobs terminate at the same time. My main concern here is that such a special collection would feel kinda out of place for seml. However, including the hyperparameter search state in the collection itself would require treating a special document everywhere.
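
A rough sketch of what such an end-of-experiment script could look like, assuming a hypothetical `hp_search_state` collection and a placeholder random-search proposer (none of these names exist in seml today):

```python
import random
from pymongo import MongoClient

def propose_next_configs(trials, n=1):
    # Placeholder searcher: plain random search over a log-uniform learning rate.
    # A real implementation would plug in ASHA, Bayesian optimization, etc.
    return [{"lr": 10 ** random.uniform(-5, -1)} for _ in range(n)]

def on_experiment_end(search_id, config, metric_value):
    db = MongoClient("mongodb://localhost:27017")["seml"]
    state = db["hp_search_state"]  # hypothetical collection holding the search state

    # Append the finished trial to the shared search state.
    state.update_one(
        {"_id": search_id},
        {"$push": {"trials": {"config": config, "metric": metric_value}}},
        upsert=True,
    )

    # Ask the searcher for follow-up configurations based on all trials so far
    # and queue them as new experiment documents.
    trials = state.find_one({"_id": search_id})["trials"]
    for cfg in propose_next_configs(trials):
        db["my_experiments"].insert_one({"config": cfg, "status": "QUEUED"})
```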

In terms of the definition in a YAML file, this would probably be a separate definition type: next to fixed and grid, one could define search.
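
Something along these lines, purely as a hypothetical sketch (none of these keys exist in seml's config format today):

```yaml
# Hypothetical syntax, not implemented in seml.
search:
  algorithm: asha        # which searcher drives the optimization
  metric: val_accuracy   # value the experiment reports back to the searcher
  mode: max
  max_trials: 1000
  params:
    lr:
      type: loguniform
      min: 1.0e-5
      max: 1.0e-1
    hidden_dim:
      type: choice
      options: [64, 128, 256]
```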

If you are willing to work on this, that would be amazing! We can have a chat discussing how one would go about implementing this! :)

heborras commented 7 months ago

Hi @n-gao,

thanks for the really quick reply :) I've been an avid seml user ever since we worked on a large hyperparameter search some time last year for the following paper: https://arxiv.org/pdf/2212.10430.pdf

If I understand #32 correctly, it seems that this feature might be achievable without any changes to the collection / database entries. The feature request is mainly about seml being able to build the workflow graph, figure out which experiments have already been completed (by some previous run), and then execute the graph as multiple submissions to Slurm with the dependency argument of sbatch, so that Slurm takes care of the dependencies automatically. afterok is probably the correct dependency type to use here: https://slurm.schedmd.com/sbatch.html#OPT_afterok:job_id[:jobid...]
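
Just to illustrate the mechanism with two hypothetical batch scripts (seml would of course have to generate these submissions itself):

```python
import subprocess

# Submit the prerequisite job and capture its job id (--parsable prints only the id).
pretrain = subprocess.run(
    ["sbatch", "--parsable", "pretrain.sbatch"],
    capture_output=True, text=True, check=True,
)
pretrain_job_id = pretrain.stdout.strip()

# Submit the dependent job; Slurm only starts it once the prerequisite
# has terminated successfully (afterok).
subprocess.run(
    ["sbatch", f"--dependency=afterok:{pretrain_job_id}", "finetune.sbatch"],
    check=True,
)
```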

And I think that this would indeed be a very interesting feature to have, as one could easily automate prerequisites, such as having pre-trained models created automatically if required. However, I'm not quite sure whether that feature overlaps much with the HP search feature, because for an HP search one would want as few inter-dependencies as possible: ideally, a new experiment with new parameters would start as soon as resources become available, even if none or few of the previous experiments are complete.

Where to store the HP searcher state is indeed a tricky question, and I personally haven't come to a good conclusion yet either. Additionally, for features like early stopping, most HP searchers require full knowledge of the current progress of all workers in the pool. So one would either need all experiments to run their own search manager, which updates regularly from the MongoDB, or one would need a single central arbiter, which might be part of one of the Slurm jobs already running and gets passed along as jobs terminate.

As for concurrency issues with multiple jobs terminating or early stopping at the same time, this can probably be handled by synchronizing via the MongoDB, as there are locking mechanisms at pretty much every hierarchy level, down to the document level. But yeah, it's something one needs to pay attention to.
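
For example, claiming the next proposed config could be done with a single atomic find_one_and_update, so that two jobs finishing at the same time never grab the same one (collection and field names are again just placeholders):

```python
from pymongo import MongoClient, ReturnDocument

db = MongoClient("mongodb://localhost:27017")["seml"]

# Atomically claim one proposed configuration: filter and update are applied as a
# single document-level operation, so two workers cannot claim the same document.
claimed = db["hp_proposals"].find_one_and_update(
    {"status": "PROPOSED"},
    {"$set": {"status": "CLAIMED"}},
    return_document=ReturnDocument.AFTER,
)
if claimed is not None:
    print("next config to run:", claimed["config"])
```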

Having a chat about this would indeed be amazing. Would maybe Thursday or Friday next week work for you?