automl / HpBandSter

a distributed Hyperband implementation on Steroids
BSD 3-Clause "New" or "Revised" License

Saving the state? #56

Open netheril96 opened 5 years ago

netheril96 commented 5 years ago

I am surveying different packages for hyperparameter optimization, and HpBandSter seems promising, especially because of its support for distributed training. But one thing I haven't figured out is how the master handles interruption. Typically training a model takes a long time, so the master must stay alive even longer (it has to outlive all workers combined). What happens when the master crashes or is preempted?

sfalkner commented 5 years ago

Then the whole optimization run will crash. You will be able to resume it if you logged the intermediate results. Resuming here means that the master can rebuild the same model as before, but jobs that were running on any workers will not be recovered. If you are asking because you want to run everything on a cluster with a fairly strict time limit on jobs, I recommend running the master either on the login node or on some other machine that is reachable from the compute nodes. Usually, the master doesn't crash. We had runs lasting several days, up to two weeks I think, without any major problems.
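The log-and-resume pattern described above can be sketched in plain Python. This is only an illustration of the idea, not HpBandSter's actual API (HpBandSter ships its own JSON-based result logger for this purpose): each finished run is appended to a log that survives a master crash, and a restarted master reloads the log to rebuild its view of past results, while in-flight jobs are simply lost.

```python
import json
import os
import tempfile

def log_result(path, run_id, config, loss):
    """Append one finished run to a JSON-lines log so it survives a crash."""
    with open(path, "a") as f:
        f.write(json.dumps({"run_id": run_id, "config": config, "loss": loss}) + "\n")

def load_results(path):
    """Rebuild the master's view of completed runs after a restart."""
    if not os.path.exists(path):
        return []
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

# Simulate a master that logs two finished runs, "crashes", then resumes.
log = os.path.join(tempfile.mkdtemp(), "results.jsonl")
log_result(log, 0, {"lr": 0.01}, 0.42)
log_result(log, 1, {"lr": 0.10}, 0.35)

# A new master process reloads the log and can rebuild its model from it;
# any jobs that were still running on workers at crash time are not recovered.
history = load_results(log)
print([r["loss"] for r in history])  # the two logged losses
```

The key design point is append-only persistence: the master never needs to rewrite the log, so a crash mid-write can corrupt at most the last line, which the loader can skip.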