automl / DEHB

https://automl.github.io/DEHB/
Apache License 2.0

Adhering to computation cost budget better #30

Closed Bronzila closed 2 months ago

Bronzila commented 1 year ago

The current implementation waits for all started jobs once the runtime budget is exhausted. This makes sense when the budget is given in function evaluations or number of iterations, but not when a maximum computation cost in seconds is specified.

Toy failure mode: the computational budget is 1 h, but a new job that would take e.g. 30 min is submitted after 59 min of optimization. The optimizer then waits for this job to finish and therefore overshoots the maximum computational budget of 1 h.

For now, a quick fix could be to simply stop all workers when the runtime budget is exhausted; however, this would potentially lose compute time. It might therefore also be interesting to think of a way to checkpoint the optimizer's state in order to resume later.
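A minimal sketch of the checkpointing idea, just to make it concrete. The state dict, its fields, and the file name are made up for illustration; DEHB's actual resumable state would include the DE populations, fitness values, and budget bookkeeping:

```python
import os
import pickle
import tempfile

# Hypothetical optimizer state; the keys here are placeholders,
# not DEHB's real internals.
state = {"iteration": 42, "population": [[0.1, 0.9], [0.4, 0.2]], "history": []}

path = os.path.join(tempfile.gettempdir(), "dehb_checkpoint.pkl")
with open(path, "wb") as f:
    pickle.dump(state, f)

# On restart, load the snapshot and continue from where we stopped.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored["iteration"])  # → 42
```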

eddiebergman commented 1 year ago

You could look at this answer using sched: https://stackoverflow.com/a/474543/5332072

From there, dask has a way to essentially shutdown() the Client and then close() it.
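A small sketch of the sched approach from that answer: schedule a shutdown event at the end of the runtime budget. The `shut_down` callback is a stand-in; in DEHB it would cancel outstanding futures and close the Client:

```python
import sched
import time

events = []

def shut_down():
    # stand-in: here DEHB would cancel futures and close the dask Client
    events.append("shutdown")

s = sched.scheduler(time.monotonic, time.sleep)
s.enter(0.1, 1, shut_down)  # 0.1 s stands in for the real runtime budget
s.run()  # blocks until all scheduled events have fired

print(events)  # → ['shutdown']
```

In practice the scheduler would run on its own thread (or the deadline would be checked in the hot loop) so it doesn't block the optimizer.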

Neeratyoy commented 1 year ago

@eddiebergman What do we do with the interrupted evaluation? Assuming the evaluation is a deep learning model training, is it okay to still exceed the runtime in order to trigger saving the current state? @Bronzila feel free to share your thoughts too.

eddiebergman commented 1 year ago

Based on a lookover, the "hot-loop" is here, with the break condition here: https://github.com/automl/DEHB/blob/54ce41c4c516e38aefc5944a2b677b95cfa2e05a/dehb/optimizers/dehb.py#L750-L751


To return on time

I would probably do something along the lines of the following for the dask case; this should basically kill all jobs running in dask and wait for all of them to return. The wait part isn't fully necessary, but in principle it should be fine.

# cancel outstanding work first, wait for it to settle, then close the client
for future in self.futures:
    future.cancel()

distributed.wait(self.futures)  # dask futures need dask's own wait()
self.client.close()

Dask has the property that you can cancel running jobs, but in the non-dask case (here), where the function is just called directly, you can't cancel it because it's in the same process. Killing it would mean killing the whole thing.

https://github.com/automl/DEHB/blob/54ce41c4c516e38aefc5944a2b677b95cfa2e05a/dehb/optimizers/dehb.py#L572-L574

To circumvent this, you would need to run it in a subprocess of some kind and use psutil to effectively kill the process.


To inform the process so you can save

This is much harder, especially when you don't control the target function. The first thing you need is a handle on the process running the target function; then you can send a SIGTERM to that process with .terminate().

process = psutil.Process(<process-id of the thing to signal>)
process.terminate() 

The correct procedure here by OS standards is to clean up the program and finish soon. The way to do this is Python's signal module, more specifically this function:

import signal

def callback(signal_num, frame) -> None:
    # ... cleanup, save a model, whatever
    ...

signal.signal(signal.SIGTERM, callback)

The tricky part is that users have to specify this themselves, i.e. their target function is going to be called, and this callback has to be registered once inside the process that is running the target function. I do not know how you'd like to do that; I think your best approach is simply to give an example and move on. Trying to handle this automatically would be a nightmare to build and maintain.
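For such an example, something like this could go at the top of the user's target function. The `os.kill` call here just simulates the optimizer sending SIGTERM from outside; note that on Windows signal delivery works differently, so this sketch assumes POSIX:

```python
import os
import signal

checkpoints = []

def on_sigterm(signal_num, frame):
    # stand-in for real cleanup, e.g. saving model weights to disk
    checkpoints.append("saved")

def target_function(config):
    # the user registers the handler once, inside the evaluating process
    signal.signal(signal.SIGTERM, on_sigterm)
    os.kill(os.getpid(), signal.SIGTERM)  # simulate the optimizer's terminate()
    # a real training loop would continue (or exit cleanly) here

target_function({})
print(checkpoints)  # → ['saved']
```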

P.s.

This won't work when using a custom remote dask cluster, as you have no way to send a signal to the process running on another machine (or maybe dask does?); it only works when things are done with local processes. Perhaps dask has some unified way of handling this.