Bronzila closed this issue 2 months ago
You could look at this answer using `sched`: https://stackoverflow.com/a/474543/5332072

From there, dask has a way to essentially `shutdown()` the `Client` and then `close()` it.
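For the runtime-budget part, the `sched` approach from the linked answer could look roughly like this toy sketch; `shutdown_workers` is a placeholder for whatever DEHB would actually do on timeout (cancel futures, close the client), not an existing function:

```python
import sched
import time

fired = []

def shutdown_workers():
    # Placeholder: in DEHB this would cancel outstanding futures
    # and close the dask client.
    fired.append(time.time())

scheduler = sched.scheduler(time.time, time.sleep)
runtime_budget = 0.05  # seconds; would be e.g. 3600 for a 1h budget
scheduler.enter(runtime_budget, 1, shutdown_workers)
scheduler.run()  # blocks until the scheduled event fires
```

In a real optimizer you would run `scheduler.run()` on a separate thread so the optimization loop is not blocked.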
@eddiebergman what do we do with the interrupted evaluation? Assuming the evaluation is a deep learning model training run, is it okay to still exceed the runtime in order to trigger saving the current state? @Bronzila feel free to share your thoughts too.
Based on a look over the code, the "hot loop" is here, with its break condition on the following line: https://github.com/automl/DEHB/blob/54ce41c4c516e38aefc5944a2b677b95cfa2e05a/dehb/optimizers/dehb.py#L750-L751
I would probably do something along these lines for the dask case; it should basically cancel all jobs running in dask and wait for all of them to return. The wait part isn't fully necessary, but in principle it should be fine.
```python
import concurrent.futures

self.client.close()
for future in self.futures:
    future.cancel()
# Note: return_when must be passed as a keyword argument; a positional
# second argument would be interpreted as a timeout.
concurrent.futures.wait(
    self.futures, return_when=concurrent.futures.ALL_COMPLETED
)
```
Dask has the property that you can cancel running jobs, but in the non-dask case (here), where you're just calling the function directly, you can't cancel it because it's running in the same process. Killing it would mean killing the whole thing.
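The limitation is easy to see with the stdlib's `concurrent.futures`, whose `Future` API dask's futures roughly mirror: `cancel()` only succeeds for work that hasn't started executing yet (a standalone sketch, not DEHB code):

```python
import concurrent.futures
import time

with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
    running = pool.submit(time.sleep, 0.2)  # picked up immediately
    time.sleep(0.05)                        # give it time to start
    queued = pool.submit(time.sleep, 0.2)   # still pending

    running.cancel()  # too late: already executing, returns False
    queued.cancel()   # succeeds: it never started

    concurrent.futures.wait(
        [running, queued],
        return_when=concurrent.futures.ALL_COMPLETED,
    )

print(running.cancelled(), queued.cancelled())
```

Unlike the stdlib, dask *can* interrupt an already-running task, because the task lives in a worker process it controls.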
To circumvent this, you would need to run the target function in a subprocess of some kind and use `psutil` to effectively kill the process.
This is much harder, especially when you don't control the target function. The first thing you need is a handle to the process that is running the target function. Then you can send a `SIGTERM` to that process with `.terminate()`:
```python
import psutil

process = psutil.Process(<process-id of the thing to signal>)
process.terminate()
```
The correct procedure here, by OS standards, is for the process to clean up and finish soon after receiving the signal. The way to do this is Python's `signal` module, specifically this function:
```python
import signal

def callback(signal_num, framestack) -> None:
    ...  # cleanup, save a model, whatever

signal.signal(signal.SIGTERM, callback)
```
The tricky part is that users have to specify this themselves, i.e. their target function is going to be called, and this callback has to be registered once inside the process that runs the target function. I do not know how you'd like to do that. I think your best approach is to simply give an example and move on; trying to handle this stuff automatically would be a nightmare to build and maintain.
This won't work when using a custom remote dask server, as you have no way to send a signal to the process running on the other machine (or maybe dask does?); it only works if things are done with local processes. Perhaps dask has some unified way of handling this.
The current implementation waits for all started jobs when the runtime budget is exhausted. This does make sense when using function evaluations or number of iterations as budget, but not when specifying the maximum computation cost in seconds.
Toy failure mode: The computational budget is 1h, but a new job that would e.g. take 30 mins is submitted after 59 mins of optimization. The optimizer would then wait for this job to finish and therefore overshoot the maximum computational budget of 1h.
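The overshoot is easy to quantify with a toy simulation of the current behaviour (illustrative only; `overshoot` and its inputs are made up, not DEHB code):

```python
def overshoot(budget, job_durations):
    """Jobs are submitted as long as the budget is not yet exhausted,
    and every submitted job runs to completion. Returns how far past
    the budget the run finishes."""
    t = 0.0
    for duration in job_durations:
        if t >= budget:
            break      # budget exhausted: nothing new is submitted...
        t += duration  # ...but a submitted job is always waited for
    return max(0.0, t - budget)

# 1h budget (in minutes); a 30 min job starts at the 59 min mark,
# so the run ends at 89 min -- 29 min over budget.
print(overshoot(60, [59, 30]))
```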
For now, a quick fix could be to simply stop all workers when the runtime budget is exhausted; however, this would potentially lose compute time. Therefore it might also be interesting to think of a way to checkpoint the optimizer's state in order to resume later.
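A minimal sketch of that checkpointing idea could be plain pickling; the state layout here is hypothetical, and DEHB would have to decide what its resumable state actually contains (population, trial history, RNG state seem like the obvious candidates):

```python
import pickle
import random

def save_state(path, state):
    with open(path, "wb") as f:
        pickle.dump(state, f)

def load_state(path):
    with open(path, "rb") as f:
        return pickle.load(f)

# Hypothetical state layout, not DEHB's actual internals.
state = {"history": [(0, 0.42)], "rng_state": random.getstate()}
save_state("opt_state.pkl", state)

resumed = load_state("opt_state.pkl")
random.setstate(resumed["rng_state"])  # resume with the same RNG stream
```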