idantene opened this issue 1 year ago
Hi @idantene,
When an instance (spot or otherwise) crashes with e.g. "out of disk space", the autoscaler does not terminate the instance. The instance is left running, yielding higher costs.
This situation is not trivial to detect (and we wouldn't like to terminate instances due to false alarms) - that's the reason this is not done at the moment...
We have set up a spot queue (amongst others). When a spot instance is terminated by AWS, the autoscaler fails to recognize this and the task status is stuck on "running". I understand this is not a bug per se, since the agent was shut down, but it would be great if the autoscaler would identify these tasks and either mark them as failed or restart them.
We'll add this to our todo list
Hey @jkhenning, thanks for the reply.
The reason I listed both in this issue is that I believe the solution is the same. As far as I can tell, a remote agent basically sends an occasional "ping" for a running task (or similar updates, e.g. to refresh the console tab). The autoscaler could then use two arguments to control the desired behavior. One would determine how much time is allowed to pass without an update from an agent before that agent's machine is deemed stale; stale machines should always be terminated. An additional setting would then control whether to restart the task (spawn a new task) or not. A smarter flow could also deduce from the system log what happened (for example: no error and the instance is no longer running -> the spot instance was terminated -> restart?; vs. an error while the instance is still running -> the agent crashed -> mark the task as failed and terminate the instance).
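To make that concrete, the two settings could look roughly like this (the names are placeholders I made up for illustration; they are not existing autoscaler options):

```python
# Hypothetical autoscaler settings -- the names are placeholders, not real options.
STALE_AGENT_SETTINGS = {
    # How long (in minutes) a task may go without any update from its agent
    # before the machine running it is considered stale and terminated.
    "stale_minutes": 15,
    # What to do with the task itself once its machine is deemed stale:
    # re-enqueue it (True) or simply mark it as failed (False).
    "restart_stale_task": True,
}
```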
Hi @idantene, "no error" is a fairly complicated terminology - how would you deduce with certainty by parsing text that some disk error has occurred?
I'm suggesting not doing that at all. Since the autoscaler keeps track of the instances, since it knows which instance is running which task, and since the tasks get updated/polled every N seconds (which requires the remote agent to be running), I'm suggesting (see the rough sketch after this list):
1. A new configuration setting, e.g. autoscaler.stale_minutes.
2. If a task has not been updated for more than autoscaler.stale_minutes minutes:
a. If the corresponding instance is still running -> the conclusion is that the agent crashed. Terminate the instance, mark the task as failed.
b. Else (the corresponding instance is no longer running), and if the instance is a spot instance -> the spot instance was terminated by AWS. Restart the task (or mark it as failed, configurable via a flag).
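To illustrate the flow above, here is a rough sketch of what the check could look like. The task->instance mapping, the setting names, and the exact ClearML fields (e.g. `last_update`) are assumptions on my part, not something the autoscaler exposes today:

```python
from datetime import datetime, timedelta, timezone

import boto3
from botocore.exceptions import ClientError
from clearml import Task

# Hypothetical settings (see the earlier sketch) -- not existing options.
STALE_MINUTES = 15
RESTART_STALE_SPOT_TASK = True

ec2 = boto3.client("ec2")


def handle_possibly_stale_task(task_id: str, instance_id: str, queue_name: str) -> None:
    """Check one (task, instance) pair the autoscaler is already tracking."""
    task = Task.get_task(task_id=task_id)

    # "last_update" is the periodic ping mentioned above; the exact field
    # name/location may differ between server versions.
    last_update = task.data.last_update
    if last_update.tzinfo is None:
        last_update = last_update.replace(tzinfo=timezone.utc)
    if datetime.now(timezone.utc) - last_update < timedelta(minutes=STALE_MINUTES):
        return  # still being updated -> nothing to do

    # Ask AWS what happened to the machine that was running this task.
    try:
        reservation = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"][0]
        instance = reservation["Instances"][0]
        state = instance["State"]["Name"]
        is_spot = instance.get("InstanceLifecycle") == "spot"
    except (ClientError, IndexError):
        state, is_spot = "terminated", True  # instance is already gone

    if state == "running":
        # (a) instance still up but agent stopped reporting -> agent crashed:
        # terminate the instance and mark the task as failed.
        ec2.terminate_instances(InstanceIds=[instance_id])
        task.mark_failed(status_message="Agent stopped reporting; instance terminated")
    elif is_spot and RESTART_STALE_SPOT_TASK:
        # (b) spot instance was terminated by AWS -> re-enqueue the task.
        task.mark_stopped(force=True)
        task.reset()
        Task.enqueue(task, queue_name=queue_name)
    else:
        task.mark_failed(status_message="Instance no longer running")
```

This is of course simplified - the autoscaler already keeps the instance list and the task IDs, so on top of its existing polling loop it would only need the two settings.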
Describe the bug
We're using self-hosted ClearML with the example AWS autoscaler, and have noticed the two bugs quoted in the discussion above (presumably related to one another).
Environment
Self-hosted ClearML, latest SDK and server versions.
Related Discussion
Slack