allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0

Autoscaler fails on spot instances and other instance errors #979

Open idantene opened 1 year ago

idantene commented 1 year ago

Describe the bug

We're using self-hosted ClearML with the example AWS autoscaler, and have noticed the following two bugs (possibly related to one another):

  1. When an instance (spot or otherwise) crashes with e.g. an "out of disk space" error, the autoscaler does not terminate the instance. The instance is left running, incurring higher costs.
  2. We have set up a spot queue (among others). When a spot instance is terminated by AWS, the autoscaler fails to recognize this and the task status is stuck on "running". I understand this is not a bug per se, since the agent was shut down, but it would be great if the autoscaler could identify these tasks and either mark them as failed or restart them.

Environment

Self-hosted ClearML, latest SDK and server versions.

Related Discussion

Slack

jkhenning commented 1 year ago

Hi @idantene ,

When an instance (spot or otherwise) crashes with e.g. "out of disk space", the autoscaler does not terminate the instance. The instance is left running, yielding higher costs.

This situation is not trivial to detect (and we wouldn't like to terminate instances due to false alarms) - that's the reason this is not done at the moment...

We have set up a spot queue (amongst others). When a spot instance is terminated by AWS, the autoscaler fails to recognize this and the task status is stuck on "running". I understand this is not a bug per se, since the agent was shut down, but it would be great if the autoscaler would identify these tasks and either mark them as failed or restart them.

We'll add this to our todo list

idantene commented 1 year ago

Hey @jkhenning, thanks for the reply.

The reason I listed both in this issue is that I believe the solution is the same. As far as I can tell, a remote agent periodically sends a "ping" for a running task (or something similar, e.g. to update the console tab). The autoscaler could then use two settings to control the desired behavior. One would determine how much time may pass without an update from an agent before that agent's machine is deemed stale; stale machines should always be terminated. A second setting would determine whether to restart the task (spawn a new task) or not. A smarter flow could also deduce from the system log what happened (for example: no error and the instance no longer running -> spot instance terminated -> restart?; an error, or the instance still running -> agent crashed -> mark failed and terminate the instance).

jkhenning commented 1 year ago

Hi @idantene, "no error" is a fairly complicated notion - how would you deduce with certainty, by parsing text, that some disk error has occurred?

idantene commented 1 year ago

I'm suggesting not doing that at all. Since the autoscaler keeps track of the instances, knows which instance is running which task, and the tasks get updated/polled every N seconds (which requires the remote agent to be running), I'm suggesting:

  1. The autoscaler monitors its tasks for these updates.
  2. Once a task has gone without updates for some `autoscaler.stale_minutes` minutes:
     a. If the corresponding instance is still running -> the conclusion is that the agent crashed. Terminate the instance and mark the task as failed.
     b. Else (the corresponding instance is no longer running), and if the instance is a spot instance -> the spot instance was terminated by AWS. Restart the task (or mark it as failed, configurable via a flag).
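To make the proposal concrete, here is a minimal sketch of the staleness policy above. None of this is existing ClearML autoscaler code: the helper callables (`seconds_since_update`, `instance_is_running`, `is_spot`, etc.) and the two settings are hypothetical placeholders for whatever the autoscaler would use internally.

```python
# Hypothetical settings, mirroring the two proposed configuration options.
STALE_MINUTES = 15          # proposed autoscaler.stale_minutes
RESTART_SPOT_TASKS = True   # proposed flag: restart vs. mark failed

def check_stale_tasks(tracked, seconds_since_update, instance_is_running,
                      is_spot, restart_task, mark_failed, terminate_instance):
    """Apply the proposed staleness policy to (task_id, instance_id) pairs.

    All arguments except `tracked` are hypothetical callables standing in
    for the autoscaler's internal bookkeeping and cloud/API actions.
    """
    for task_id, instance_id in tracked:
        if seconds_since_update(task_id) < STALE_MINUTES * 60:
            continue  # the agent is still updating this task
        if instance_is_running(instance_id):
            # Agent crashed but the machine is alive: stop paying for it.
            terminate_instance(instance_id)
            mark_failed(task_id)
        elif is_spot(instance_id):
            # Spot instance reclaimed by AWS: retry or fail, per config.
            if RESTART_SPOT_TASKS:
                restart_task(task_id)
            else:
                mark_failed(task_id)
```

The point of the sketch is that no log parsing is needed: the instance's running state plus the task's last-update timestamp are enough to distinguish "agent crashed" from "spot instance reclaimed".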