allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0

Autoscaler kill workers that have just picked up a new task. #1202

Closed: cthorey closed this issue 4 months ago

cthorey commented 4 months ago

Describe the bug

I am not sure it's a bug per se, but given the implementation (auto_scaler.py), which updates idle_workers once at the beginning of the event loop and uses the timestamps reported there to decide which instances to spin down, I end up in cases (albeit not often) where a worker gets spun down even though it has just picked up a new Task.

Would it not be better to check, right before spinning down the worker, whether it is still idle? I am referring to line 325 in auto_scaler.py. A rough sketch of what I mean is below.
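Here is a minimal sketch of the proposed re-check, not the actual auto_scaler.py code. The helpers `query_idle_worker_ids` and `terminate_instance` are hypothetical stand-ins for the autoscaler's own server query and cloud spin-down logic:

```python
from typing import Callable, Iterable, Set


def spin_down_if_still_idle(
    candidates: Iterable[str],
    query_idle_worker_ids: Callable[[], Set[str]],
    terminate_instance: Callable[[str], None],
) -> None:
    # Refresh the idle set right before acting, instead of trusting the
    # snapshot taken at the top of the event loop.
    still_idle = query_idle_worker_ids()
    for worker_id in candidates:
        if worker_id not in still_idle:
            # The worker picked up a task since the snapshot was taken; keep it.
            continue
        terminate_instance(worker_id)
```

This narrows the race window but does not eliminate it, which is what the rest of the thread discusses.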

ainoam commented 4 months ago

Makes total sense @cthorey - Would you care to issue a PR?

cthorey commented 4 months ago

I thought about it, but then I realized we still have no way to guarantee that the agent does not pick up a new task while the instance is being taken down by the cloud provider. What would be better is to detect when the agent has been taken down and reschedule the Tasks that were interrupted this way.

I raised an issue here, https://github.com/allegroai/clearml-agent/issues/188, which describes what prevents this for now.

Specifically, when an instance is taken down, a SIGTERM is sent to the running processes and the running tasks are marked as completed. It would be better to mark them as failed, so that we have the option to reschedule them via the retry_on_failure parameter that we can pass to the PipelineController.
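For context, this is roughly how retry_on_failure would come into play. The project and task names are placeholders, and exact parameter support may vary with the clearml version:

```python
from clearml import PipelineController

pipe = PipelineController(
    name="example-pipeline",
    project="example-project",
    version="1.0.0",
)

pipe.add_step(
    name="train",
    base_task_project="example-project",
    base_task_name="train-task",
    # Only steps that end up in a "failed" state are retried; tasks that get
    # marked "completed" when the instance is terminated would not be.
    retry_on_failure=3,
)

pipe.start()
```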

ainoam commented 4 months ago

Sounds like we're mixing up a number of points @cthorey.

  1. Your original post - a race condition where an instance's activity status is stale by the time the autoscaler acts to take it down.
  2. The status of a task once its executing agent was explicitly terminated (which you address in clearml-agent#188) and its effect on pipeline logic.

These should probably be handled independently. WDYT?

cthorey commented 4 months ago

Yep - I agree they should be handled independently. Regarding 1., and hence this issue, I think we can reasonably close it given that, as I said above, we have no way to guarantee that the agent does not pick up a new task while the instance is being taken down by the cloud provider.