allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0
5.43k stars 643 forks

:bug: Recheck that the worker is still IDLE before taking it down #1240

Closed cthorey closed 2 months ago

cthorey commented 3 months ago

Related Issue \ discussion

This patch is motivated by the discussion in Issue 1202.

Patch Description

The patch double-checks that the worker is indeed still IDLE before spinning it down. The list of IDLE workers is refreshed at the beginning of the event loop, so by the time a worker is actually spun down that list might no longer be accurate.
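
The idea can be sketched roughly as follows. This is a minimal illustration and not the actual autoscaler code: `spin_down_idle_workers`, `fetch_worker_state`, and `spin_down` are hypothetical names standing in for whatever the autoscaler really uses to query workers and terminate instances.

```python
from typing import Callable, Dict, Iterable, List, Optional

# Hypothetical shape: worker id -> id of the task it is running (None when idle).
WorkerState = Dict[str, Optional[str]]


def spin_down_idle_workers(
    candidates: Iterable[str],
    fetch_worker_state: Callable[[], WorkerState],
    spin_down: Callable[[str], None],
) -> List[str]:
    """Spin down only the candidates that are still idle right now.

    `candidates` is the (possibly stale) idle list computed at the start
    of the loop iteration. `fetch_worker_state` and `spin_down` are
    hypothetical callables standing in for the real backend calls.
    """
    removed = []
    for worker_id in candidates:
        # Re-query just before acting: the earlier snapshot may be out of date.
        current = fetch_worker_state()
        if worker_id not in current:
            continue  # worker is already gone, nothing to do
        if current[worker_id] is not None:
            continue  # worker picked up a task since the snapshot
        spin_down(worker_id)
        removed.append(worker_id)
    return removed


if __name__ == "__main__":
    # w2 was idle in the stale snapshot but has since picked up a task.
    state = {"w1": None, "w2": "task-42"}
    killed: List[str] = []
    spin_down_idle_workers(["w1", "w2"], lambda: dict(state), killed.append)
    print(killed)
```

In this sketch only `w1` is spun down, because `w2` is found to be busy when it is rechecked.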

Testing Instructions

Launch the autoscaler with max_idle_time=60s and queue jobs at 60s intervals. At some point a job gets picked up and the agent still gets killed.

Other Information

As mentioned in the issue, this does not strictly solve the problem, since the agent could still pick up a task while the instance is being spun down, but it makes it less likely.
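
To make the remaining window concrete, here is a small deterministic simulation. It is only a sketch under assumed names: `recheck_and_terminate` and the `between_check_and_kill` hook are hypothetical and exist purely to model an agent picking up a task between the recheck and the actual termination.

```python
from typing import Callable, Dict, Optional


def recheck_and_terminate(
    workers: Dict[str, Optional[str]],
    worker_id: str,
    between_check_and_kill: Callable[[], None] = lambda: None,
) -> Optional[str]:
    """Recheck idleness, then terminate. Return the task that was lost, if any.

    `workers` maps worker id -> running task id (None when idle).
    `between_check_and_kill` is a hypothetical hook that models whatever
    happens in the gap between the check and the termination.
    """
    if workers.get(worker_id) is not None:
        return None  # recheck caught it: the worker is busy, leave it alone
    between_check_and_kill()  # in a real system the agent may grab a task here
    # Terminate the instance; whatever task it held at that moment is lost.
    return workers.pop(worker_id, None)


if __name__ == "__main__":
    workers: Dict[str, Optional[str]] = {"w1": None}

    def agent_picks_up_task() -> None:
        workers["w1"] = "task-7"

    lost = recheck_and_terminate(workers, "w1", agent_picks_up_task)
    print("lost task:", lost)
```

The recheck narrows the window from a whole loop iteration to the gap between the check and the termination, but it cannot close it without some form of coordination with the agent.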