allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0

[Feature request] allowing closing workers from the UI #238

Open Mano3 opened 4 years ago

Mano3 commented 4 years ago

Hi, currently if I have a worker running as a detached daemon, I can only close it by running daemon --stop on that machine. I would appreciate being able to close workers through the UI as well :)
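For context, a minimal sketch of what stopping the detached daemon amounts to today, wrapped from Python (assumes trains-agent is on PATH; the dry_run guard is added here so the snippet runs on machines without it installed):

```python
import subprocess

# Mirrors the shell command used to stop the detached daemon on this host:
#   trains-agent daemon --stop
STOP_CMD = ["trains-agent", "daemon", "--stop"]

def stop_local_daemon(dry_run=True):
    """Stop the detached trains-agent daemon on this machine.

    With dry_run=True the command is only returned, not executed,
    so the sketch works even where trains-agent is not installed.
    """
    if dry_run:
        return " ".join(STOP_CMD)
    # Executes the real CLI; requires trains-agent to be installed.
    subprocess.run(STOP_CMD, check=True)
    return " ".join(STOP_CMD)

print(stop_local_daemon())  # trains-agent daemon --stop
```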

bmartinn commented 4 years ago

Hi @Mano3, is this like spinning instances up/down online in the cloud? Are you using the aws_autoscaler?

Mano3 commented 4 years ago

@bmartinn No, I'm talking about on-premise instances :) The spin-up feature is not as important to me as the spin-down, but if there could be a true history of workers which I could control, spinning them up and down, that would be even better.

bmartinn commented 4 years ago

@Mano3 I see... So is it more like enable/disable the trains-agent ?

Which would translate to "stop pulling jobs from queues" / "continue pulling jobs from queues", but it would not kill a running job?

Mano3 commented 4 years ago

This is a plausible implementation for deactivating a worker. But a feature that saves a worker's configuration and lets you set it up again from the UI (for example, after a server reboot) would be amazing. What I was really looking for is spinning on-premise workers up/down from previously stored worker configs. For example, if I had a worker that took GPUs 2-3 on server A and used the trains.conf located at /path/trains.conf, I'd want to store its configuration and bring it up and down straight from the UI, without having to call the trains-agent daemon .. command. But the enable/disable approach would work too.
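The stored-configuration idea could be sketched as a small registry that records how each worker was launched and can rebuild the launch command later. Everything here (class, field names, the registry) is hypothetical, not an existing trains API; only the trains-agent CLI flags and the TRAINS_CONFIG_FILE variable come from the actual tool:

```python
from dataclasses import dataclass

@dataclass
class WorkerConfig:
    """Everything needed to recreate a worker later (hypothetical)."""
    name: str
    host: str            # machine the worker runs on, e.g. "server-a"
    gpus: str            # GPU indices, e.g. "2,3"
    conf_path: str       # trains.conf to use
    queue: str = "default"

    def launch_command(self):
        # Reconstruct the original `trains-agent daemon ...` invocation.
        return (
            f"TRAINS_CONFIG_FILE={self.conf_path} "
            f"trains-agent daemon --gpus {self.gpus} "
            f"--queue {self.queue} --detached"
        )

registry = {}

def store(cfg):
    registry[cfg.name] = cfg

def spin_up_command(name):
    """Return the shell command a UI button could run on the worker's host."""
    return registry[name].launch_command()

store(WorkerConfig("gpu23", "server-a", "2,3", "/path/trains.conf"))
print(spin_up_command("gpu23"))
```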

bmartinn commented 4 years ago

If I understand you correctly, this would serve two main purposes

  1. Free up resources for other out-of-scope processes (i.e. not executed by the trains-agent)
  2. Online allocation of GPUs for an agent

If my understanding is correct, I think the "easiest" solution is integrating Kubernetes with the trains-agent. Basically, you have k8s spin up an agent (with specific resources controlled by k8s); this agent connects to the trains-server and pulls jobs, and when you decide to spin it down, you use k8s to spin the pod down. Obviously this means installing Kubernetes, which is never an easy integration. That said, once you have k8s running on the on-prem machines, adding the agent as another pod running on the nodes is fairly easy :)

Mano3 commented 4 years ago

Hmm, I fail to understand why there is a need for K8s. If I use K8s, I'm not sure why there is a need for trains-agent at all. What I am asking for is as follows:

  1. When a worker is created, log its configuration (GPUs to use, host, trains.conf, etc.)
  2. Enable closing an active worker via the GUI (same as the trains-agent daemon --stop command)
  3. Enable re-creation of stored workers (simply re-run the command that was used to create these stored workers in the first place, for all I know). If the machine is offline, simply print an error; that shouldn't be your concern.
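Step (3) with the offline check could look something like this. The SSH reachability probe and function names are illustrative assumptions, not an existing trains feature; the reachable parameter lets the logic run without a network:

```python
import shlex
import subprocess

def recreate_worker(host, command, reachable=None):
    """Replay a stored launch command on its original host (sketch).

    `reachable` allows testing without a network; by default it would
    be an SSH reachability probe. All names here are illustrative.
    """
    if reachable is None:
        # Probe the host; a short timeout keeps the UI responsive.
        probe = ["ssh", "-o", "ConnectTimeout=3", host, "true"]
        reachable = subprocess.run(probe).returncode == 0
    if not reachable:
        # Per the suggestion: if the machine is offline, just report it.
        return f"error: {host} is offline, cannot recreate worker"
    # Run the stored command remotely, quoted as a single ssh argument.
    return "ssh {} {}".format(host, shlex.quote(command))

# Offline host -> an error message, nothing else happens:
print(recreate_worker("server-a", "trains-agent daemon --gpus 2,3 --detached",
                      reachable=False))
```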

bmartinn commented 4 years ago

why there is need for K8s

You need to have an agent on the machine, spinning the trains-agent process up/down, that agent could be the k8s daemon.

if I use K8s i'm not sure why there is a need for trains-agent at all.

A few points that immediately come to mind

  1. Because with k8s you need to package every job inside a container, and since jobs constantly change this is a lot of work
  2. There is no real scheduler built into k8s (basically job order / priority cannot be controlled)
  3. No UI to schedule jobs (obviously also including the lack of control for order / priority)
  4. It is way more complex to configure a k8s job yaml than a trains job... Because k8s was designed for DevOps, not ML data-scientists / engineers.
  5. Automation (see pipelines, HPO etc.) on top of k8s is way more complex (and unfortunately limited) than on top of trains (again, k8s was not designed for it)

Specifically regarding the suggestion: (2) / (3) mean you actually have a daemon on the machine shutting down the trains-agent process... Obviously (2)/(3) could be implemented (kind of) with an enable/disable flag for the trains-agent daemon, with the flag controlled from the UI / RestAPI.
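The enable/disable flag semantics discussed above (stop pulling jobs without killing the running one) can be sketched as a toy poll loop; this is not an existing trains-agent feature, just an illustration of the proposed behavior:

```python
class AgentLoop:
    """Toy model of the proposed enable/disable flag: the daemon keeps
    running, but only pulls jobs from the queue while the flag (which
    the UI/RestAPI would toggle) is enabled. Disabling never kills a
    job that already started."""

    def __init__(self, queue):
        self.queue = queue        # pending jobs, in order
        self.enabled = True       # flag the UI would flip
        self.executed = []        # jobs this agent has pulled

    def poll_once(self):
        # A disabled agent simply skips pulling; it does not exit.
        if not self.enabled or not self.queue:
            return None
        job = self.queue.pop(0)
        self.executed.append(job)
        return job

agent = AgentLoop(queue=["job-1", "job-2"])
agent.poll_once()           # pulls job-1
agent.enabled = False       # "stop pulling jobs from queues"
agent.poll_once()           # no-op: job-2 stays queued
agent.enabled = True        # "continue pulling jobs from queues"
agent.poll_once()           # pulls job-2
print(agent.executed)       # ['job-1', 'job-2']
```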

Unfortunately, expanding the management capabilities of the trains-agent in such a way is currently out of scope for the open-source project. The sugar-coating is: I'm pretty sure they have what you are looking for, and even more, in the enterprise edition :)