Open Mano3 opened 4 years ago
Hi @Mano3, is this like an online spin up/down of instances on the cloud? Are you using the aws_autoscaler?
@bmartinn No. I'm talking about on-premise instances :) The spin-up feature is not as important to me as the spin-down, but if there could be a true history of workers which I could control and spin up and down, that would be even better.
@Mano3 I see... So is it more like enable/disable of the trains-agent?
Which would translate to "stop pulling jobs from queues" / "continue pulling jobs from queues", but would not kill a running job?
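The enable/disable semantics described above (pause pulling new jobs without killing a running one) can be sketched as a minimal polling loop. This is only an illustration of the behavior being discussed; `agent_loop` and the `paused` flag are hypothetical stand-ins, not trains-agent internals:

```python
import queue

def agent_loop(job_queue, paused, max_iterations):
    """Pull and 'run' jobs unless paused.

    Pausing only stops *pulling*; a job already pulled (running)
    would be left untouched in the real scenario.
    """
    completed = []
    for _ in range(max_iterations):
        if paused():  # disabled: stop pulling from the queues
            continue
        try:
            job = job_queue.get_nowait()
        except queue.Empty:
            continue
        completed.append(job)  # stand-in for executing the task
    return completed

q = queue.Queue()
for j in ("task-1", "task-2", "task-3"):
    q.put(j)

# Agent enabled: jobs are pulled and executed.
done = agent_loop(q, paused=lambda: False, max_iterations=10)

# Agent disabled: anything queued afterwards stays in the queue.
q.put("task-4")
untouched = agent_loop(q, paused=lambda: True, max_iterations=10)
```

The key design point is that the flag gates only the pull step, so disabling a worker never interrupts work in progress.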
This is a plausible implementation for deactivating a worker. But a feature to save a worker's configuration and be able to set it up again from the UI (for example after a server reboot) would be amazing. I was really looking for spin up / spin down of an on-premise worker from a previously stored worker config: for example, if I had a worker that took GPUs 2-3 on server A and used the trains.conf located at /path/trains.conf, store its config and be able to bring it up and down straight from the UI without having to call the trains-agent daemon .. command. But that would work too.
If I understand you correctly, this would serve two main purposes: spinning the trains-agent up and down.

If my understanding is correct, I think the "easiest" solution is integrating Kubernetes with the trains-agent. Basically you have k8s spin up an agent (with specific resources controlled by k8s), then this agent connects to the trains-server and pulls jobs; when you decide to spin it down, you use k8s to spin the pod down. Obviously this means installing Kubernetes, which is never an easy integration. That said, once you have k8s running on the on-prem machines, adding the agent as another pod running on the nodes is fairly easy :)
Hmm, I fail to understand why there is a need for k8s. If I use k8s, I'm not sure why there is a need for trains-agent at all. What I ask for is as follows:
1. When a worker is created, log its configuration (GPUs to use, host, trains.conf, etc.)
2. Enable closing an active worker via the GUI (same as the trains-agent daemon --stop command)
3. Enable re-creation of stored workers (simply re-run the command that was used to create these stored workers in the first place, for all I know). If the machine is offline, simply print an error; that shouldn't be your concern.
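The three points above amount to a small worker-config registry. Here is a rough sketch of points (1) and (3) under explicit assumptions: `WorkerConfig`, the registry dict, and the exact flag spelling (`--gpus`, `--queue`, `--detached`, `TRAINS_CONFIG_FILE`) are illustrative, not a confirmed trains-agent interface:

```python
from dataclasses import dataclass, field

@dataclass
class WorkerConfig:
    """Snapshot of how a worker was launched (point 1 of the request)."""
    name: str
    host: str
    gpus: str        # e.g. "2,3"
    conf_path: str   # e.g. "/path/trains.conf"
    queues: list = field(default_factory=list)

    def launch_command(self):
        """Recreate the original daemon command string (point 3).

        The flags here are assumptions for the sketch; in a real
        implementation you would store the command actually used.
        """
        queue_args = " ".join(self.queues) if self.queues else "default"
        return (f"TRAINS_CONFIG_FILE={self.conf_path} "
                f"trains-agent daemon --gpus {self.gpus} "
                f"--queue {queue_args} --detached")

# A server-side registry keyed by worker name (hypothetical).
registry = {}
cfg = WorkerConfig("worker-a", "server-a", "2,3", "/path/trains.conf", ["default"])
registry[cfg.name] = cfg

# Re-creating the worker is just replaying the stored command.
cmd = registry["worker-a"].launch_command()
```

If the target machine is offline, the replay step would simply surface the error, as requested.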
> why there is a need for k8s

You need to have an agent on the machine spinning the trains-agent process up/down; that agent could be the k8s daemon.

> if I use k8s, I'm not sure why there is a need for trains-agent at all.
A few points that immediately come to mind: k8s was designed for DevOps, not for ML data-scientists / engineers, so trains handles the notion of a job in a way that k8s does not (again, k8s was not designed for it).

Specifically regarding the suggestion:
(2) / (3) means you actually have a daemon on the machine shutting down the trains-agent process...

Obviously (2) / (3) could be implemented (kind of) with an enable/disable flag for the trains-agent daemon, controlling the flag from the UI / RestAPI.

Unfortunately, expanding the management capabilities of the trains-agent in such a way is currently out of scope for the open-source project. The sugar-coating is, I'm pretty sure they have what you are looking for, and even more, in the enterprise edition :)
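The enable/disable flag mentioned above would look roughly like this on the agent side. To be clear, `fetch_worker_flag` is a stand-in for a RestAPI call that does not exist in the open-source server; the sketch just shows where such a flag would be consulted:

```python
def fetch_worker_flag(worker_name, flags):
    # Stand-in for a RestAPI lookup of a per-worker "enabled" flag.
    # No such endpoint exists in the open-source server, so this sketch
    # reads from a local dict; unknown workers default to enabled.
    return flags.get(worker_name, True)

def should_pull(worker_name, flags):
    # The daemon would check the flag before pulling the next job,
    # so toggling it in the UI pauses the worker without killing it.
    return fetch_worker_flag(worker_name, flags)

# Imagined state set from the UI / RestAPI: worker-a has been disabled.
flags = {"worker-a": False}
disabled = should_pull("worker-a", flags)
enabled_elsewhere = should_pull("worker-b", flags)
```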
Hi, currently if I have a worker running as a detached daemon, I can only close it with trains-agent daemon --stop. I would appreciate being able to close workers through the UI as well :)