allegroai / clearml-agent

ClearML Agent - ML-Ops made easy. ML-Ops scheduler & orchestration solution
https://clear.ml/docs/
Apache License 2.0

Agents disappear after machine restart #40

Open majdzr opened 3 years ago

majdzr commented 3 years ago

Hello,

Thanks for the awesome product and especially the trains-agent.

I have a question/issue regarding the persistence of the agents:

Background: I have an Ubuntu machine, running several agents that I created using the following command:

TRAINS_WORKER_ID=servername:gpu1c_only trains-agent daemon --detached --gpus 1 --create-queue --queue gpu1_only --docker nvcr.io/nvidia/pytorch:20.08-py3

If it makes any difference: on a 2-GPU machine, I have several agents running on a single GPU, each with its own queue, and another set of agents running on both GPUs, also each with its own queue.

Issue: When I restart the machine, the agents disappear and I need to recreate them. The only one that survives the reset is the services agent. If it matters, the UI still shows the deleted queues in the enqueue menu. [ A bonus question: how can I clean the list? ]

Question: How to make these agents persistent?

Thanks in advance.

bmartinn commented 3 years ago

Hi @majdzr

"When I restart the machine, the agents disappear and I need to recreate them."

Do you mean how to start the agents automatically on every reboot? If so, and assuming you have Ubuntu 16.04 or above, crontab to the rescue:

crontab -e

Then just add a line per agent:

@reboot /bin/bash -c "TRAINS_WORKER_ID=servername:gpu1c_only trains-agent daemon --detached ..."

More details on crontab can be found here. Note that a crontab belongs to the user who edits it (e.g. the jobs root schedules are separate from the jobs a regular user schedules). Make sure trains-agent is executed as your own user, since the default configuration file is ~/trains.conf. Even though you can specify a configuration file with the --config-file flag, it is not recommended to run the agent as root :)
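If you have several agents, the crontab lines can be generated rather than typed by hand. Below is a minimal sketch, not an official trains-agent tool: the helper names `reboot_entry` and `install_entry` are made up for illustration, and the daemon flags are simply the ones from the command in the original question. It assumes `trains-agent` is reachable from cron's fairly minimal PATH (otherwise use the full path to the executable) and that the worker ID and arguments contain no spaces or double quotes.

```shell
#!/usr/bin/env bash
# Sketch: build and register @reboot crontab entries for trains-agent daemons.
# Run as your own (non-root) user so that ~/trains.conf is picked up.

# Print one crontab line for a worker ID followed by the daemon arguments.
# (Hypothetical helper; assumes no spaces or double quotes in the arguments.)
reboot_entry() {
    local worker_id="$1"
    shift
    printf '@reboot /bin/bash -c "TRAINS_WORKER_ID=%s trains-agent daemon --detached %s"\n' \
        "$worker_id" "$*"
}

# Append an entry to the current user's crontab unless the exact line already exists.
# (Hypothetical helper; relies on `crontab -l` and `crontab -` reading from stdin.)
install_entry() {
    local entry="$1"
    local current
    current="$(crontab -l 2>/dev/null || true)"
    if printf '%s\n' "$current" | grep -Fxq -- "$entry"; then
        return 0   # already registered, nothing to do
    fi
    if [ -n "$current" ]; then
        printf '%s\n%s\n' "$current" "$entry" | crontab -
    else
        printf '%s\n' "$entry" | crontab -
    fi
}
```

For example, `install_entry "$(reboot_entry servername:gpu1c_only --gpus 1 --create-queue --queue gpu1_only --docker nvcr.io/nvidia/pytorch:20.08-py3)"` would register the agent from the question; repeat with a different worker ID and queue for each additional agent, then check the result with `crontab -l`.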