allegroai / clearml-agent

ClearML Agent - ML-Ops made easy. ML-Ops scheduler & orchestration solution
https://clear.ml/docs/
Apache License 2.0

ClearML Agent as a systemd service #210

Open niko-zvt opened 5 months ago

niko-zvt commented 5 months ago

Hello!

Is there an example of a systemd service configuration for running clearml-agent? The unit file I created is unstable: the agent often stops and cannot be restarted, although the same clearml-agent daemon ... commands work perfectly when run by hand from the CLI.

  1. Could this be due to the fact that I explicitly specify the daemon sub-command?
  2. What options are there for managing and running agents other than starting them manually?

I have to run the agent as a service for two reasons:

  a. When the server restarts, the agent doesn't start on its own; it has to be started manually or by a command scripted to run after boot (which is not good practice).
  b. I still haven't figured out whether I can run the agent inside a Docker container (Docker-in-Docker), since the agent itself uses Docker to create isolated containers for tasks based on nvidia-cuda images.

clearml-agent-gpu.service

[Unit]
Description=ClearML Agent Service
After=docker.target

[Service]
Type=forking
User=ml-worker
WorkingDirectory=/home/ml-worker/clearml-agent-virtualenv
ExecStart=/home/ml-worker/clearml-agent-virtualenv/bin/clearml-agent daemon --detached --queue default --gpus all
ExecStop=/home/ml-worker/clearml-agent-virtualenv/bin/clearml-agent daemon --detached --queue default --gpus all --stop
Restart=always
Environment="PATH=/home/ml-worker/clearml-agent-virtualenv/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"

[Install]
WantedBy=multi-user.target
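
One variant worth trying, as an untested sketch rather than a confirmed fix: run the agent in the foreground and let systemd supervise it directly. The assumptions here are that clearml-agent daemon stays in the foreground when --detached is omitted, that Docker is managed by the usual docker.service unit (Docker packages typically ship docker.service, not docker.target), and that systemd's default SIGTERM on stop is enough to shut the agent down. The RestartSec value is an arbitrary illustration.

```ini
[Unit]
Description=ClearML Agent Service
# Assumption: Docker runs under the stock docker.service unit.
After=docker.service network-online.target
Wants=docker.service network-online.target

[Service]
# Assumption: without --detached the agent does not fork into the
# background, so Type=simple lets systemd track the main process.
Type=simple
User=ml-worker
WorkingDirectory=/home/ml-worker/clearml-agent-virtualenv
Environment="PATH=/home/ml-worker/clearml-agent-virtualenv/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
ExecStart=/home/ml-worker/clearml-agent-virtualenv/bin/clearml-agent daemon --queue default --gpus all
# No ExecStop: systemd sends SIGTERM to the main process by default.
Restart=on-failure
# Illustrative delay between restart attempts.
RestartSec=10

[Install]
WantedBy=multi-user.target
```

After editing the unit, sudo systemctl daemon-reload followed by sudo systemctl restart clearml-agent-gpu applies the change, and journalctl -u clearml-agent-gpu shows the agent's output if it still exits.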

systemctl output for sudo systemctl start clearml-agent-gpu followed by sudo systemctl status clearml-agent-gpu: [screenshot not included]