allegroai / clearml-agent

ClearML Agent - ML-Ops made easy. ML-Ops scheduler & orchestration solution
https://clear.ml/docs/
Apache License 2.0
235 stars 90 forks

Agent on Mac doesn't pull tasks from queue and automatically unregisters from Server after a while #200

Closed. ruipimentel closed this issue 4 months ago

ruipimentel commented 4 months ago

I installed the clearml-agent pip package on a MacBook, initialized it with clearml-agent init, then registered it to a list of queues using clearml-agent daemon --queue heavy default light --detached. It is even listed under the "Workers & Queues" tab in my self-hosted server's ClearML web interface.

However, this Mac agent never pulls and executes a task automatically, even while agents running on Linux systems (configured in exactly the same manner) pull items from the queues and finish them successfully. The Mac agent returns no error. I've confirmed that it has SSH access to the GitHub repository, and it can successfully run tasks when I invoke clearml-agent execute --id <task-id> by hand. Still, it won't pull tasks automatically, and after a while it even disappears from the "Workers" UI.

Am I doing something wrong?

jkhenning commented 4 months ago

Hi @ruipimentel,

Can you attach the agent's console output? The easiest way is to re-run it without the --detached flag, add --foreground, and capture stdout and stderr. I assume it either prints some errors or crashes.
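
One possible way to capture that output is sketched below. The function name run_agent_foreground and the variables AGENT_BIN and LOG_FILE are hypothetical names chosen for illustration, not part of ClearML; the only agent options used are --queue and --foreground, which appear in this thread.

```shell
# Hypothetical helper: run the agent in the foreground and keep a copy of
# everything it prints. AGENT_BIN and LOG_FILE are illustrative names only.
AGENT_BIN="${AGENT_BIN:-clearml-agent}"
LOG_FILE="${LOG_FILE:-agent-console.log}"

run_agent_foreground() {
    # "$@" is the list of queues the agent should pull from.
    # 2>&1 merges stderr into stdout so that tee records both streams
    # while still showing them in the terminal.
    "$AGENT_BIN" daemon --queue "$@" --foreground 2>&1 | tee "$LOG_FILE"
}
```

With the queues from the original report, this would be called as run_agent_foreground heavy default light, and the resulting agent-console.log could then be attached to the issue.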

ruipimentel commented 4 months ago

Hi @jkhenning, I'm sorry it took me so long to reply.

The good news is that I can no longer reproduce the error; the agent now works reliably on Mac.

This problem was probably due to my lack of experience with ClearML. For example, it took me a while to learn that I cannot close the Terminal the agent is running in, even if I used the --detached flag. Also, some of the tasks don't run on my Mac, just as others don't run on the Linux machines, so I guess all these problems mixed together ended up confusing me.

I apologize for the confusion, and really appreciate your time! Thanks!