graelo opened this issue 1 year ago
Hi @graelo,
However there's one thing I could not get to work correctly with podman: the agent-services. I can launch it and see the clearml-services worker in the services queue, tasks can be scheduled on it, but they are blocked as soon as the spun docker job needs to communicate with the API. The docker container of course does not have access to the podman networks where the services live.
I am not sure I understand which services cannot be accessed, and from which docker container exactly (from the agent's container? from the spun task container?) - can you please elaborate?
Thanks @jkhenning
I'll be more precise.
My entire setup runs on podman, so this includes the networks clearml_frontend and clearml_backend (via the netavark driver, for rootless containers).
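For reference, creating those networks amounts to something like this (a sketch: the network names come from the compose file, everything else is an assumption about my setup):

# hedged sketch: rootless podman networks; netavark is the default
# network backend on recent podman releases
podman network create clearml_backend
podman network create clearml_frontend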
The first thing I tried was to run the unmodified allegroai/clearml-agent-services image via podman, so upon start, due to the image's existing entrypoint.sh, that podman container runs clearml-agent daemon --docker .... Before any task is scheduled on the services queue, that agent shows up in the UI as a worker clearml-services on the services queue.
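The launch looked roughly like this (a sketch, not my exact command: only CLEARML_API_HOST matches what the agent actually connects to below; the network name, socket mount and image tag follow the official docker-compose setup, and credentials plus the other CLEARML_* vars are omitted):

# hedged sketch of launching the stock agent-services image via podman
podman run -d --name clearml-agent-services \
  --network clearml_backend \
  -e CLEARML_API_HOST=http://apiserver:8008 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  docker.io/allegroai/clearml-agent-services:latest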
However, when an optimization task is scheduled on the services queue, the agent spins up a docker container, and that docker container hangs after the initial package installation. I believe this is because it cannot access the backend and frontend networks created by podman, so it can't reach apiserver:8008, fileserver:8081, etc. Nothing surprising, in fact; it just took me some time to see that issue ;)
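This is easy to confirm from inside the spun container (assuming curl is available in the task image; debug.ping is the apiserver's health-check endpoint):

# answers from the podman networks, but fails from the docker
# container the agent spins up
curl http://apiserver:8008/debug.ping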
The second thing I tried was to modify the entrypoint.sh in order to launch the agent without the --docker flag and args. I only modified the last line into the following one:
clearml-agent daemon $DAEMON_OPTIONS --queue $QUEUES --cpu-only ${CLEARML_AGENT_EXTRA_ARGS:-$TRAINS_AGENT_EXTRA_ARGS}
You can access that modified image at graelo/clearml-agent-services-pip (it also uses a recent Ubuntu, for crypto support in Python 3.10). In this case, the podman container starts the agent in pip mode: the agent indeed connects to apiserver:8008 etc., it creates the services queue, and it registers itself as a worker, but with an id such as clearml-services:<taskid>. This time, when scheduling an optimizer task on the queue, the task is picked up by the agent, but a new agent is declared on the queue, and the task fails with the following log:
1681837594230 silence info ClearML Task: created new task id=5c848f5e20694b55b8c32d4c86a02b17
ClearML results page: https://app.clearml.example.cc/projects/2da27d3b0d864fe889a6403034a79df8/experiments/5c848f5e20694b55b8c32d4c86a02b17/output/log
1681837609452 clearml-services INFO task 5c848f5e20694b55b8c32d4c86a02b17 pulled from 0872f9496cbb4efb82e22209e5b14253 by worker clearml-services
1681837614568 clearml-services DEBUG Using environment access key CLEARML_API_ACCESS_KEY=2J4MO046IXQI40J380SU
Using environment secret key CLEARML_API_SECRET_KEY=********
Running task '5c848f5e20694b55b8c32d4c86a02b17'
1681837614644 clearml-services ERROR User aborted: stopping task (3)
Am I missing some piece of configuration?
For the second scenario above, I tried setting the worker id and/or name via the CLEARML_WORKER_xx env variables, but the same error occurred. I guess this is because these variables determine the agent id/name when it first registers on the queue, but not what happens when a task is scheduled.
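Concretely, that attempt looked like this (assuming CLEARML_WORKER_ID / CLEARML_WORKER_NAME are the right expansion of the xx placeholder; the values are illustrative):

# set in the container environment before the agent starts
export CLEARML_WORKER_ID=clearml-services
export CLEARML_WORKER_NAME=clearml-services
clearml-agent daemon $DAEMON_OPTIONS --queue $QUEUES --cpu-only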
Thanks for your help!
@graelo - I have the same issue - would like to understand what's going on
@graelo @prassanna-ravishankar, is your setup changed in some way from the default? I can set up an open-source clearml server with the built-in services agent, and the tasks it spins up can reach the clearml server as usual...
I created a pull request (#206) to remove the docker flag when CLEARML_AGENT_NO_DOCKER is set.
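The idea is roughly the following (a sketch of the approach, not the actual diff from the PR):

# in services/entrypoint.sh: only pass --docker when not explicitly disabled
if [ -z "$CLEARML_AGENT_NO_DOCKER" ]; then
    DOCKER_ARGS="--docker"
fi
clearml-agent daemon $DAEMON_OPTIONS --queue $QUEUES $DOCKER_ARGS ${CLEARML_AGENT_EXTRA_ARGS:-$TRAINS_AGENT_EXTRA_ARGS}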
Hi, thanks!
I'll dig into it, but expect some delay before I provide feedback, because it's been a while since I last dug into this.
Hi, thanks for ClearML, it's awesome.
I orchestrate my containers on a single server using podman (managed with systemd units, with everything in zfs datasets). It works great. I'll share my setup, but it's basically a reformulation of the docker-compose file.
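To give an idea of the shape of that setup (a sketch: podman generate systemd is standard podman, and the clearml-apiserver container name is an assumption based on the compose file):

# one user-level systemd unit per container, e.g. for the apiserver:
podman generate systemd --new --name clearml-apiserver \
  > ~/.config/systemd/user/container-clearml-apiserver.service
systemctl --user enable --now container-clearml-apiserver.service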
However there's one thing I could not get to work correctly with podman: the agent-services. I can launch it and see the clearml-services worker in the services queue, tasks can be scheduled on it, but they are blocked as soon as the spun docker job needs to communicate with the API. The docker container of course does not have access to the podman networks where the services live.
I tried to avoid docker in the following way: I built a new image where the services/entrypoint.sh simply starts the agent without the --docker flag and arguments. However, a new worker-id gets added to the queue and, though it's not entirely clear to me why, it messes up the config and the scheduled tasks invariably fail. I probably missed a simple config variable to get this working; could you point me to it? Basically, in the agent-services container, can we run clearml-agent without --docker ${...}? Thanks!
PS: At the moment, I'm back to running my HPO from a non-containerized process, running the agent in pip mode, and it works great.
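For completeness, that fallback amounts to (queue name assumed to be services):

# host-level agent in pip/virtualenv mode, no docker involved
clearml-agent daemon --queue services --cpu-only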