allegroai / clearml-server

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs

Worker doesn't execute task in docker mode #188

Open GPrev-Lab4i opened 1 year ago

GPrev-Lab4i commented 1 year ago

Worker doesn't execute task in docker mode

Environment

ClearML server version: 1.9.2
OS: Debian 10

Steps to reproduce:

Observed behaviour:

Expected behaviour:

Possible cause of the problem:

I think the agent-services container can see the apiserver container because they are on the same virtual network (backend). The created worker container is not connected to this network, so it cannot reach the apiserver and execute the task.

Ideas on how to solve the problem

Solution A:

I was able to solve this problem by changing the value of "CLEARML_API_HOST" in the file docker-compose.yml, under "agent-services" -> "environment". By default it is set to "http://apiserver:8008/", and I changed it to "http://${CLEARML_HOST_IP}:8008". That way, the worker container connects to the apiserver through the host rather than relying on a virtual network.
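Assuming the stock docker-compose.yml layout, the change described above would look roughly like this (an excerpt sketch, not the full agent-services definition):

```yaml
# docker/docker-compose.yml (excerpt, sketch)
agent-services:
  environment:
    # was: CLEARML_API_HOST: http://apiserver:8008/
    # Using the host IP means worker containers spawned outside the
    # compose-internal "backend" network can still reach the API server.
    CLEARML_API_HOST: http://${CLEARML_HOST_IP}:8008
```

This requires CLEARML_HOST_IP to be set to an address reachable from inside containers (the host's LAN IP, not 127.0.0.1).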

Possible solution B:

Another idea would be to find a way to give the worker container access to the virtual network "backend", so that it could use it to connect to the apiserver.

Possible solution C:

Another idea would be to configure the agent not to spawn a separate docker container, but to create workers within its own container. This might have performance implications, though.

Example

Example task used for testing:

import sys
from clearml import Task
task = Task.init(project_name='test clearml-agent', task_name=f'Test-Agent-{sys.version_info[0]}-{sys.version_info[1]}')
task.execute_remotely()
print('hello world begin')
A = 'hello world'
task.upload_artifact('hello world', artifact_object=A)
print('hello world end')
task.mark_completed()

Log as seen from the ClearML web interface (no further relevant logs were found inside the agent and worker containers):

2023-03-22 15:54:15 Collecting zipp>=3.1.0; python_version < "3.10"

Using cached zipp-3.6.0-py3-none-any.whl (5.3 kB)

Collecting typing-extensions>=3.6.4; python_version < "3.8"

Using cached typing_extensions-4.1.1-py3-none-any.whl (26 kB)

Installing collected packages: attrs, six, orderedmultidict, furl, certifi, urllib3, charset-normalizer, requests, pyparsing, psutil, pyjwt, PyYAML, distlib, zipp, importlib-resources, typing-extensions, importlib-metadata, filelock, platformdirs, virtualenv, pyrsistent, jsonschema, pathlib2, python-dateutil, clearml-agent

Attempting uninstall: six

Found existing installation: six 1.11.0

Uninstalling six-1.11.0:

Successfully uninstalled six-1.11.0

Successfully installed PyYAML-6.0 attrs-22.2.0 certifi-2022.12.7 charset-normalizer-2.0.12 clearml-agent-1.5.1 distlib-0.3.6 filelock-3.4.1 furl-2.1.3 importlib-metadata-4.8.3 importlib-resources-5.4.0 jsonschema-3.2.0 orderedmultidict-1.0.1 pathlib2-2.3.7.post1 platformdirs-2.4.0 psutil-5.9.4 pyjwt-2.4.0 pyparsing-3.0.9 pyrsistent-0.18.0 python-dateutil-2.8.2 requests-2.27.1 six-1.16.0 typing-extensions-4.1.1 urllib3-1.26.15 virtualenv-20.17.1 zipp-3.6.0

WARNING: You are using pip version 20.1.1; however, version 21.3.1 is available.

You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.
2023-03-22 23:42:44 clearml_agent: ERROR: Connection Error: it seems api_server is misconfigured. Is this the ClearML API server http://apiserver:8008 ?
2023-03-22 23:42:44 Process failed, exit code 1
jkhenning commented 1 year ago

Hi @GPrev-Lab4i,

Thanks for the detailed report. Your conclusions seem on point - basically, the intended use is to define the server's URLs as external ones (i.e. not internal docker-network names), at least for the web service and fileserver, since these URLs are used when registering data and are expected to be externally accessible (so that when the information is presented on a remote machine, it points to the correct addresses).

You can, however, quite easily set up the newly spun-up docker container to use the backend network, by configuring the services agent's CLEARML_AGENT_EXTRA_DOCKER_ARGS environment variable with the required docker options, for example CLEARML_AGENT_EXTRA_DOCKER_ARGS=--network=backend
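In docker-compose terms, that suggestion would look something like this (a sketch; "backend" is assumed to match the network name defined in the stock compose file):

```yaml
# docker/docker-compose.yml (excerpt, sketch)
agent-services:
  environment:
    # Extra options passed to `docker run` for each task container;
    # attaching it to the "backend" network lets it resolve the
    # apiserver service name directly.
    CLEARML_AGENT_EXTRA_DOCKER_ARGS: "--network=backend"
```

One caveat: networks created by docker-compose are usually prefixed with the project name (e.g. clearml_backend) when seen from a plain `docker run`, so the exact network name may need adjusting for your deployment.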

GPrev-Lab4i commented 1 year ago

Hi @jkhenning, and thank you for your answer. I did not know about CLEARML_AGENT_EXTRA_DOCKER_ARGS; I will keep it in mind, as it could be useful in other situations. If I understand correctly, the solution I found seems to be in line with the intended use. Do you think it would make sense to edit "docker/docker-compose.yml" with those changes?

jkhenning commented 1 year ago

That's a good question. Basically, hard-coding it to the backend network would mean users are not aware of this and will keep it that way, and might end up with data registered under this internal URL (which would prevent them from accessing it externally). Still, something is better than nothing? 🙂

GPrev-Lab4i commented 1 year ago

Maybe I didn't express it clearly, but I was thinking of the opposite: hard-coding it to use the external IP, i.e. CLEARML_API_HOST: http://${CLEARML_HOST_IP}:8008. That way, if I understand correctly, it should work for every use case.