allegroai / clearml-agent

ClearML Agent - ML-Ops made easy. ML-Ops scheduler & orchestration solution
https://clear.ml/docs/
Apache License 2.0

Docker container of the cloned task crashes/gets stuck. #189

Open MattBlue92 opened 4 months ago

MattBlue92 commented 4 months ago

Hello everyone, I installed ClearML Server and ClearML Agent locally with Docker on an Ubuntu Linux system, following the documentation guide. My problem is with cloning a task after the original task's code has been written and logged. If I clone a task and assign it to a queue that uses virtual environment mode for execution, the clone executes all the code correctly. However, if I clone a task and assign it to a queue that uses Docker for execution, the container starts and downloads packages but does not execute the task code. Where am I going wrong?

PS: To be clear, the cloned task container will not crash or die, because it is possible to enter the container with docker exec -it id_container /bin/bash ...so it is as if clearml were merely creating the container.

blinor commented 2 months ago

Hey there, I have the exact same problem. It installs everything with apt and pip and then stops working. Running the same command directly inside the Docker container doesn't work either.

jkhenning commented 2 months ago

Hi,

Can you include a full log of the task execution?

blinor commented 2 months ago

Thanks for the quick response. I started the daemon with:

clearml-agent daemon --queue "4gb" --docker clearml/fractional-gpu:u22-cu11.7-4gb --force-current-version

This is my clearml.conf on the server:

agent {
    # Set GIT user/pass credentials (if user/pass are set, GIT protocol will be set to https)
    git_user:"XXX"
    git_pass:"XXX"
    # all other domains will use public access (no user/pass). Default: always send user/pass for any VCS domain
    git_host:""
    package_manager: {
        type: pip,
        pip_version: [""]
        pytorch_resolve: none
        extra_pip_install_flags: ["--user"]
        extra_index_url: ["XXX"]
    }
    # Force GIT protocol to use SSH regardless of the git url (Assumes GIT user/pass are blank)
    force_git_ssh_protocol: false

    # unique name of this worker, if None, created based on hostname:process_id
    # Overridden with os environment: CLEARML_WORKER_NAME
    worker_id: ""
    docker_use_activated_venv: false
    extra_docker_arguments: ["--pid=host","-e","http_proxy=XXX", "-e","https_proxy=XXX"]
}

More or less, I have tried switching every parameter used in the config.

Besides that, I tried a completely new setup on a different computer with the default config and got the same result. I also tried an older version of the agent (1.6), but that didn't work either. Log attached: log_txt.txt

jkhenning commented 2 months ago

From the looks of it, the execution inside the container cannot reach the ClearML Server. Can you add -e CLEARML_AGENT__AGENT__DEBUG=1 to the task's container arguments (in the execution section) and see if you get more logs from the agent? Also, if you can exec into the container, you can check the contents of the clearml.conf file mapped inside. This might provide some clues.
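When inspecting the clearml.conf mapped into the container, the server addresses are the first thing to look at. The following is only a minimal sketch, assuming the config text has already been copied out of the container into a string; `find_server_urls` is a hypothetical helper, not part of ClearML, and it uses a plain regular expression rather than a real HOCON parser.

```python
import re

# Matches entries such as `"api_server": "http://localhost:8008"` or
# `api_server: http://localhost:8008` (quoted or unquoted keys and values).
SERVER_KEY_RE = re.compile(
    r'"?(api_server|web_server|files_server)"?\s*[:=]\s*"?([^"\s,}]+)"?'
)


def find_server_urls(conf_text: str) -> dict:
    """Return a mapping of server setting name to the URL found in conf_text.

    Hypothetical helper for illustration only. If a key appears more than
    once, the last occurrence wins.
    """
    return {key: url for key, url in SERVER_KEY_RE.findall(conf_text)}
```

Calling `find_server_urls` on the text of the mapped config gives a quick view of which addresses the code inside the container will try to contact.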

blinor commented 2 months ago

Sure thing. Attached: default_conf.txt, task_33470eb457b94a578784674546d3d397.log. The logs don't look any different to me, and the default_conf looks correct as well.

My first thought was that a proxy setting was causing the problem, but on a different machine without any proxies I get the same logs and the same problem.

jkhenning commented 2 months ago

"api_server": "http://localhost:8008"

Is this reachable from inside the container? It seems to me this won't resolve to anything... Try adding --ipc=host to the task container arguments (it would make more sense to put this in the agent's default docker extra args)
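Inside a container, localhost normally refers to the container itself rather than to the host, unless the container shares the host's network. A minimal sketch of the check suggested here, meant to be run from inside the container: it flags a loopback address and then attempts a plain TCP connection. The function names are hypothetical, and the URL is simply the value quoted above.

```python
import ipaddress
import socket
from urllib.parse import urlparse


def is_loopback_url(url: str) -> bool:
    """Return True if the URL's host is localhost or a loopback IP address."""
    host = urlparse(url).hostname or ""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal (for example a DNS name), so not obviously loopback.
        return False


def can_connect(url: str, timeout: float = 3.0) -> bool:
    """Try to open a TCP connection to the URL's host and port."""
    parsed = urlparse(url)
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((parsed.hostname, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    api_server = "http://localhost:8008"  # value quoted from the config above
    print("loopback address:", is_loopback_url(api_server))
    print("TCP reachable:", can_connect(api_server))
```

If the address is loopback and the connection fails from inside the container while succeeding on the host, that points at the server address in the config rather than at the task code.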

blinor commented 2 months ago

You are correct, I can't reach http://localhost:8008. Which setting do you mean by "task container arguments"? Is this agent.extra_docker_arguments?

jkhenning commented 2 months ago

Yes, that would work

blinor commented 2 months ago

Sadly I still cannot reach the API.

jkhenning commented 2 months ago

Where is the server running?

blinor commented 2 months ago

The server runs on the same machine that I am trying to execute my task from. I have also tried this on both Windows and Linux.

blinor commented 2 months ago

Small update: I didn't change anything, but I tried starting an agent in docker mode again and got different output. I now get the following output before it stops making progress:
`
Installing collected packages: distlib, zipp, urllib3, six, rpds-py, PyYAML, pyparsing, pyjwt, psutil, platformdirs, pkgutil-resolve-name, idna, filelock, charset-normalizer, certifi, attrs, virtualenv, requests, referencing, python-dateutil, pathlib2, orderedmultidict, importlib-resources, jsonschema-specifications, furl, jsonschema, clearml-agent

1713784929586 DLB1:gpu1 DEBUG Successfully installed PyYAML-6.0.1 attrs-23.2.0 certifi-2024.2.2 charset-normalizer-3.3.2 clearml-agent-1.8.0 distlib-0.3.8 filelock-3.13.4 furl-2.1.3 idna-3.7 importlib-resources-6.4.0 jsonschema-4.21.1 jsonschema-specifications-2023.12.1 orderedmultidict-1.0.1 pathlib2-2.3.7.post1 pkgutil-resolve-name-1.3.10 platformdirs-4.2.0 psutil-5.9.8 pyjwt-2.8.0 pyparsing-3.1.2 python-dateutil-2.8.2 referencing-0.34.0 requests-2.31.0 rpds-py-0.18.0 six-1.16.0 urllib3-1.26.18 virtualenv-20.25.3 zipp-3.18.1
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

The following additional packages will be installed:
  libblas3 libfreetype6 libgfortran5 libimagequant0 libjbig0 libjpeg-turbo8 libjpeg8 liblapack3 liblbfgsb0 liblcms2-2 libpng16-16 libtiff5 libwebp6 libwebpdemux2 libwebpmux3 python3-decorator python3-numpy python3-olefile python3-pil
Suggested packages:
  liblcms2-utils gfortran python-numpy-doc python3-pytest python3-numpy-dbg python-pil-doc python3-pil-dbg python-scipy-doc
The following NEW packages will be installed:
  libblas3 libfreetype6 libgfortran5 libimagequant0 libjbig0 libjpeg-turbo8 libjpeg8 liblapack3 liblbfgsb0 liblcms2-2 libpng16-16 libtiff5 libwebp6 libwebpdemux2 libwebpmux3 python3-decorator python3-numpy python3-olefile python3-pil python3-scipy
0 upgraded, 20 newly installed, 0 to remove and 32 not upgraded.
Need to get 18.5 MB of archives.
After this operation, 77.4 MB of additional disk space will be used.
Do you want to continue? [Y/n]

`
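The output above ends on apt's confirmation prompt. One possible reading, which is an assumption and not confirmed anywhere in this thread, is that the package install is waiting for interactive input that never arrives. A minimal sketch for spotting that pattern in a saved log; `ends_on_prompt` is a hypothetical helper and only recognises the `[Y/n]` / `[y/N]` style of prompt shown here.

```python
import re

# A yes/no confirmation prompt at the very end of a line, e.g.
# "Do you want to continue? [Y/n]".
PROMPT_RE = re.compile(r"\[(Y/n|y/N)\]\s*$")


def ends_on_prompt(log_text: str) -> bool:
    """Return True if the last non-empty line of the log is a yes/no prompt."""
    lines = [line for line in log_text.splitlines() if line.strip()]
    if not lines:
        return False
    return bool(PROMPT_RE.search(lines[-1]))
```

Running this over the task log would only confirm that the last thing written was a prompt; it does not by itself show why the process stopped.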