allegroai / clearml-agent

ClearML Agent - ML-Ops made easy. ML-Ops scheduler & orchestration solution
https://clear.ml/docs/
Apache License 2.0

clearml-agent build not building a docker image #182

Closed dpkirchner closed 8 months ago

dpkirchner commented 8 months ago

I'm trying to build a docker image that my clearml setup will run when I queue a task. I've been adapting the docs found here: https://clear.ml/docs/latest/docs/guides/clearml_agent/exp_environment_containers . When I get to the clearml-agent build step, I see the container code run successfully (I didn't expect that) and eventually the process hangs. When I exec into the build container I see the docker process has died, becoming a zombie:

docker exec -it 31ab7d39b748 bash
root@31ab7d39b748:/app# ps uxaww
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  1.7  0.2 110440 45552 pts/0    Ss+  20:26   0:00 /usr/bin/python3 /usr/local/bin/clearml-agent build --id 8d4703d33923453aa68fe2707a751860 --docker --target testing2 --log-level DEBUG
root        10  0.0  0.0      0     0 pts/0    Z+   20:26   0:00 [docker] <defunct>
root        23  0.0  0.0  18520  3328 pts/1    Ss   20:26   0:00 bash
root        38  0.0  0.0  34416  2936 pts/1    R+   20:26   0:00 ps uxaww
root@31ab7d39b748:/app# exit

I suspect I'm just doing something wrong and need to change how my docker container is built; I'm having a lot of trouble finding docs that show how to create the container that will run the task. However, I'm submitting this because there appears to be a bug in the agent code: the agent does not detect that the docker command has exited.

Here's what I have in my Dockerfile:

FROM ubuntu:22.04

RUN apt-get update && \
  DEBIAN_FRONTEND=noninteractive apt-get install -y \
  python3 \
  python3-pip

RUN pip3 install --upgrade pip

WORKDIR /app

COPY requirements.txt .

RUN pip3 install -r requirements.txt

COPY . .

ENTRYPOINT ["python3", "test.py"]

and my requirements.txt:

clearml
clearml-agent
matplotlib==3.5.1
numpy==1.26.3

and the code is an exact copy of the example shown on the "new experiment" screen in the web UI. The command I'm running to do the agent build is:

docker run --gpus all -it --rm   -v $HOME/clearml-agent.conf:/root/clearml.conf -v /tmp:/tmp  -v /var/run/docker.sock:/var/run/docker.sock   --network clearml_backend   --user root   -e CLEARML_API_HOST=http://apiserver:8008   -e CLEARML_WEB_HOST=http://webserver:8080   -e CLEARML_FILES_HOST=http://fileserver:8081   -e CLEARML_AGENT_QUEUES=default  -e CLEARML_API_ACCESS_KEY=xxx -e CLEARML_API_SECRET_KEY=xxx -v .:/app --entrypoint clearml-agent --workdir /app allegroai/clearml-agent:latest build --id 8d4703d33923453aa68fe2707a751860 --docker --target testing2 --log-level DEBUG

where ., my current working directory, contains my Dockerfile, requirements.txt, and test.py files.

dpkirchner commented 8 months ago

When running strace on the clearml-agent build and docker run commands I see that docker run exits normally (exit code 0) and the agent is notified via SIGCHLD, but the agent doesn't seem to do anything:

stat("/tmp/.clearml_agent.06kc_f68.cfg", {st_mode=S_IFREG|0600, st_size=4841, ...}) = 0
select(0, NULL, NULL, NULL, {tv_sec=5, tv_usec=0}) = 0 (Timeout)
stat("/tmp/.clearml_agent.06kc_f68.cfg", {st_mode=S_IFREG|0600, st_size=4841, ...}) = 0
select(0, NULL, NULL, NULL, {tv_sec=5, tv_usec=0}) = ? ERESTARTNOHAND (To be restarted if no handler)
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=10, si_uid=0, si_status=0, si_utime=1, si_stime=2} ---
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=510043}) = 0 (Timeout)
stat("/tmp/.clearml_agent.06kc_f68.cfg", {st_mode=S_IFREG|0600, st_size=4841, ...}) = 0
select(0, NULL, NULL, NULL, {tv_sec=5, tv_usec=0}) = 0 (Timeout)
stat("/tmp/.clearml_agent.06kc_f68.cfg", {st_mode=S_IFREG|0600, st_size=4841, ...}) = 0

It continues looking for the "done" line to appear in the dummy .cfg file. I'm not sure os.wait or process.wait are being called anywhere, so I don't know how this is supposed to work.
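
The failure mode described above (a loop that only watches for a marker and never checks on the child) can be illustrated with a minimal sketch. This is not the agent's actual code: `wait_for_marker_or_exit` and `marker_found` are hypothetical names, and the marker check is stubbed out. It only shows that calling `Popen.poll()` inside the loop both detects the child's exit and reaps it, so the child does not linger as a zombie.

```python
import subprocess
import sys
import time


def wait_for_marker_or_exit(proc, marker_found, poll_interval=0.05, timeout=5.0):
    """Hypothetical helper: wait until a marker appears or the child exits.

    Returns "marker", "exited", or "timeout".
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if marker_found():
            return "marker"
        # poll() returns None while the child is running; once the child
        # has exited it returns the exit code and reaps the process.
        if proc.poll() is not None:
            return "exited"
        time.sleep(poll_interval)
    return "timeout"


# A child that exits immediately without ever producing the marker.
child = subprocess.Popen([sys.executable, "-c", "pass"])
result = wait_for_marker_or_exit(child, marker_found=lambda: False)
```

With a check like this, a child that exits before the marker shows up ends the wait instead of leaving the parent polling forever.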

jkhenning commented 8 months ago

Hi @dpkirchner,

I see you added your own entrypoint. For docker images built by the agent, you should leave the entrypoint as is.

dpkirchner commented 8 months ago

Hmm.. if I do that, how does the agent know what code to run inside the container? Is there an environment variable somewhere?

dpkirchner commented 8 months ago

I'm now using the latest clearml-agent code built based on a more recent nvidia/cuda image (see https://github.com/allegroai/clearml-agent/issues/180#issuecomment-1912777944). clearml-agent build no longer hangs, however it does not build an image, either.

My test.py file remains as is and I've removed my Dockerfile. Now to build the image I am running:

docker run --gpus all -it --rm   -v $HOME/clearml-agent.conf:/root/clearml.conf -v /tmp:/tmp  -v /var/run/docker.sock:/var/run/docker.sock   --network clearml_backend   --user root -v .:/app --workdir /app --entrypoint clearml-agent clearml-agent:latest build --id dcf8c2d634a74b13a6d0f2e62c203201 --docker --target testing2 --log-level DEBUG --entry-point reuse_task

The most important output, I think, is at the end:

Virtual environment: /root/.clearml/venvs-builds/3.10/bin
Source code: /root/.clearml/venvs-builds/3.10/code/test.py
Entry point: /root/.clearml/venvs-builds/3.10/code/test.py
root@e5b12f93a2e0:/#
Docker build done
Committing docker container to: /app/testing2
None

That None looks to be the result of commit_docker called at https://github.com/allegroai/clearml-agent/blob/95dde6ca0cac717d2094114699c11bd1f0d38040/clearml_agent/commands/worker.py#L2355 . My read of https://github.com/allegroai/clearml-agent/blob/95dde6ca0cac717d2094114699c11bd1f0d38040/clearml_agent/helper/process.py#L145 suggests that the only way this could happen is if get_bash_output returns None. /app/testing2 also looks wrong to me.
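
To make the suspected mechanism concrete, here is a minimal sketch of a command wrapper that swallows failures. `run_quiet` is a hypothetical name and this is not the agent's implementation; it only assumes the pattern described above, where a failed command yields None unless the caller asks for the exception.

```python
import subprocess
import sys


def run_quiet(cmd, raise_error=False):
    """Hypothetical helper: return the command's output, or None on failure."""
    try:
        out = subprocess.check_output(cmd, stderr=subprocess.STDOUT)
        return out.decode().strip()
    except subprocess.CalledProcessError:
        if raise_error:
            raise
        # The error message is discarded here; the caller only sees None.
        return None


# A stand-in command that prints an error and exits with a non-zero status.
failing = [
    sys.executable,
    "-c",
    "import sys; print('invalid reference format'); sys.exit(1)",
]

silent_result = run_quiet(failing)

try:
    run_quiet(failing, raise_error=True)
    captured = None
except subprocess.CalledProcessError as exc:
    # The command's output is still available on the exception.
    captured = exc.output.decode().strip()
```

In the silent case the only visible result is None; the underlying message is recoverable only from the exception's `output` attribute.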

Enabling raise_error, printing the command and exception's .output:

    except subprocess.CalledProcessError as e:
        print(cmd)
        print('output:')
        print(e.output)
        if raise_error:
            raise

and running again results in:

Committing docker container to: /app/testing2
docker commit --change='ENTRYPOINT if [ ! -s "/tmp/clearml.conf" ] ; then cp ~/default_clearml.conf /tmp/clearml.conf && export CLEARML_CONFIG_FILE=/tmp/clearml.conf;  fi ; clearml-agent execute --id dcf8c2d634a74b13a6d0f2e62c203201 --standalone-mode ' 349e128ed3ea6e02868b75fa5810ad2061a4100cc4e25c0734402335c5663a1c /app/testing2
output:
b'invalid reference format\n'
Failed storing requested docker
False

I think I must still be doing something wrong here -- the fact that exceptions and output are all hidden does make it harder to figure out what to do. How do we build docker images using clearml-agent build?
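
On the `invalid reference format` message: Docker repository names are made of lowercase path components separated by `/`, so a name with a leading slash such as `/app/testing2` is rejected. The sketch below is a simplified approximation of that rule, not Docker's full reference grammar (it ignores registry hosts with ports and digests), and `looks_like_image_reference` is a hypothetical helper name.

```python
import re

# One path component: lowercase alphanumerics, optionally joined by
# ".", "_", "__", or one or more "-".
_COMPONENT = r"[a-z0-9]+(?:(?:[._]|__|-+)[a-z0-9]+)*"
# Optional tag: up to 128 characters, must not start with "." or "-".
_TAG = r"[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}"
_REFERENCE_RE = re.compile(rf"{_COMPONENT}(?:/{_COMPONENT})*(?::{_TAG})?")


def looks_like_image_reference(ref: str) -> bool:
    """Simplified check of a 'name[:tag]' image reference (assumption: no registry host)."""
    return _REFERENCE_RE.fullmatch(ref) is not None


checks = {
    ref: looks_like_image_reference(ref)
    for ref in ["/app/testing2", "testing2", "app/testing2", "testing2:latest", "Testing2"]
}
```

Under this approximation, `testing2` and `app/testing2` pass while `/app/testing2` does not, which is consistent with the commit failing once the target was turned into an absolute path.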

jkhenning commented 8 months ago

Hi @dpkirchner,

Why are you building from within a docker container? Also, we recommend using the latest clearml-agent version to build, and you're using the agent inside clearml-agent:latest, which is an outdated image. Please simply install the latest agent in your workspace using pip and try using it.

dpkirchner commented 8 months ago

I'd prefer to build things inside a docker container so I don't have to worry about having everyone install system and python dependencies on their computers -- python dependency management is kind of a mess, exacerbated by some folks using macOS and others Linux. I believe I am using the latest version of the agent inside the container (--version says 1.7.0).

If this is totally the wrong way to go, could you point me to docs showing how to run the ML tasks inside docker containers, controlled by workers registered to clearml-server?

jkhenning commented 8 months ago

Why exactly are you trying to use clearml-agent build on multiple (client) machines? The recommended way to run tasks inside docker containers is to use the agent in daemon docker mode and have the agent launch tasks in docker containers, without the need to build them.

dpkirchner commented 8 months ago

OK, I guess I'm totally missing something. How do you get the task code to run inside a docker container (run by the agent in docker mode) without building the container with the appropriate COPY lines? I thought that's what the build command is for, getting your code and other relevant files in place.

jkhenning commented 8 months ago

@dpkirchner, please take a look at the ClearML Agent documentation, where the entire task execution process is explained in detail: https://clear.ml/docs/latest/docs/clearml_agent

dpkirchner commented 8 months ago

OK, thanks, I'll read that over again to see if I can figure out exactly what I'm not understanding.