Closed dpkirchner closed 9 months ago
When running strace on the clearml-agent build and docker run commands, I see that docker run exits normally (exit code 0) and the agent is notified via SIGCHLD, but the agent doesn't seem to do anything:
stat("/tmp/.clearml_agent.06kc_f68.cfg", {st_mode=S_IFREG|0600, st_size=4841, ...}) = 0
select(0, NULL, NULL, NULL, {tv_sec=5, tv_usec=0}) = 0 (Timeout)
stat("/tmp/.clearml_agent.06kc_f68.cfg", {st_mode=S_IFREG|0600, st_size=4841, ...}) = 0
select(0, NULL, NULL, NULL, {tv_sec=5, tv_usec=0}) = ? ERESTARTNOHAND (To be restarted if no handler)
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=10, si_uid=0, si_status=0, si_utime=1, si_stime=2} ---
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=510043}) = 0 (Timeout)
stat("/tmp/.clearml_agent.06kc_f68.cfg", {st_mode=S_IFREG|0600, st_size=4841, ...}) = 0
select(0, NULL, NULL, NULL, {tv_sec=5, tv_usec=0}) = 0 (Timeout)
stat("/tmp/.clearml_agent.06kc_f68.cfg", {st_mode=S_IFREG|0600, st_size=4841, ...}) = 0
It continues looking for the "done" line to appear in the dummy .cfg file. I'm not sure os.wait or process.wait are being called anywhere, so I don't know how this is supposed to work.
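For context, here is a minimal sketch of how a parent process can notice that a child has exited and reap it so that it does not linger as a zombie. This is not the agent's code; run_and_reap and its parameters are hypothetical names used only for illustration, and the child command is a stand-in for docker run.

```python
import subprocess
import sys
import time


def run_and_reap(cmd, poll_interval=0.1, timeout=30.0):
    """Start cmd, poll until it exits, and return its exit code.

    On POSIX, Popen.poll() performs a non-blocking waitpid(), so a child
    that has already exited is reaped rather than left as a zombie.
    """
    proc = subprocess.Popen(cmd)
    deadline = time.monotonic() + timeout
    while proc.poll() is None:
        if time.monotonic() > deadline:
            proc.kill()
            proc.wait()  # reap the killed child as well
            raise TimeoutError("child did not exit within the timeout")
        time.sleep(poll_interval)
    return proc.returncode


# Stand-in child process; a real caller would pass the docker command here.
exit_code = run_and_reap([sys.executable, "-c", "print('child done')"])
print("child exit code:", exit_code)
```

A loop that only sleeps and checks a file, as in the strace output above, would never make the waitpid() call, which would be consistent with the zombie process described in this issue.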
Hi @dpkirchner,
I see you added your own entrypoint - for docker images built by the agent, you should leave the entrypoint as is.
Hmm.. if I do that, how does the agent know what code to run inside the container? Is there an environment variable somewhere?
I'm now using the latest clearml-agent code built based on a more recent nvidia/cuda image (see https://github.com/allegroai/clearml-agent/issues/180#issuecomment-1912777944). clearml-agent build no longer hangs; however, it does not build an image either.
My test.py file remains as is and I've removed my Dockerfile. Now to build the image I am running:
docker run --gpus all -it --rm -v $HOME/clearml-agent.conf:/root/clearml.conf -v /tmp:/tmp -v /var/run/docker.sock:/var/run/docker.sock --network clearml_backend --user root -v .:/app --workdir /app --entrypoint clearml-agent clearml-agent:latest build --id dcf8c2d634a74b13a6d0f2e62c203201 --docker --target testing2 --log-level DEBUG --entry-point reuse_task
The most important output, I think, is at the end:
Virtual environment: /root/.clearml/venvs-builds/3.10/bin
Source code: /root/.clearml/venvs-builds/3.10/code/test.py
Entry point: /root/.clearml/venvs-builds/3.10/code/test.py
root@e5b12f93a2e0:/#
Docker build done
Committing docker container to: /app/testing2
None
That None looks to be the result of commit_docker called at https://github.com/allegroai/clearml-agent/blob/95dde6ca0cac717d2094114699c11bd1f0d38040/clearml_agent/commands/worker.py#L2355 . My read of https://github.com/allegroai/clearml-agent/blob/95dde6ca0cac717d2094114699c11bd1f0d38040/clearml_agent/helper/process.py#L145 suggests that the only way this could happen is if get_bash_output returns None. /app/testing2 also looks wrong to me.
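To illustrate the failure mode suspected here, the sketch below shows a helper that swallows subprocess.CalledProcessError and returns None, so the caller never sees the exit status or the error text. This is an assumption about the general pattern, not a copy of the agent's implementation; run_quietly is a hypothetical name.

```python
import subprocess
import sys


def run_quietly(cmd, raise_error=False):
    """Hypothetical helper: return the command's output, or None on failure."""
    try:
        out = subprocess.check_output(cmd, stderr=subprocess.STDOUT)
        return out.decode().strip()
    except subprocess.CalledProcessError:
        if raise_error:
            raise
        # The exit status and captured output are discarded here.
        return None


# Stand-in for a failing command: prints an error message and exits with 1.
failing = [
    sys.executable,
    "-c",
    "import sys; print('invalid reference format'); sys.exit(1)",
]
print(run_quietly(failing))
```

With raise_error=True the same call raises CalledProcessError, and the exception's .output attribute carries the message that was otherwise hidden.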
Enabling raise_error, printing the command and exception's .output:
    except subprocess.CalledProcessError as e:
        print(cmd)
        print('output:')
        print(e.output)
        if raise_error:
            raise
and running again results in:
Committing docker container to: /app/testing2
docker commit --change='ENTRYPOINT if [ ! -s "/tmp/clearml.conf" ] ; then cp ~/default_clearml.conf /tmp/clearml.conf && export CLEARML_CONFIG_FILE=/tmp/clearml.conf; fi ; clearml-agent execute --id dcf8c2d634a74b13a6d0f2e62c203201 --standalone-mode ' 349e128ed3ea6e02868b75fa5810ad2061a4100cc4e25c0734402335c5663a1c /app/testing2
output:
b'invalid reference format\n'
Failed storing requested docker
False
I think I must still be doing something wrong here -- the fact that exceptions and output are all hidden does make it harder to figure out what to do. How do we build docker images using clearml-agent build?
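The invalid reference format error above is consistent with the target being an absolute path: a Docker image name cannot begin with a slash. Below is a rough pre-flight check, assuming a simplified version of Docker's naming rules (lowercase path components separated by slashes, plus an optional tag). It deliberately ignores registry hosts with ports and digests, so it will reject some valid references; looks_like_simple_image_reference is a hypothetical name.

```python
import re

# One path component: lowercase alphanumerics joined by ".", "_", "__" or dashes.
_COMPONENT = r"[a-z0-9]+(?:(?:[._]|__|-+)[a-z0-9]+)*"
# Optional tag: up to 128 word characters, dots, or dashes.
_TAG = r"[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}"
_REFERENCE = re.compile(rf"{_COMPONENT}(?:/{_COMPONENT})*(?::{_TAG})?")


def looks_like_simple_image_reference(name):
    """Return True if name matches the simplified pattern above."""
    return _REFERENCE.fullmatch(name) is not None


for candidate in ("/app/testing2", "testing2", "nvidia/cuda", "Testing2"):
    print(candidate, looks_like_simple_image_reference(candidate))
```

Under these assumptions /app/testing2 is rejected because of its leading slash, while testing2 passes.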
Hi @dpkirchner,
Why are you building from within a docker container? Also, we recommend using the latest clearml-agent version to build, and you're using the agent inside clearml-agent:latest, which is an old, outdated image. Please simply install the latest agent in your workspace using pip and try using it.
I'd prefer to build things inside a docker container so I don't have to worry about having everyone install system and python dependencies on their computers -- python dependency management is kind of a mess, exacerbated by some folks using macOS and others Linux. I believe I am using the latest version of the agent inside the container (--version says 1.7.0).
If this is totally the wrong way to go, could you point me to docs showing how to run the ML tasks inside docker containers, controlled by workers registered to clearml-server?
Why exactly are you trying to use clearml-agent build on multiple (client) machines? The recommended course of action for running tasks inside docker containers is to use the agent in daemon docker mode and have the agent launch tasks in docker containers, without the need to build them.
OK, I guess I'm totally missing something. How do you get the task code to run inside a docker container (run by the agent in docker mode) without building the container with the appropriate COPY lines? I thought that's what the build command is for, getting your code and other relevant files in place.
@dpkirchner, please take a look at the ClearML Agent's documentation, where the entire task execution process is explained in detail: https://clear.ml/docs/latest/docs/clearml_agent
OK, thanks, I'll read that over again to see if I can figure out exactly what I'm not understanding.
I'm trying to build a docker image that my clearml setup will run when I queue a task. I've been adapting the docs found here: https://clear.ml/docs/latest/docs/guides/clearml_agent/exp_environment_containers . When I get to the clearml-agent build step, I see the container code run successfully (I didn't expect that) and eventually the process hangs. When I exec into the build container I see the docker process has died, becoming a zombie:
I suspect I'm just doing something wrong and need to change how my docker container is built; I'm having a lot of trouble finding docs that show how to create the container that will run the task. However, I'm submitting this because there appears to be a bug in the agent code that results in the agent not detecting that the docker command exited.
Here's what I have in my Dockerfile:
and my requirements.txt:
and the code is an exact copy of the example shown on the "new experiment" screen in the web UI. The command I'm running to do the agent build is:
where ., my current working directory, contains my Dockerfile, requirements.txt, and test.py files.