allegroai / clearml-agent

ClearML Agent - ML-Ops made easy. ML-Ops scheduler & orchestration solution
https://clear.ml/docs/
Apache License 2.0
231 stars, 89 forks

Image on Docker Hub is out of date #180

Open dpkirchner opened 7 months ago

dpkirchner commented 7 months ago

I'm just getting started with clearml (learning the ropes). Per the README section describing Kubernetes integration, I tried using the image found on Docker Hub, running it outside of k8s: docker run --gpus all -it --rm -v $HOME/clearml-agent.conf:/clearml.conf -v /var/run/docker.sock:/var/run/docker.sock --network clearml_backend --user root. (clearml_backend comes from https://github.com/allegroai/clearml-server/blob/702b6dc9c804165b192a042253ad1d1690c5f0ed/docker/docker-compose.yml and clearml-agent.conf was created by clearml-agent init.)

The output of this command is just: CLEARML_AGENT_UPDATE_VERSION = and the worker does not register. clearml-agent appears to be version 0.17.1 FWIW.

Then I noticed that the image was last updated about 3 years ago. Upgrading the clearml-agent package using pip install --upgrade clearml-agent and bind-mounting the configuration file in the /root directory resolved the problem. However, I'm sure there'll be a lot of other issues when using such an old base image (e.g., old CUDA).

I think this might just be a matter of updating the dockerfile to pin a version of nvidia/cuda (base image) and pushing to hub.

jkhenning commented 7 months ago

Hi @dpkirchner, the link you provided does not seem to work, and I didn't quite understand which image you used.

dpkirchner commented 7 months ago

My bad, I added an extra backtick in the link: https://github.com/allegroai/clearml-server/blob/702b6dc9c804165b192a042253ad1d1690c5f0ed/docker/docker-compose.yml

The image I used was linked from here: https://github.com/allegroai/clearml-agent/blob/c9fc092f4eea9c3890d582aa2a098c3c2f39ce72/README.md#kubernetes-integration-optional (scroll down to Spin ClearML-Agent as a long-lasting service pod).

jkhenning commented 7 months ago

Oh, I see it now. Honestly, I think we should remove this option. It basically spawns tasks as processes inside the agent's pod, which is not a good pattern in k8s. I would recommend using the helm chart instead.

dpkirchner commented 7 months ago

I see, ok. I'll check out the helm chart. Thanks.

dpkirchner commented 7 months ago

It looks like the docker container used by the helm chart is also out of date -- it's running clearml-agent 1.2.4rc3 and using python 3.6. The image that is closest to being up to date is allegroai/clearml:1.14.0-431, however you'll need to install docker and the clearml-agent python package to use it, and it's still a bit out of date.

Through experimentation I've found that if you want to use the latest version, you can check out https://github.com/allegroai/clearml-agent, go to the docker/agent directory and edit Dockerfile, replace FROM nvidia/cuda with FROM nvidia/cuda:12.0.0-devel-ubuntu22.04 (can't use 12.3.1 because of a cuda-related bug in nvidia's image), and then build the image locally (I'm using docker build -t clearml-agent:latest . in the docker/agent directory). Following these steps will get you version 1.7.0.
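The Dockerfile edit described above can be sketched in Python. This is only an illustration under stated assumptions: `pin_base_image` is a hypothetical helper, not part of clearml-agent, and the sample Dockerfile content in the demo is made up (the real file in docker/agent is not quoted in this thread). The pinned tag is the one named above.

```python
import re

# Tag taken from the comment above; adjust if a newer tag works for you.
PINNED_BASE = "nvidia/cuda:12.0.0-devel-ubuntu22.04"


def pin_base_image(dockerfile_text: str, pinned: str = PINNED_BASE) -> str:
    """Replace an unpinned `FROM nvidia/cuda` line with a pinned tag.

    Lines that already carry a tag (for example `FROM nvidia/cuda:11.8.0-base`)
    are left untouched.
    """
    # [ \t]* allows trailing spaces without swallowing the newline.
    pattern = re.compile(r"^FROM[ \t]+nvidia/cuda[ \t]*$", flags=re.MULTILINE)
    return pattern.sub(f"FROM {pinned}", dockerfile_text)


if __name__ == "__main__":
    # Hypothetical Dockerfile content, for demonstration only.
    sample = "FROM nvidia/cuda\nRUN echo placeholder\n"
    print(pin_base_image(sample))
```

After writing the result back to docker/agent/Dockerfile, the image would be built with the command already mentioned above (docker build -t clearml-agent:latest . in the docker/agent directory).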

I'm reopening because I'm not sure if this is all intended. Is the allegroai/clearml-agent docker image deprecated in general?

(I should note that the clearml-agent build command run in this image does not result in a docker image, but I think that's unrelated, and something to be tracked in a different issue.)

jkhenning commented 7 months ago

Hi @dpkirchner,

The docker image used by the clearml-helm-charts/clearml-agent chart is indeed pretty old (we're supposed to update it soon); it is the allegroai/clearml-agent-k8s-base image. However, it is not related to the allegroai/clearml-agent image.

thomsmoreau commented 4 months ago

Hi @dpkirchner, do you have any info about the docker image update on Docker Hub? There are a lot of outdated elements in it, for example k8s_glue_example.py not accepting a list of queues.

I cannot find a proper way to build the image, even with the 'docker' folder from the repository. Would it be possible to provide a README for building it locally?

dpkirchner commented 4 months ago

I wasn't able to figure out how to use clearml properly, unfortunately, so I moved on to another project.

surya9teja commented 3 months ago

@dpkirchner Frankly, I have been hopping between different kinds of MLOps tooling. I started with airflow + mlflow, but it lacks dataset versioning, so I moved to clearml. We use k8s (EKS) for most of our ETL pipelines. I deployed clearml-server, which works fine, but when I tried to deploy clearml-agent in the cluster it had issues accessing the API server: clearml_agent.backend_api.session.session.LoginError: Failed getting token (error 401 from https://api.clear.ml): Unauthorized (invalid credentials) (failed to locate provided credentials). As the clearml documentation is not clear about helm chart deployment, it's really hard to understand the code and do PRs.
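One way to narrow down a 401 like this is to check what the agent's environment actually contains before it starts. The sketch below assumes the credentials reach the agent through the environment variables CLEARML_API_HOST, CLEARML_API_ACCESS_KEY and CLEARML_API_SECRET_KEY; how the helm chart wires them is not shown in this thread. `find_credential_problems` is a hypothetical helper. The check against https://api.clear.ml is only a guess prompted by the error text, which names the hosted service although a self-hosted clearml-server was deployed.

```python
import os
from typing import List, Mapping

REQUIRED_VARS = (
    "CLEARML_API_HOST",
    "CLEARML_API_ACCESS_KEY",
    "CLEARML_API_SECRET_KEY",
)
# Host that appears in the error message quoted above.
HOSTED_API = "https://api.clear.ml"


def find_credential_problems(env: Mapping[str, str], self_hosted: bool = True) -> List[str]:
    """Return human-readable problems found in the given environment mapping."""
    problems = [f"{name} is not set" for name in REQUIRED_VARS if not env.get(name)]
    host = env.get("CLEARML_API_HOST", "").rstrip("/")
    # Assumption: with a self-hosted server, the agent should not talk to the hosted API.
    if self_hosted and host == HOSTED_API:
        problems.append("CLEARML_API_HOST points at the hosted service, not the self-hosted server")
    return problems


if __name__ == "__main__":
    for problem in find_credential_problems(os.environ):
        print("problem:", problem)
```

Running this inside the agent pod would print one line per missing or suspicious setting and nothing if all three variables are present and the host is not the hosted one.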

surya9teja commented 3 months ago

@thomsmoreau As far as I can see, there is a k8s-glue folder which seems to have various versions of the docker images. Depending on your cloud, you can modify the Dockerfile and update the outdated packages.

Note: During the build you have to modify or add clearml.conf with your credentials, as per the Dockerfile script.

I am not a fan of putting credentials into the docker image build, but at the same time the helm chart values have an option to pass the credentials as a secret, which is not working right now.

In terms of passing a list of queues to k8s_glue_example.py, you can pass it as 'queue1,queue2' (check here in values.json of the helm chart). Make sure there are no spaces between the strings.
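For illustration, splitting such a 'queue1,queue2' string into individual queue names could look like the sketch below. `parse_queues` is a hypothetical helper and not the actual code in k8s_glue_example.py. Unlike the advice above, this version tolerates stray spaces by stripping them, which is a design choice of the sketch.

```python
from typing import List


def parse_queues(value: str) -> List[str]:
    """Split a comma-separated queue string into a list of queue names.

    Surrounding whitespace is stripped and empty entries are dropped.
    """
    return [name.strip() for name in value.split(",") if name.strip()]


if __name__ == "__main__":
    print(parse_queues("queue1,queue2"))
```

A single queue name without commas comes back as a one-element list, so callers can treat both cases the same way.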

thomsmoreau commented 3 months ago

@surya9teja I believed that the k8s_glue_example.py file in the docker image was up to date, but it is not. The version in the docker image provided by the chart does not handle the "," separator in the string passed via the "queue" argument, so I had to update it manually: first by doing a curl on the raw link you provided (I pulled the chart and changed the templates manually), and then by building a custom image for my company in which I just changed the script. It works fine! I did this about a month ago.

Since then I haven't checked for updates to the docker images, but I think we would get better results in terms of up-to-date content and performance if the devs could push an update themselves.

Thanks for your message. I should have commented earlier to maybe help other people who were stuck as I was.

thomsmoreau commented 3 months ago

@jkhenning Do you have any info about updating the chart with an up-to-date docker image?