allegroai / clearml-helm-charts

Helm chart repository for the new unified way to deploy ClearML on Kubernetes. ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs

clearml-agent deployed with the latest Helm chart ignores docker parameters #58

Closed stefano-cherchi closed 2 years ago

stefano-cherchi commented 2 years ago

Just upgraded my clearml-server on Kubernetes to the latest version of the chart:

$ helm list
NAME                    NAMESPACE       REVISION        UPDATED                                 STATUS          CHART                           APP VERSION
clearml-server          default         4               2022-03-29 18:42:34.558007 +0200 CEST   deployed        clearml-3.8.0                   1.3.0      

Until yesterday I've been using a very old version:

repository = "https://allegroai.github.io/clearml-server-helm"
chart      = "clearml-server-chart"
version    = "1.0.2+1"

In the old version, the agents were configured as follows:

agent:
  numberOfClearmlAgents: 2
  nvidiaGpusPerAgent: 0
  defaultBaseDocker: "python:3.8-bullseye"
  dockerMode: true
  agentVersion: "" # if set, it *MUST* include comparison operator (e.g. ">=0.17.2")
  queues: "default"

[...]

Since I use cv2 in some of my experiments, the execution would initially fail with this error:

ImportError: libGL.so.1: cannot open shared object file: No such file or directory

But I could easily solve the issue by just adding this

apt update && apt install -y libgl1-mesa-glx

in the web UI, under Execution -> Container -> SETUP SHELL SCRIPT:

[screenshot: the SETUP SHELL SCRIPT field in the web UI]

Now, in the new deployment, the agents are configured as follows:

agentGroups:
  agent-group-cpu:
    enabled: true
    name: agent-group-cpu
    replicaCount: 2
    updateStrategy: Recreate
    nvidiaGpusPerAgent: 0
    agentVersion: ""  # if set, it *MUST* include comparison operator (e.g. ">=0.16.1")
    queues: "default"  # multiple queues can be specified separated by a space (e.g. "important_jobs default")

    [...]

    image:
      repository: "python"
      pullPolicy: IfNotPresent
      tag: "3.8-bullseye"

    podAnnotations: {}

    nodeSelector: 
      app: "services"

    tolerations: []

    affinity: {}

The problem is that, in the new deployment, the apt install command in the SETUP SHELL SCRIPT field seems to be completely ignored.

valeriano-manassero commented 2 years ago

Hi @stefano-cherchi, the K8s installation chart here spawns the agent groups at Helm install time; we do not spawn new Docker containers when a new task is submitted. I believe that is the point at which SETUP SHELL SCRIPT would normally be executed, so with K8s we never get there.

The first suggestion that comes to mind is to use a different image in the agent group configuration, for example:

    image:
      repository: "dkimg/opencv"
      pullPolicy: IfNotPresent
      tag: "4.3.0-debian"

If I'm not mistaken, this image already contains a working OpenCV setup.

Let me know if this works for you.

/cc @jkhenning do you have better solutions in mind?

stefano-cherchi commented 2 years ago

Hi @valeriano-manassero , thank you for your quick answer.

Does this mean that the "docker mode" isn't available anymore? So why is that field still in the UI?

The solution you suggest is likely to work as a temporary workaround (as would building my own custom image), but my point is that it was super useful and convenient to be able to run commands in the container through the web UI, especially in the early stages of development, when you're experimenting and prototyping at a quick pace.

That allowed us to quickly fix or optimize configurations, and eventually consolidate them after finding the right recipe, instead of building a different custom Docker image for each experiment.

Being forced to change the container image and roll out a new Helm release (or rebuild a custom Docker image) every time I need a small change in my experiment's environment is incredibly time-consuming and inefficient.
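
Just to be concrete about that alternative, the custom-image route would be roughly the following (registry and tag names are only placeholders):

cat > Dockerfile <<'EOF'
FROM python:3.8-bullseye
# cv2 needs libGL at runtime (see the ImportError above)
RUN apt-get update && apt-get install -y --no-install-recommends libgl1-mesa-glx \
 && rm -rf /var/lib/apt/lists/*
EOF
docker build -t my-registry/clearml-base:3.8-bullseye-gl .
docker push my-registry/clearml-base:3.8-bullseye-gl

and then pointing image.repository / image.tag in the agent group values at that image.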

Plus, being able to change the container's configuration for each execution added a lot of agility to development, giving even people with no direct access to the code the opportunity to make fast, significant changes in a super-granular way.

After all, the whole point of the web UI is enabling everybody on the ML team to quickly interact with the experiments, playing with their parameters (installed packages, configurations, properties, etc.) without the overhead of a traditional git workflow.

I think this is a huge step back in terms of MLOps approach. In our specific scenario, this is a strong showstopper.

Would it be possible, in your opinion, to re-enable the previous behaviour?

valeriano-manassero commented 2 years ago

> Does this mean that the "docker mode" isn't available anymore? So why is that field still in the UI?

This Helm chart tries to be as infrastructure/platform-agnostic as possible. Just as an example, Docker is being dropped basically everywhere in the Kubernetes world, so we need to be careful with modes that require access to the underlying engine. That said, today the main way to deploy ClearML is usually Docker Compose, and I'm trying to bring every aspect under Kubernetes.

> I think this is a huge step back in terms of MLOps approach. In our specific scenario, this is a strong showstopper.
>
> Would it be possible, in your opinion, to re-enable the previous behaviour?

Keep in mind that installing/removing packages at runtime may lead to unexpected results, because a Task would no longer be idempotent: a new task could be scheduled on an already-running Agent that was modified by a previous one.

That said, I get your point, and I clearly missed this specific field while creating the chart; I need to dig into it and try to find a way to bring this behaviour back using the k8sglue implementation, which should spawn pods on demand.

valeriano-manassero commented 2 years ago

@stefano-cherchi I may have found a good way to achieve what you need: pls use chart version 3.8.1 (released a few minutes ago) with these values:

agentGroups: []

agentk8sglue:
  enabled: true
  image:
    repository: "allegroai/clearml-agent-k8s"
    tag: "latest"
  serviceAccountName: default
  maxPods: 10
  defaultDockerImage: ubuntu:18.04
  queue: default    # create this queue manually in the UI first for it to work
  id: k8s-agent
  podTemplate:
    volumes: []
      # - name: "yourvolume"
      #   path: "/yourpath"
    env: []
      # # to setup access to private repo, setup secret with git credentials:
      # - name: CLEARML_AGENT_GIT_USER
      #   value: mygitusername
      # - name: CLEARML_AGENT_GIT_PASS
      #   valueFrom:
      #     secretKeyRef:
      #       name: git-password
      #       key: git-password
    resources: {}
      # limits:
      #   nvidia.com/gpu: 1
    tolerations: []
      # - key: "nvidia.com/gpu"
      #   operator: Exists
      #   effect: "NoSchedule"
    nodeSelector: {}
      # fleet: gpu-nodes

I'm assuming you are running your k8s cluster in a private cloud, which is why I suggested the latest tag for clearml-agent-k8s (the other tags are specific to aws or gcp). With this configuration, a new pod will be spawned in your cluster when needed, and it should use SETUP SHELL SCRIPT just like docker-compose installations do. Basically, we use the k8s glue agent instead of always-running agents.
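
To apply it, an upgrade along these lines should do (assuming the chart repo is already added as allegroai and the release is named clearml-server, as in your helm list output):

helm repo update
helm upgrade clearml-server allegroai/clearml --version 3.8.1 -f values.yaml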

Pls let me know if this is working as expected because it may help me to release future improvements.

stefano-cherchi commented 2 years ago

Hey, thank you so much! I'm trying this right away.

Anyway, I don't think idempotence and consistency were at stake even in the "old" configuration.

I could be wrong, but as far as I could tell while using the clearml-server deployed with the old chart, the content of the "CONTAINER" section of the UI was only applied to the "temporary" container that was spun up to execute the experiment. That is to say, libgl1-mesa-glx was only installed in the ephemeral "docker-in-docker" container, not in the Kubernetes pod running the agent. Hence, the agent environment itself wasn't modified.

I'm pretty sure of this, because other experiments run later by the same agent would also fail if they lacked the container setup command in the execution section.

It was pretty much the same thing that happens when you run a CI/CD pipeline in GitLab using a docker executor, or a ClearML experiment in a dedicated "ephemeral" EC2 instance using the AWS Autoscaler.

Anyway, I'll get back to you as soon as possible with feedback about the k8s glue solution.

valeriano-manassero commented 2 years ago

Hi @stefano-cherchi, did you have the chance to try the proposed config? I'd love to get some feedback so I can keep moving forward with chart development. Ty!

stefano-cherchi commented 2 years ago

Hi @valeriano-manassero, my apologies, I've been totally swamped this week. I'm planning to try it this afternoon. Stay tuned!

stefano-cherchi commented 2 years ago

Hey @valeriano-manassero, a first quick bit of feedback. Unfortunately the configuration you suggested doesn't work, because the agent pod fails to start with the following error:

File "/usr/local/lib/python3.6/dist-packages/yaml/scanner.py", line 292, in stale_possible_simple_keys
"could not find expected ':'", self.get_mark())
yaml.scanner.ScannerError: while scanning a simple key
in "/root/template/template.yaml", line 30, column 5
could not find expected ':'
in "/root/template/template.yaml", line 31, column 3

The reason is that there is no pod template file at /root/template/template.yaml inside the container.

[screenshot: the missing /root/template/template.yaml inside the container]

In order to provide such a configuration, I should either build the template file into the container's image or attach it as a volume from a secret, as suggested here: https://github.com/allegroai/clearml-agent/tree/master/docker/k8s-glue
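
The latter would be roughly something like this (secret name and file names are just placeholders):

# store a locally written pod template in a secret...
kubectl create secret generic k8s-glue-pod-template --from-file=template.yaml=./template.yaml
# ...and mount that secret at /root/template in the agent deployment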

I'll try the latter and hope to be able to get back to you with some useful info this afternoon.

valeriano-manassero commented 2 years ago

This is unexpected; during my tests this file was generated from the configuration I posted here (it's under agentk8sglue.podTemplate). Can you please share the values.yaml you used?

stefano-cherchi commented 2 years ago

OK, I've been digging deeper into my setup, and it turned out that all sorts of problems were caused by the fact that I had all the clearml-server services deployed in the default namespace, while the glue agent tries to spin up the new pods in the clearml namespace.

This seems to be hardcoded somewhere in the templates, although I'm not sure: it appears to be set via the namespace=args.namespace argument, but I couldn't find the name of the env variable to override it in the values file (not a big issue).

Anyway, once I reconfigured my Ingress and the clearml-server release to be deployed in the clearml namespace, everything magically fell into place.
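
For the record, the move boiled down to roughly this (release/chart names as above; a Helm release is namespace-scoped, so it had to be reinstalled in the new namespace):

kubectl create namespace clearml
helm uninstall clearml-server -n default
helm install clearml-server allegroai/clearml --version 3.8.1 -n clearml -f values.yaml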

Now the configuration added in the Container section of the UI is passed to the agent as before. In fact, having the experiments run in actual Pods is even better, because it provides more visibility and flexibility:

[screenshot: experiment pods running in the cluster]

Absolutely sweet! Thank you so much for your support @valeriano-manassero. Owe you a beer!

valeriano-manassero commented 2 years ago

Super happy to hear it worked. I noticed the namespace issue and will try to fix it in the next version, making k8sglue the default. I'll keep this issue open until the new release. Ty!

valeriano-manassero commented 2 years ago

With the release of 3.10, the namespace issue should be fixed. Ty for helping!