allegroai / clearml-agent

ClearML Agent - ML-Ops made easy. ML-Ops scheduler & orchestration solution
https://clear.ml/docs/
Apache License 2.0

k8s glue container name merge issue and extra documentation for k8s glue integration #54

Open Shaked opened 3 years ago

Shaked commented 3 years ago

I have integrated clearml agent with our k8s cluster using the k8s glue.

As part of my work, I created the following pod template:

apiVersion: v1
kind: Pod
metadata:
  name: template-name
spec:
  containers:
    - name: test
      env:
      - name: "GIT_SSH_COMMAND"
        value: "ssh -i /root/.ssh/id_rsa"
      - name: TEST_SHAKED
        value: "1"
      volumeMounts:
      - name: full-ro-rsa-sec
        mountPath: /root/.ssh/id_rsa
        subPath: id_rsa
        readOnly: true
  volumes:
    - name: full-ro-rsa-sec
      secret:
        secretName: full-ro-rsa
        defaultMode: 256
        items:
          - key: id_rsa
            path: id_rsa

After some digging, I figured out that I have to install the agent from the master branch, as stated in #51.

When I tried to run the agent, I encountered the following error:

Running kubectl encountered an error: error: error validating "/tmp/clearml_k8stmpl_8288bk09.yml": error validating data: ValidationError(Pod.spec.containers[0].name): invalid type for io.k8s.api.core.v1.Container.name: got "map", expected "string"; if you choose to ignore these errors, turn validation off with --validate=false

My assumption is that somewhere around https://github.com/allegroai/clearml-agent/blob/master/clearml_agent/glue/k8s.py#L444 something is not being merged correctly: instead of overriding - name: test, the merge produces something like - name: { 0: "test", 1: "clearml-....." } or - name: ["test", "clearml-...."]
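To illustrate the behaviour I expected, here is a minimal sketch of a recursive template merge in which scalar values such as a container name are replaced rather than combined. This is not the code from k8s.py; merge_template, the clearml-id-1234 name and the image value are all hypothetical and only serve the example.

```python
from copy import deepcopy


def merge_template(base, override):
    """Recursively merge `override` into a copy of `base` (hypothetical helper).

    Dicts are merged key by key, lists are merged item by item,
    and scalars (such as a container name) are replaced outright.
    """
    if isinstance(base, dict) and isinstance(override, dict):
        merged = deepcopy(base)
        for key, value in override.items():
            if key in base:
                merged[key] = merge_template(base[key], value)
            else:
                merged[key] = deepcopy(value)
        return merged
    if isinstance(base, list) and isinstance(override, list):
        merged = []
        for i in range(max(len(base), len(override))):
            if i < len(base) and i < len(override):
                merged.append(merge_template(base[i], override[i]))
            elif i < len(base):
                merged.append(deepcopy(base[i]))
            else:
                merged.append(deepcopy(override[i]))
        return merged
    # Scalars or mismatched types: the override wins, so a string stays a string.
    return deepcopy(override)


# User-supplied template (a trimmed version of the one above).
template = {"spec": {"containers": [{"name": "test", "env": [{"name": "TEST_SHAKED", "value": "1"}]}]}}
# Values the glue might generate for a task (made up for this example).
generated = {"spec": {"containers": [{"name": "clearml-id-1234", "image": "python:3.9"}]}}

merged = merge_template(template, generated)
print(merged["spec"]["containers"][0]["name"])  # clearml-id-1234
```

With a merge like this the container keeps the env entries from the template and gets a plain string name, which is what kubectl validation expects.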


Regardless, following my conversation with @bmartinn, who helped me a lot via Slack, I wanted to document some important points about working with the agent and k8s, especially because it wasn't clear to me how to inject my SSH key and run a git clone on a private git repository.

clearml.conf

Make sure to set force_git_ssh_protocol: true
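
As I understand it, this setting makes the agent clone over SSH even when the task stores an HTTPS repository URL. The sketch below only illustrates that kind of URL rewriting; https_to_ssh is a hypothetical helper and I have not checked how the agent implements it internally.

```python
import re


def https_to_ssh(url):
    """Rewrite https://host/owner/repo(.git) as git@host:owner/repo(.git)."""
    match = re.match(r"^https?://(?:[^@/]+@)?([^/:]+)/(.+)$", url)
    if not match:
        # Already an SSH URL (or a form this sketch does not handle): leave it untouched.
        return url
    host, path = match.groups()
    return f"git@{host}:{path}"


print(https_to_ssh("https://github.com/allegroai/clearml-agent.git"))
# git@github.com:allegroai/clearml-agent.git
```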

pod-template.yaml

You have to consider two things:

  1. You want to inject your SSH key in a secure way. I am working with Azure Key Vault, which injects a k8s secret using https://github.com/Azure/secrets-store-csi-driver-provider-azure, and then I mount the secret into the pod with the right permissions, i.e. defaultMode: 256 (decimal for octal 0400, read-only for the owner).
  2. You have to ensure that the host of your repo won't stop you from cloning. Otherwise you get an error such as:

    Host key verification failed. fatal: Could not read from remote repository. Please make sure you have the correct access rights and the repository exists.

To make this work, git has to know about your SSH key and must not require strict host key checking. Both can be done by overriding the GIT_SSH_COMMAND environment variable:

apiVersion: v1
kind: Pod
metadata:
  name: template-name
spec:
  containers:
    - env:
      - name: TRAINS_CONFIG_FILE
        value: "/secrets/clearml.conf"
      - name: GIT_SSH_COMMAND
        value: "ssh -i /root/.ssh/id_rsa -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no"
      volumeMounts:
      - name: full-ro-rsa-sec
        mountPath: /root/.ssh/id_rsa
        subPath: id_rsa
        readOnly: true
      - name: clearml-conf-sec
        mountPath: /secrets/clearml.conf
        subPath: clearml.conf
        readOnly: true
  volumes:
    - name: full-ro-rsa-sec
      secret:
        secretName: full-ro-rsa
        defaultMode: 256
        items:
          - key: id_rsa
            path: id_rsa
    - name: clearml-conf-sec
      secret:
        secretName: clearml-conf
        items:
          - key: clearml.conf
            path: clearml.conf
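
Two values in this template are easy to get wrong: the decimal defaultMode and the GIT_SSH_COMMAND string. A small sketch of how they can be derived, assuming the same key path as above; git_ssh_command is a hypothetical helper, not part of the agent.

```python
import shlex
import stat

# Owner read-only permission bit, i.e. octal 0400.
KEY_MODE = stat.S_IRUSR
print(KEY_MODE)       # 256, the decimal value written as defaultMode
print(oct(KEY_MODE))  # 0o400


def git_ssh_command(key_path, strict_host_key_checking=False):
    """Build a GIT_SSH_COMMAND value like the one used in the template."""
    parts = ["ssh", "-i", key_path]
    if not strict_host_key_checking:
        # Skip known_hosts entirely so the first connection is not rejected.
        parts += ["-o", "UserKnownHostsFile=/dev/null", "-o", "StrictHostKeyChecking=no"]
    return " ".join(shlex.quote(part) for part in parts)


print(git_ssh_command("/root/.ssh/id_rsa"))
```

Note that StrictHostKeyChecking=no gives up protection against a spoofed host in exchange for convenience; mounting a known_hosts file next to the key is the stricter alternative.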

Dockerfile

You can use the following Dockerfile in order to create a small and simple agent:

FROM python:3.9-slim

# git is required for the pip install from GitHub below; curl downloads kubectl
RUN apt-get update && apt-get install -y \
    git \
    apt-transport-https gnupg2 curl
# Install the latest stable kubectl, which the k8s glue invokes
RUN curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
RUN install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
ENV CLEARML_CONFIG_FILE=/secrets/clearml.conf
ENV GIT_SSH_COMMAND="ssh -i /root/.ssh/id_rsa"
# Install the agent from the master branch (see #51)
RUN python3 -m pip install git+https://github.com/allegroai/clearml-agent.git
COPY k8s_glue_example.py k8s_glue_example.py

ENTRYPOINT [ "python3", "k8s_glue_example.py" ]

As you can see, I'm also injecting clearml.conf through Azure Key Vault, simply because I prefer to manage my configuration that way. You can do it however you like, though.

k8s deployment

The only thing left is your deployment.yaml (or Helm package). Once you have built the above Docker image, you can run it and pass the relevant arguments, for example:

          args: 
            - --queue
            - "shaked-test"
            - --template-yaml
            - /secrets/pod-template.yaml
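
For reference, this is roughly how the entrypoint script receives those arguments. The sketch mirrors only the two flags shown above; parse_args is a hypothetical stand-in, and the real k8s_glue_example.py defines its own parser with more options.

```python
import argparse


def parse_args(argv=None):
    """Parse the two flags passed in the deployment args (illustrative only)."""
    parser = argparse.ArgumentParser(description="Hypothetical glue launcher arguments")
    parser.add_argument("--queue", type=str, required=True,
                        help="Queue the glue pulls tasks from")
    parser.add_argument("--template-yaml", type=str, default=None,
                        help="Path to the pod template YAML")
    return parser.parse_args(argv)


args = parse_args(["--queue", "shaked-test", "--template-yaml", "/secrets/pod-template.yaml"])
print(args.queue, args.template_yaml)  # shaked-test /secrets/pod-template.yaml
```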

As stated above, I also treat pod-template.yaml as configuration and inject it from outside, so that I don't have to rebuild the image from scratch every time.

All the best, Shaked

UPDATE:

I have added clearml.conf to the pod template volumes; without it you might end up with #55.

jkhenning commented 3 years ago

We've just pushed a fix to the template merging in the k8s glue, worth a try with the latest agent code from the repo 🙂