actions / runner-container-hooks

Runner Container Hooks for GitHub Actions

containerMode Kubernetes: new pods get ErrImagePull #144

Closed jmbravo closed 4 months ago

jmbravo commented 4 months ago

Hi,

I have changed the containerMode from dind to Kubernetes.

The problem is that when I launch the workflow, the newly created pod cannot pull the image. Why could this be happening?

It seems to be a DNS error. On the other hand, the "parent" pod has the regcred secret configured, and in dind mode it pulled without problems.

What could have changed so that it can't pull in kubernetes mode?

Thanks!

    Type     Reason   Age  From     Message
    ----     ------   ---  ----     -------
    Normal   Pulling  24s  kubelet  Pulling image "artifactory.mycompany.com/cloudops-images/ubuntu-sqlplus:1.0"
    Warning  Failed   24s  kubelet  Failed to pull image "artifactory.mycompany.com/cloudops-images/ubuntu-sqlplus:1.0": failed to pull and unpack image "artifactory.mycompany.com/cloudops-images/ubuntu-sqlplus:1.0": failed to resolve reference "artifactory.mycompany.com/cloudops-images/ubuntu-sqlplus:1.0": failed to do request: Head "https://artifactory.mycompany.com/v2/cloudops-images/ubuntu-sqlplus/manifests/1.0": dial tcp: lookup artifactory.mycompany.com on 10.77.252.41:53: no such host
    Warning  Failed   24s  kubelet  Error: ErrImagePull
    Normal   BackOff  24s  kubelet  Back-off pulling image "artifactory.mycompany.com/cloudops-images/ubuntu-sqlplus:1.0"
    Warning  Failed   24s  kubelet  Error: ImagePullBackOff

Edit: my container registry has a public IP, so I need nameserver 8.8.8.8. I don't know why it's trying 10.77.252.41:53.

jmbravo commented 4 months ago

OK, so it seems the workflow pod is not getting the Docker credentials. The parent pod has this volume:

      volumeMounts:
        - mountPath: /home/runner/.docker/
          name: docker-secret
          readOnly: true
      volumes:
        - name: docker-secret
          secret:
            items:
              - key: .dockerconfigjson
                path: config.json
            secretName: regcred

But the workflow pod doesn't have it.

Is this normal?

nikola-jokic commented 4 months ago

Hey @jmbravo,

Please correct me if I'm wrong, but you should configure image pull secrets in this case. In container mode kubernetes, instead of using docker and running docker pull, we run a pod. So if you are using a private image for your pod, you have to configure image pull secrets in order to allow Kubernetes to pull the image properly. Can you please tell me how you are providing those credentials? If credentials are provided within the workflow, the hook will set the imagePullSecrets field.
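For illustration, roughly what ends up on the job pod when the credentials come from the workflow (the hook generates the secret itself; the name below is only a placeholder):

      # sketch only: the hook creates a registry secret from the workflow-supplied
      # credentials and references it on the job pod spec
      imagePullSecrets:
        - name: <runner-name>-secret-<generated-suffix>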

jmbravo commented 4 months ago

Hey @nikola-jokic, thanks for your response.

I tried two things with no luck:

1. Add


      imagePullSecrets:
        - name: regcred

to my RunnerDeployment

2. Add

      imagePullSecrets:
        - name: regcred
      image:
        actionsRunnerImagePullSecrets:
          - regcred

to Helm values.yml

What am I missing?

Where am I supposed to add the imagePullSecrets so my workflow pod gets it?

This is my complete RunnerDeployment YAML:

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: arc-runner-cloudops-test
  namespace: runner
spec:
  template:
    metadata:
    spec:
      imagePullSecrets:
        - name: regcred
      tolerations:
      - key: node-pool
        effect: NoSchedule
        operator: Equal
        value: runner
      - key: "node.kubernetes.io/unreachable"
        operator: "Exists"
        effect: "NoExecute"
        tolerationSeconds: 10
      dnsConfig:
        nameservers:
          - 8.8.8.8
      containerMode: kubernetes
      serviceAccountName: default
      workVolumeClaimTemplate:
        storageClassName: "ebs-pool"
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 10Gi      
      volumeMounts:
        - mountPath: /home/runner/.docker/
          name: docker-secret
          readOnly: true 
      volumes:
        - name: docker-secret
          secret:
            items:
              - key: .dockerconfigjson
                path: config.json
            secretName: regcred
      organization: mycompany
      group: amazon-github-runners-cloudops-test
      labels:
        - arc-runner-cloudops-test
      env:
        - name: ACTIONS_RUNNER_PRINT_LOG_TO_STDOUT
          value: "true"
        - name: DISABLE_RUNNER_UPDATE
          value: "true"
        - name: RUNNER_GRACEFUL_STOP_TIMEOUT
          value: "120"
      terminationGracePeriodSeconds: 180
      imagePullPolicy: IfNotPresent

Thank you!

nikola-jokic commented 4 months ago

Oh of course, happy to help! :relaxed:

With the current setup, the hook will not be able to see the image pull secrets you specified. There are two ways you can do this:

  1. Specify credentials in the workflow file. This way, the credentials are passed by the runner to the hook, and the hook applies them to the new job pod.
  2. Use hook extensions. This approach is slightly harder to maintain, but it allows you to modify the hook behavior and the spec that is applied to the job pod. With the new release, the hook will be able to target service containers for these modifications as well :relaxed:. To configure it, please read the ADR; a rough sketch follows below.
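For example, a minimal sketch of the extension wiring, assuming the ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE mechanism described in the ADR (file paths and names here are placeholders): mount a template file into the runner pod and point the hook at it.

      # sketch only: template file mounted into the runner pod, e.g. from a ConfigMap;
      # its contents are merged into the job pod spec by the hook
      spec:
        imagePullSecrets:
          - name: regcred

      # on the runner container, tell the hook where to find that template
      env:
        - name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
          value: /home/runner/pod-templates/default.yaml
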
jmbravo commented 4 months ago

Thanks, that makes sense!

However, if I understood you correctly, you mean adding the credentials in the GitHub workflow. I have done that, but I still get the same error:

jobs:
  Getting-data-from-RDS:
    runs-on: arc-runner-cloudops-test
    container:
      image: artifactory.mycompany.com/cloudops-test/ubuntu-sqlplus:1.0
      credentials:
        username: $ARTIFACTORY_USER_TEST
        password: $ARTIFACTORY_PASSWORD_TEST

Did you mean this? Sorry, I've been at this for a couple of days now and I'm a bit blind.
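For reference, the documented form for container credentials uses workflow expressions rather than plain environment-variable references, e.g. something like this (assuming secrets exist under these names):

      credentials:
        username: ${{ secrets.ARTIFACTORY_USER_TEST }}
        password: ${{ secrets.ARTIFACTORY_PASSWORD_TEST }}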

nikola-jokic commented 4 months ago

Yes, that is exactly what I meant. Is it not working? Can you please show the output of kubectl get pod $YOUR_JOB_POD -o yaml so we can see what is actually applied to it?

jmbravo commented 4 months ago

It seems I'm totally blind. I can see the imagePullSecrets now:


  imagePullSecrets:
  - name: arc-runner-cloudops-test-7pqzd-868s8-secret-065e7127

But for some reason, I'm getting the same DNS error:

    kubelet  Failed to pull image "artifactory.mycompany.com/test-cloudops-images/ubuntu-sqlplus:1.0": failed to pull and unpack image "artifactory.mycompany.com/test-cloudops-images/ubuntu-sqlplus:1.0": failed to resolve reference "artifactory.mycompany.com/test-cloudops-images/ubuntu-sqlplus:1.0": failed to do request: Head "https://artifactory.mycompany.com/v2/test-cloudops-images/ubuntu-sqlplus/manifests/1.0": dial tcp: lookup artifactory.mycompany.com on 10.77.252.41:53: no such host

I got the base64 content and it's totally fine, so I don't know why the job pods can't resolve my Artifactory URL but can resolve ECR or Docker Hub ones.

To be honest, I don't know whose IP that is. There's no pod or service with the 10.77.252.41 address.

I'm running out of ideas. I also ran the workflow with an ECR image with a sleep, got into the pod, and was able to resolve the Artifactory URL without any problem.

Thanks again!

nikola-jokic commented 4 months ago

Oh, that IP looks like an internal Kubernetes IP, since it is in a private range. I am not sure exactly why the lookup goes through that IP, but this is definitely outside of the hook's control. The port is DNS, so it is probably trying to resolve the name there and failing.
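If it helps as a cross-check, one way to see where that address comes from (a generic sketch; the DNS service name can differ per cluster):

      # is 10.77.252.41 the cluster DNS ClusterIP, or inherited from the node's resolv.conf?
      kubectl get svc -n kube-system kube-dns -o wide
      kubectl exec <your-job-pod> -- cat /etc/resolv.conf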

This is the first time I'm seeing this problem, and I'm curious why you added the DNS config to the deployment. If that is a requirement, then you are probably out of luck and will have to use a hook extension... But in that case, since you have already added credentials to your workflows, your extension only needs to modify the DNS configuration, and that should eliminate the issue. Please let me know if that makes sense.
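For example, an extension template that only overrides DNS settings on the job pod might look roughly like this (again just a sketch, mirroring the dnsConfig you already use on the runner):

      # sketch only: merged into the job pod spec by the hook
      spec:
        dnsConfig:
          nameservers:
            - 8.8.8.8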

jmbravo commented 4 months ago

Actually, adding the DNS config to the RunnerDeployment is not necessary, since I had already added it to my CoreDNS Corefile, but I was desperate and tried it anyway. No luck at all.

Below is my CoreDNS Corefile, and the Artifactory IP resolves from every pod except the workflow one. I am at an impasse.

    .:53 {
        errors
        health {
            lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }
    artifactory.mycompany.com:53 {
        errors
        cache 30
        forward . 8.8.8.8
        reload
    }

Anyway, thanks for your help, I'll keep trying!

jmbravo commented 4 months ago

Closing this. It was a conflict between our private DNS and the kubelet's, since the EKS nodes are in the VPC that has Direct Connect.

Thank you for your patience and support!