actions / runner-container-hooks

Runner Container Hooks for GitHub Actions

Initialize Containers - HttpError: HTTP request failed - EKS - containerMode kubernetes #128

Open · carl-reverb opened this issue 6 months ago

carl-reverb commented 6 months ago

When I attempt to run a workflow against a self-hosted runner deployed using the gha-runner-scale-set-controller and gha-runner-scale-set charts, my job fails on the 'Initialize Containers' step.

Runner Scale Set values.yaml:

minRunners: 1
maxRunners: 16

containerMode:
  type: kubernetes
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteOnce"]
    storageClassName: "ebs-gp3-ephemeral"
    resources:
      requests:
        storage: 10Gi

template:
  spec:
    securityContext:
      fsGroup: 123
    containers:
    - name: runner
      image: ghcr.io/actions/actions-runner:latest
      command: ["/home/runner/run.sh"]

In the GitHub UI, after the job is picked up, the following error message appears in the log:

Error: HttpError: HTTP request failed
Error: Process completed with exit code 1.
Error: Executing the custom container implementation failed. Please contact your self hosted runner administrator.

Full error context:

##[debug]Evaluating condition for step: 'Initialize containers'
##[debug]Evaluating: success()
##[debug]Evaluating success:
##[debug]=> true
##[debug]Result: true
##[debug]Starting: Initialize containers
##[debug]Register post job cleanup for stopping/deleting containers.
Run '/home/runner/k8s/index.js'
##[debug]/home/runner/externals/node16/bin/node /home/runner/k8s/index.js
Error: HttpError: HTTP request failed
Error: Process completed with exit code 1.
Error: Executing the custom container implementation failed. Please contact your self hosted runner administrator.
##[debug]System.Exception: Executing the custom container implementation failed. Please contact your self hosted runner administrator.
##[debug] ---> System.Exception: The hook script at '/home/runner/k8s/index.js' running command 'PrepareJob' did not execute successfully
##[debug]   at GitHub.Runner.Worker.Container.ContainerHooks.ContainerHookManager.ExecuteHookScript[T](IExecutionContext context, HookInput input, ActionRunStage stage, String prependPath)
##[debug]   --- End of inner exception stack trace ---
##[debug]   at GitHub.Runner.Worker.Container.ContainerHooks.ContainerHookManager.ExecuteHookScript[T](IExecutionContext context, HookInput input, ActionRunStage stage, String prependPath)
##[debug]   at GitHub.Runner.Worker.Container.ContainerHooks.ContainerHookManager.PrepareJobAsync(IExecutionContext context, List`1 containers)
##[debug]   at GitHub.Runner.Worker.ContainerOperationProvider.StartContainersAsync(IExecutionContext executionContext, Object data)
##[debug]   at GitHub.Runner.Worker.JobExtensionRunner.RunAsync()
##[debug]   at GitHub.Runner.Worker.StepsRunner.RunStepAsync(IStep step, CancellationToken jobCancellationToken)
##[debug]Finishing: Initialize containers

My workflow:

name: git hooks
on: push

jobs:
  pre-commit:
    name: pre-commit
    runs-on: reverbdotcom-general-purpose
    container: summerwind/actions-runner:latest
    steps:
      - run: echo "hello actions"

I have tried a lot of different things to understand what is not working here, but the chain of dependencies and effects is not easy to comprehend. There are a lot of red herrings and other noise in the logs, which led me on several chases around the web, and I spent a while trying security contexts, various container images, etc. At this point I think I have run out of time to figure this out; I will have to fall back to the previous actions-runner-controller and advise my team that the next generation of actions runners is a risk and that we should evaluate alternative CI pipelines.

carl-reverb commented 6 months ago

Well, finding some more time to dig, I went into the source code here and started tracing out the execution path, since the stack trace doesn't give many clues as to where this HTTP request failed. The first thing that probably makes a request is: https://github.com/actions/runner-container-hooks/blob/main/packages/k8s/src/k8s/index.ts#L455

I shell into my pod and install node and then attempt this direct basic implementation:

const k8s = require('@kubernetes/client-node');

const kc = new k8s.KubeConfig();
kc.loadFromDefault();

const k8sApi = kc.makeApiClient(k8s.CoreV1Api);

let main = async () => {
    try {
        const podsRes = await k8sApi.listNamespacedPod('actions-runners');
        console.log(podsRes.body);
    } catch (err) {
        console.error(err);
    }
};

main();

It fails like so:

{
  // ...
  body: {
    kind: 'Status',
    apiVersion: 'v1',
    metadata: {},
    status: 'Failure',
    message: 'Unauthorized',
    reason: 'Unauthorized',
    code: 401
  },
  statusCode: 401
}

So I can presume that the problem is not the fault of the hooks library, but something is wrong with either the service account or the cluster configuration in EKS. There's not a lot of easily-findable documentation on how to perform in-cluster authentication via service account because most users want to authenticate to their cluster from outside, using eksctl or similar.
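One low-level way to test in-cluster authentication, independent of the Node client, is to call the API server directly with the projected token from a shell inside the runner pod (a sketch; the paths are the standard service account mount):

# Run inside the runner pod: call the API server with the mounted token and CA.
SA_DIR=/var/run/secrets/kubernetes.io/serviceaccount
curl -s -o /dev/null -w 'HTTP %{http_code}\n' \
  --cacert "$SA_DIR/ca.crt" \
  -H "Authorization: Bearer $(cat "$SA_DIR/token")" \
  "https://kubernetes.default.svc/api/v1/namespaces/$(cat "$SA_DIR/namespace")/pods"

A 401 here confirms the token itself is being rejected, while a 403 would point at RBAC instead.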

The Role is as configured by the helm chart:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: reverbdotcom-general-purpose-gha-rs-kube-mode
  namespace: actions-runners
rules:
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - get
  - list
  - create
  - delete
- apiGroups:
  - ""
  resources:
  - pods/exec
  verbs:
  - get
  - create
- apiGroups:
  - ""
  resources:
  - pods/log
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - batch
  resources:
  - jobs
  verbs:
  - get
  - list
  - create
  - delete
- apiGroups:
  - ""
  resources:
  - secrets
  verbs:
  - get
  - list
  - create
  - delete

Apparently if there is some RBAC issue I should receive a 403. A 401 indicates that the token was rejected completely. I also checked to see if the token in the client configuration matched the one mounted in the pod, and it does.

I'm out of ideas for now... until I can learn more about debugging 401 with an in-cluster service account token.
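One way to debug a 401 like this is to ask the API server directly whether it accepts the token at all, via a TokenReview submitted from a workstation with cluster access (a sketch using the names from this thread; the pod name is a placeholder):

TOKEN=$(kubectl -n actions-runners exec <runner-pod-name> -c runner -- \
  cat /var/run/secrets/kubernetes.io/serviceaccount/token)
# Ask the API server to authenticate the token; status.authenticated and
# status.user.username in the response show whether and as whom it is accepted.
kubectl create -o yaml -f - <<EOF
apiVersion: authentication.k8s.io/v1
kind: TokenReview
spec:
  token: "$TOKEN"
EOF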

carl-reverb commented 6 months ago

I created a test pod on the cluster in the actions-runners namespace using the latest node image and attached to it, then ran the short js script to test. The result is a 403:

{
  body: {
    kind: 'Status',
    apiVersion: 'v1',
    metadata: {},
    status: 'Failure',
    message: 'pods is forbidden: User "system:serviceaccount:actions-runners:default" cannot list resource "pods" in API group "" in the namespace "actions-runners"',
    reason: 'Forbidden',
    details: { kind: 'pods' },
    code: 403
  },
  statusCode: 403
}

This is expected because I didn't specify a service account, so I got the default service account which has no role bound to it. Next I attempted the same, but specified the reverbdotcom-general-purpose-gha-rs-kube-mode service account.

~ $ kubectl run -it -n actions-runners carl-test --image=node --overrides='{ "spec": { "serviceAccount": "reverbdotcom-general-purpose-gha-rs-kube-mode" } }' -- bash

With this service account, I again get a 401. Since this is an otherwise unrelated pod, which works with the default service account, there must be something wrong with the service account itself.

The default service account has a "mountable secret" but the "reverbdotcom-general-purpose-gha-rs-kube-mode" does not.
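The output below appears to come from kubectl describe; with the names from this thread, the equivalent command would be:

kubectl -n actions-runners describe serviceaccount reverbdotcom-general-purpose-gha-rs-kube-mode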

Name:                reverbdotcom-general-purpose-gha-rs-kube-mode
Namespace:           actions-runners
Labels:              <redacted>
Annotations:         <redacted>
Image pull secrets:  <none>
Mountable secrets:   <none>
Tokens:              <none>
Events:              <none>

We're on Kubernetes 1.24, which is now undocumented, so I can't be sure, but other documentation indicates that it shouldn't be necessary to manually create tokens, and that the admission controller should use the Refresh API to obtain a token for the projected volume when a pod is being scheduled... It definitely obtains a token, but the token is unauthorized.

Off to spend some time digging around in EKS docs and attempting to figure out if there's some configuration setting I need to flip.

carl-reverb commented 6 months ago

After more experimentation I accidentally deleted the service account, and then had to recreate it by forcing a new helm install-upgrade.

Following that, new pods which used the kube-mode service account were able to communicate with the apiserver, but old pods were not. I destroyed the old runner pod and waited for the controller to create a new one, whereupon it was able to make apiserver requests again.

It's unknown why replacing the service account made it start working; I'm monitoring to see if it breaks again after some interval of time. If so, the theory is that the projected token is not being refreshed.
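One way to test that theory (a sketch, with a placeholder pod name) is to look at how the token is projected into the runner pod, i.e. its audience and expirationSeconds, and at when the kubelet last rotated the mounted files:

kubectl -n actions-runners get pod <runner-pod-name> -o yaml | grep -A5 serviceAccountToken
kubectl -n actions-runners exec <runner-pod-name> -c runner -- \
  ls -l /var/run/secrets/kubernetes.io/serviceaccount/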

nikola-jokic commented 5 months ago

Hey @carl-reverb,

Sorry for the late response. Is there any news regarding this issue? Does it work now?

carl-reverb commented 5 months ago

Yes, it's been working now, thank you.

carl-reverb commented 4 months ago

Ok, I reproduced this as I'm rolling out a new set of runners. My first runner, an arm64 kubernetes-mode runner, again presented this issue. The chart version is v0.8.2. The workaround was the same:

  1. Remove the finalizer blocking deletion of the serviceaccount
  2. Delete the serviceaccount
  3. Force-update the helm release

I also happened to deploy an amd64 kubernetes-mode runner at the same time. My job contained a JavaScript action and ran in an Alpine container, so I had to switch it to the amd64 runner. Once again, the HTTP request was denied. I repeated my workaround ... and it works again.
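For reference, a sketch of the workaround as commands, assuming the names used earlier in this thread (the Helm release name and values file are placeholders):

# 1. Drop the finalizer so the service account can be deleted.
kubectl -n actions-runners patch serviceaccount reverbdotcom-general-purpose-gha-rs-kube-mode \
  --type=merge -p '{"metadata":{"finalizers":null}}'
# 2. Delete the service account.
kubectl -n actions-runners delete serviceaccount reverbdotcom-general-purpose-gha-rs-kube-mode
# 3. Force-update the Helm release so the chart recreates it.
helm upgrade --install --force <release-name> \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set \
  -n actions-runners -f values.yaml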

nikola-jokic commented 3 months ago

Could you please write down the exact steps you are taking that land you in this spot? I can't seem to reproduce the issue. It seems like there is some kind of permission issue where the service account is not mounted to the runner container.

Could you please share an example values.yaml file, with anything you would like to hide redacted, and the exact commands you are using to deploy this scale set? I just can't reproduce this issue.

carl-reverb commented 3 months ago

This is AWS EKS version 1.25. I'm sorry I don't have the bandwidth to work on reproduction. I simply install the runner scale set helm chart oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set with a values file such as: https://gist.github.com/carl-reverb/05bb00856a7e5da70e1020fba65bc1ee

My hook extension is:

{
   "apiVersion": "v1",
   "data": {
      "extension.yaml": "\"spec\":\n  \"serviceAccount\": \"gha-job-container\""
   },
   "kind": "ConfigMap",
   "metadata": {
      "name": "gha-runner-scale-set-hook-extension",
      "namespace": "actions-runners"
   }
}

Now, all of this is being installed with Flux, so the order of application is up to those controllers. Because of the nature of the resources, I presume that the kustomize controller runs first and creates my service account, config maps, and the HelmRelease resource, after which the helm controller evaluates the HelmRelease and executes the Helm install.

Sorry I can't remember much more detail than this.

nikola-jokic commented 3 months ago

Oh, please do not apologize, I'm the one who is late on this issue. I think the problem is that we are creating the service account on demand and mounting it on the runner pod. The hook extension does not need a service account; the extension is only scoped to the workflow pod. The service account needs to be mounted on the runner. It is likely that something in the tooling is not mounting the service account properly.

carl-reverb commented 3 months ago

Oh, to explain the hook extension's service account: you're right that it's a red herring, but I do need it because I'm using docker buildx with the native Kubernetes driver, and that driver requires a service account with some role bindings in order to create buildx pods.

Yes, you're absolutely right: the problematic service account is the one for the runner pod, which picks up the workflow and then fails to spawn job pods because it does not have a valid service account token.

nikola-jokic commented 3 months ago

Right, but the role you pasted is not the one the runner needs. This is the actual role that is created for the runner: https://github.com/actions/actions-runner-controller/blob/master/charts/gha-runner-scale-set/templates/kube_mode_role.yaml

Is it possible that an incorrect role binding was made, so the runner did not have enough permissions?
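One way to check the effective permissions of the runner's service account without starting a pod (a sketch using the names from this thread) is kubectl auth can-i with impersonation:

kubectl -n actions-runners auth can-i list pods \
  --as=system:serviceaccount:actions-runners:reverbdotcom-general-purpose-gha-rs-kube-mode
kubectl -n actions-runners auth can-i create jobs \
  --as=system:serviceaccount:actions-runners:reverbdotcom-general-purpose-gha-rs-kube-mode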

timmjd commented 2 months ago

Had the same issue. It seems to happen due to a namespace rename OR a Helm chart upgrade; our IT did both simultaneously.

For me, the following caused the issue:

Somehow you run into the 401 error; for me it was the runner hook doing a GET against the Kubernetes API to check "am I allowed to access a secret". The fix was to delete the Helm release of the ARC runner and redeploy it. I guess the service account did not have the permissions required, either due to the rename or the version upgrade.

I was debugging this for 2 days and only found the origin after implementing #158 / #159. With the trace being available, finding the root cause was an easy job. Maybe @nikola-jokic could have a look at this PR?

carl-reverb commented 2 months ago

Reproduced the issue again, this time on 0.9.1

sofiegonzalez commented 2 months ago

Hi @carl-reverb, previously I was unable to run a job in a container in containerMode: kubernetes because of the HttpError and the runner pod being unable to initialize, but your solution in this comment, where you added the serviceAccount field to the runner spec, solved my issue. BUT I don't understand why. When I look at the runner pod, it contains two serviceAccount definitions now. If I remove the serviceAccount one, it is unable to spin up the workflow pod. Do you know why this is, or why this fix allows the runner pod to start and create the -workflow pod to run the container?

  serviceAccount: gha-runner-scale-set-kube-mode
  serviceAccountName: gha-runner-scale-set-kube-mode

tleerai commented 2 weeks ago

@carl-reverb you are the MVP!

Uninstall the runner set Helm chart, then find all remaining resources in the namespace:

kubectl api-resources --verbs=list --namespaced -o name   | xargs -n 1 kubectl get --show-kind --ignore-not-found -n arc-runners

Edit each remaining resource and remove the finalizer:

finalizers:
  - actions.github.com/cleanup-protection

Then reinstall the runner set. This alone seemed to solve the problem for me.