Closed sofiegonzalez closed 1 month ago
Hello! Thank you for filing an issue.
The maintainers will triage your issue shortly.
In the meantime, please take a look at the troubleshooting guide for bug reports.
If this is a feature request, please review our contribution guidelines.
I was able to spin up a workflow pod by adding the service account to the runner spec:
spec:
  serviceAccount: gha-runner-scale-set-gha-rs-kube-mode
I got the solution from this comment. I don't understand why this fixed my issue, as the pod already has this service account definition in the pod spec on the cluster:
...
securityContext: {}
serviceAccount: gha-runner-scale-set-gha-rs-kube-mode
serviceAccountName: gha-runner-scale-set-gha-rs-kube-mode
...
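For reference, the override being discussed lives in the runner pod template of the scale set's values.yaml. A minimal sketch of what that might look like in kubernetes mode (field values here are assumptions for illustration, not copied from the gists in this thread):

```yaml
# values.yaml sketch for the gha-runner-scale-set chart (kubernetes mode).
# Storage class and image are hypothetical placeholders.
containerMode:
  type: "kubernetes"
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteOnce"]
    storageClassName: "my-storage-class"   # hypothetical
    resources:
      requests:
        storage: 1Gi
template:
  spec:
    # The workaround from the comment above: pin the chart-created
    # kube-mode service account explicitly on the runner pod.
    serviceAccount: gha-runner-scale-set-gha-rs-kube-mode
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
```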
Hey @sofiegonzalez,
Can you please show the AutoscalingRunnerSet yaml definition when you don't specify the service account? I applied a spec similar to yours, and I was able to run the workflow pod.
Hey @nikola-jokic, sorry for the late response.
Yes, so I removed the serviceAccount: gha-runner-scale-set-gha-rs-kube-mode
field from the values.yaml and re-applied it to the cluster. The AutoscalingRunnerSet yaml looks like this for the runners in kubernetes mode: https://gist.github.com/sofiegonzalez/36108f31678e4113f6911d489e1a780d
This is what the AutoscalingRunnerSet looked like previously, with the service account set: https://gist.github.com/sofiegonzalez/a9a8e447924294d060533ea472f6557e
No worries @sofiegonzalez!
I'm glad that you resolved the problem, but I don't understand why adding the serviceAccount
field fixes the issue. The field has been deprecated, so my best guess is that either the old service account is being used during the upgrade, or there is a problem with the older Kubernetes service.
Can you please try installing a new scale set without the serviceAccount field? A fresh install, not an upgrade. If it works, then I might know what the problem is. I cannot reproduce this issue, so I'm trying my best to understand it from the description.
What do you mean by the old service account and older Kubernetes service? The service account I am referencing is the one created by the gha-runner-scale-set Helm chart. We are on Kubernetes v1.27.
I will try a fresh install without the serviceAccount field and update here, but I'm not going to do a fresh install of the gha-runner-scale-set-controller chart unless you think I need to.
Hey @nikola-jokic, just did a fresh install. Here is the values.yaml I used: https://gist.github.com/sofiegonzalez/bc12dd21217bdbba392c481b644527eb and an example workflow I created to run a personal container image in a job: https://gist.github.com/sofiegonzalez/16ae560f6ff3072c754b0eabc1c2850f
This time the workflow pod was able to initialize and run my personal container. I really don't understand what changed; before this, I had done both upgrades and fresh installs while trying to get the workflow pod to start up.
I think I have an idea what the problem was. When doing upgrades, removing additional resources can sometimes take a long time. This problem is fixed by this PR. When you did the upgrade, the resource was probably not completely removed, so after the upgrade, the role associated with that service account was likely in a bad state, causing no token to be mounted on the pod and therefore leaving it without permissions.
That is the reason I asked you to do a fresh install :relaxed:. This should be fixed now, since we merged the PR I linked above.
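For anyone who hits a similar state: the chart-created kube-mode service account is expected to be bound to RBAC that lets the runner's container hook manage workflow pods. A rough sketch of what that Role might contain (rule contents are assumptions based on what a pod-creating hook generally needs, not copied from the chart):

```yaml
# Sketch of the RBAC the kube-mode service account needs (names and
# rules assumed). If the Role or its RoleBinding is stale or missing
# after an upgrade, the workflow pod lacks permissions and fails.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: gha-runner-scale-set-gha-rs-kube-mode
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/exec", "pods/log"]
    verbs: ["get", "list", "watch", "create", "delete"]
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get", "list", "create", "delete"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["get", "list", "create", "delete"]
```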
That makes sense, thanks for the clarification!
No worries! Let's close this issue now; we can re-open it if something else turns out to be a problem, especially since it works with the fresh install and the PR I linked is already merged. Thank you for providing this information! The details written here and in the container hook issue helped me better understand the problem.
Checks
Controller Version
latest
Deployment Method
Helm
Describe the bug
Hi, my main issue is that CI fails when I try to start a container job in
containerMode: kubernetes
with the error Error: HttpError: HTTP request failed. This is blocking us from making progress. I have followed the GitHub Actions scale sets video on YouTube and tried to recreate the same configuration. The main difference is that I am using a PVC I created through a manifest and am applying with Terraform. I am also using a Docker image we built from a public Docker repo; it is pullable without authentication.
Right as the container job starts, the pod dies and fails to initialize. I can see the PVC was bound correctly. I am not sure what the
Error: HttpError: HTTP request failed
error means or what it is referring to.
Describe the expected behavior
The container job should start up and create a workflow pod to run the container.
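A minimal workflow that exercises this path might look like the following sketch (the runner label and image name are assumptions; they must match the scale set's installation name and an image pullable from your cluster):

```yaml
# .github/workflows/container-job.yml (sketch; names are placeholders)
name: container-job
on: push
jobs:
  build:
    # Must equal the Helm release / installation name of the scale set
    runs-on: gha-runner-scale-set
    # In containerMode: kubernetes, this job container runs in a
    # separate workflow pod created by the runner's container hook.
    container:
      image: ghcr.io/my-org/my-image:latest   # hypothetical image
    steps:
      - run: echo "running inside the job container"
```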
Additional Context
Controller Logs
Runner Pod Logs