actions / runner-container-hooks

Runner Container Hooks for GitHub Actions
MIT License

Seeing `HTTP request failed` when using gha scale set in Kubernetes mode #113

Closed MathiasPius closed 6 months ago

MathiasPius commented 10 months ago

We've been experiencing an absurd number of different issues, some sporadic, some consistent, while configuring the GitHub Actions Runner Scale Set for our cluster. We initially started with a dind configuration to avoid having to use Persistent Volumes, but ran into sporadic and inexplicable errors, and therefore switched to Kubernetes mode using Rancher's local-path-provisioner.

These are the Helm values we're using with version 0.6.1 of the gha scale set (and controller), which supposedly uses v0.4.0 of the k8s container hooks:

githubConfigSecret: repo-name-secret
githubConfigUrl: https://github.com/Org/hidden-repo-name
maxRunners: 16
minRunners: 16
containerMode:
  type: "kubernetes"
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteOnce"]
    storageClassName: "local-path"
    resources:
      requests:
        storage: 1Gi
template:
  spec:
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        env:
          - name: ACTIONS_RUNNER_CONTAINER_HOOKS
            value: /home/runner/k8s/index.js
          - name: ACTIONS_RUNNER_POD_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
          - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
            value: "true"
          - name: ACTIONS_RUNNER_FORCED_INTERNAL_NODE_VERSION
            value: node19
        volumeMounts:
          - name: work
            mountPath: /home/runner/_work

We did not initially set ACTIONS_RUNNER_FORCED_INTERNAL_NODE_VERSION=node19 but ran into problems with the actions/checkout@v4 action attempting to call a node20 binary.

With the above setup, and regardless of whether we use a custom container image that has worked previously or switch to the ubuntu-latest runner image, all jobs fail immediately at the "Initialize containers" stage with the following error:

##[debug]Evaluating success:
##[debug]=> true
##[debug]Result: true
##[debug]Starting: Initialize containers
##[debug]Register post job cleanup for stopping/deleting containers.
Run '/home/runner/k8s/index.js'
##[debug]/home/runner/externals/node16/bin/node /home/runner/k8s/index.js
Error: HttpError: HTTP request failed
Error: Process completed with exit code 1.
Error: Executing the custom container implementation failed. Please contact your self hosted runner administrator.
##[debug]System.Exception: Executing the custom container implementation failed. Please contact your self hosted runner administrator.
##[debug] ---> System.Exception: The hook script at '/home/runner/k8s/index.js' running command 'PrepareJob' did not execute successfully
##[debug]   at GitHub.Runner.Worker.Container.ContainerHooks.ContainerHookManager.ExecuteHookScript[T](IExecutionContext context, HookInput input, ActionRunStage stage, String prependPath)
##[debug]   --- End of inner exception stack trace ---
##[debug]   at GitHub.Runner.Worker.Container.ContainerHooks.ContainerHookManager.ExecuteHookScript[T](IExecutionContext context, HookInput input, ActionRunStage stage, String prependPath)
##[debug]   at GitHub.Runner.Worker.Container.ContainerHooks.ContainerHookManager.PrepareJobAsync(IExecutionContext context, List`1 containers)
##[debug]   at GitHub.Runner.Worker.ContainerOperationProvider.StartContainersAsync(IExecutionContext executionContext, Object data)
##[debug]   at GitHub.Runner.Worker.JobExtensionRunner.RunAsync()
##[debug]   at GitHub.Runner.Worker.StepsRunner.RunStepAsync(IStep step, CancellationToken jobCancellationToken)
##[debug]Finishing: Initialize containers

Any help or insights are greatly appreciated.
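For readers hitting this: the stack trace above shows the runner invoking the hook as a Node script and treating any nonzero exit code as failure. A rough sketch of the input-parsing side of that contract (the field names follow the hook documentation, but treat them as assumptions here):

```javascript
// Sketch of how a container hook receives its input: the runner pipes a
// JSON payload to `node index.js` and treats any nonzero exit code as
// "did not execute successfully", which is why a k8s API failure inside
// the hook surfaces only as "Process completed with exit code 1".
function parseHookInput(raw) {
  const input = JSON.parse(raw); // throws on a malformed payload -> exit 1
  if (!input.command) {
    throw new Error('hook input is missing the "command" field');
  }
  return input;
}

// Illustrative payload shaped like a PrepareJob invocation:
const payload = JSON.stringify({
  command: 'prepare_job',
  responseFile: '/tmp/response.json',
});
console.log(parseHookInput(payload).command); // prepare_job
```

Anything the hook throws before writing its response file is collapsed into the generic error above, which is why the thread below focuses on getting a newer hook build with better error reporting.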

nikola-jokic commented 8 months ago

Hey @MathiasPius,

Why are you specifying the volume again? When you use containerMode, the volume is expanded by the Helm chart, and I think that might be the source of your problem. As for the error message, this PR should help diagnose problems in some situations: https://github.com/actions/runner-container-hooks/pull/123. However, I do believe the source of the problem is the volumeMounts field, which should not be specified.
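A sketch of the suggested change, reusing the values from the first comment: keep containerMode and drop the manual volumeMounts, letting the chart wire the work volume itself:

```yaml
containerMode:
  type: "kubernetes"
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteOnce"]
    storageClassName: "local-path"
    resources:
      requests:
        storage: 1Gi
template:
  spec:
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        # no explicit volumeMounts: with containerMode set, the chart
        # mounts the work volume claim at /home/runner/_work itself
```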

veronluc-ansys commented 7 months ago

Hi @nikola-jokic, I am facing the same issue and am having real trouble debugging it. How could I use the PR you mentioned? Should I rebuild a custom runner image that uses the master branch of the runner-container-hooks repo? Not sure if you got more information in the meantime. Here is the values file I am using:

githubConfigUrl: "X"
githubConfigSecret: X
runnerGroup: "X"

containerMode:
  type: "kubernetes"
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteOnce"]
    storageClassName: "default"
    resources:
      requests:
        storage: 4Gi

template:
  metadata:
    labels:
      app: X
  spec:
    serviceAccountName: X
    securityContext:
      fsGroup: 123
    containers:
      - name: runner
        image: X/actions-runner:latest
        command: [ "/home/runner/run.sh" ]
        resources:
          requests:
            cpu: "2"
            memory: "8Gi"
          limits:
            cpu: "16"
            memory: "32Gi"
    tolerations:
      - key: "os_type"
        operator: "Equal"
        value: "linux"
        effect: "NoSchedule"
    nodeSelector:
      os_type: linux
    dnsPolicy: "None"
    dnsConfig:
      nameservers:
        - X.X.X.X
        - Y.Y.Y.Y
      searches:
        - X.com
        - Y.X.com
      options:
        - name: ndots
          value: "2"
        - name: edns0

Could it maybe come from the taints/tolerations, or the dnsConfig? The volume seems to be mounted correctly on my side.

nikola-jokic commented 7 months ago

Hey @veronluc-ansys, yes, creating an image with a newly built hook would be great! I understand it would be hard to diagnose, but if you could turn on debugging on your workflow and test the latest version of the hook, your input would be very valuable for improving troubleshooting in these situations :relaxed:

veronluc-ansys commented 7 months ago

Quick update on my side: I created a custom runner image using the runner image template. In this image, I cloned and built the main branch of runner-container-hooks in order to get the latest version. The error message is indeed much better, and I figured out what was wrong: the pod could not be created due to some internal policies. If I might suggest, it would be great to have the latest version of the main branch easily accessible, e.g. as release candidate versions; in my case I rebuilt the project inside the Docker image, and I'm not sure whether a better solution existed. Anyway, thank you @nikola-jokic for your support, and let me know if I can contribute to improving this workflow :) I believe this issue can be closed, but releasing these debugging features might be helpful for some people 👌
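For reference, a minimal sketch of the build-from-source approach described above. The stage name, paths, and the packages/k8s layout are assumptions based on the repo's npm workspace structure, not commands taken from the thread:

```dockerfile
# Build stage that compiles the k8s hook from the main branch.
FROM node:20 AS hook-build
RUN git clone https://github.com/actions/runner-container-hooks.git /hooks
WORKDIR /hooks
RUN npm install && npm run build

# The runner expects the hook entrypoint at the path pointed to by
# ACTIONS_RUNNER_CONTAINER_HOOKS (here /home/runner/k8s/index.js), so in
# the final runner image stage, copy the built bundle across, e.g.:
# COPY --from=hook-build /hooks/packages/k8s/dist/index.js /home/runner/k8s/index.js
```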

panbanda commented 6 months ago

I am actually encountering the same thing. The dind mode works but the kubernetes mode has the same HTTP error you mentioned above. Here is my config from terraform:

      runnerGroup: ${var.runner_group}
      githubConfigUrl: ${var.github_config_url}
      githubConfigSecret: ....

      # This works:
      # containerMode:
      #   type: dind
      # template:
      #   spec:
      #     tolerations:
      #       - key: "x.com/nodegroup"
      #         operator: "Equal"
      #         value: "github-runners"
      #         effect: "NoSchedule"
      #     containers:
      #       - name: runner
      #         image: ghcr.io/actions/actions-runner:latest
      #         command: ["/home/runner/run.sh"]
      #         resources:
      #           limits:
      #             cpu: ${var.limit_cpu}
      #             memory: ${var.limit_memory}

      containerMode:
        type: kubernetes
        kubernetesModeWorkVolumeClaim:
          accessModes: ["ReadWriteOnce"]
          storageClassName: gp2
          resources:
            requests:
              storage: 2Gi
      template:
        spec:
          securityContext:
            fsGroup: 1001
          tolerations:
            - key: "x.com/nodegroup"
              operator: "Equal"
              value: "github-runners"
              effect: "NoSchedule"
          containers:
            - name: runner
              image: ghcr.io/actions/actions-runner:2.312.0
              command: ["/home/runner/run.sh"]
              resources:
                limits:
                  cpu: ${var.limit_cpu}
                  memory: ${var.limit_memory}
(screenshot: the failing job output)

I don't think I'm following how you resolved your issue. Any tips?

nikola-jokic commented 6 months ago

Hey @panbanda,

The latest runner version (ghcr.io/actions/actions-runner:2.312.0) rolled back the latest hook version. We had problems with Alpine containers, which should be fixed once version 0.5.1 is out. We are hoping to re-publish the image once that hook version is released, so you can debug this problem.

Currently, you would have to build the hook on your own in order to get a more expressive error message. If you can't wait for the next hook release, please build your hook from this branch. This should eliminate the Alpine checks completely.

panbanda commented 6 months ago

@nikola-jokic thanks for that explanation. So I built the hooks and packaged them in the Dockerfile, per the above comment, like this:

FROM mcr.microsoft.com/dotnet/runtime-deps:6.0 AS build

# Replace value with the latest runner release version
# source: https://github.com/actions/runner/releases
# ex: 2.303.0
ARG RUNNER_VERSION="2.312.0"
ARG RUNNER_ARCH="x64"

ENV DEBIAN_FRONTEND=noninteractive
ENV RUNNER_MANUALLY_TRAP_SIG=1
ENV ACTIONS_RUNNER_PRINT_LOG_TO_STDOUT=1

RUN apt update -y && apt install curl unzip -y

RUN adduser --disabled-password --gecos "" --uid 1001 runner \
    && groupadd docker --gid 123 \
    && usermod -aG sudo runner \
    && usermod -aG docker runner \
    && echo "%sudo   ALL=(ALL:ALL) NOPASSWD:ALL" > /etc/sudoers \
    && echo "Defaults env_keep += \"DEBIAN_FRONTEND\"" >> /etc/sudoers

WORKDIR /home/runner

RUN curl -f -L -o runner.tar.gz https://github.com/actions/runner/releases/download/v${RUNNER_VERSION}/actions-runner-linux-${RUNNER_ARCH}-${RUNNER_VERSION}.tar.gz \
    && tar xzf ./runner.tar.gz \
    && rm runner.tar.gz

# Your branch built and copied in
COPY actions-runner-hooks-k8s-nikola-jokic-fix-is-alpine.zip runner-container-hooks.zip

RUN unzip ./runner-container-hooks.zip -d ./k8s \
    && rm runner-container-hooks.zip

USER runner

However, when I run the action with that container it still outputs the same obscure error.

(screenshot: the same obscure error output)

nikola-jokic commented 6 months ago

Can you also please provide an example workflow file so I can reproduce the issue and fix this for you? It doesn't have to be your real workflow, but the minimal workflow that consistently fails.

panbanda commented 6 months ago

@nikola-jokic thanks for your help. Actually, it's failing on any workflow running on this runner: the CleanupJob hook fails after every step that is permitted to run (in all cases, Initialize containers and Stop containers). Here is what we currently have set up from Karpenter onward:

      runnerGroup: ${var.runner_group}
      githubConfigUrl: ${var.github_config_url}
      githubConfigSecret:
        github_token: ${var.github_token}
      maxRunners: ${var.max_runners}
      minRunners: ${var.min_runners}

      # containerMode:
      #   type: dind
      # template:
      #   spec:
      #     tolerations:
      #       - key: "x.com/nodegroup"
      #         operator: "Equal"
      #         value: "github-runners"
      #         effect: "NoSchedule"
      #     containers:
      #       - name: runner
      #         image: ghcr.io/actions/actions-runner:latest
      #         command: ["/home/runner/run.sh"]
      #         resources:
      #           limits:
      #             cpu: ${var.limit_cpu}
      #             memory: ${var.limit_memory}

      containerMode:
        type: kubernetes
        kubernetesModeWorkVolumeClaim:
          accessModes: ["ReadWriteOnce"]
          storageClassName: gp2
          resources:
            requests:
              storage: 2Gi
      template:
        spec:
          securityContext:
            fsGroup: 1001
          tolerations:
            - key: "x.com/nodegroup"
              operator: "Equal"
              value: "github-runners"
              effect: "NoSchedule"
          containers:
            - name: runner
              # image: ghcr.io/actions/actions-runner:2.310.2
              image: XXX.dkr.ecr.us-east-2.amazonaws.com/actions-runner:latest
              command: ["/home/runner/run.sh"]
              resources:
                limits:
                  cpu: ${var.limit_cpu}
                  memory: ${var.limit_memory}

It does run in dind mode if I uncomment that section; there's something about the k8s mode that I don't quite understand.

panbanda commented 6 months ago

@nikola-jokic thanks for cutting that release so quickly! I wasn't expecting that. Looks like with the latest runner and hooks at 0.5.1 I don't get the HTTP error message. I get another one, but I need to dive into it. Thanks again!

nikola-jokic commented 6 months ago

Thank you for your kind words! Please let us know if you notice any issues debugging it so we can continue to improve the hook experience :relaxed:

nikola-jokic commented 6 months ago

Let's close this one, since runner 2.313.0 has been released with the newest hook version. Please open a new issue if you need further improvements. Thank you all for being active here!

theophileds commented 6 months ago

Hi there,

I've run into an issue where I'm consistently encountering an "HTTP request failed" error when using services with containerMode.kubernetes.

Here's a breakdown of my setup:

gha-runner-scale-set-controller: 0.8.2
gha-runner-scale-set: 0.8.2
actions-runner: 2.313.0

I've configured the runner helm chart with:

containerMode:
    type: "kubernetes"
    kubernetesModeWorkVolumeClaim:
        accessModes: ["ReadWriteOnce"]
        storageClassName: "gp3"
        resources:
            requests:
                storage: 10Gi

template:
    spec:
        # this is required for non-root to access the pvc
        securityContext:
            fsGroup: 123
        containers:
            - name: runner
              image: ghcr.io/actions/actions-runner:latest
              command: ["/home/runner/run.sh"]
              env:
                  # allow jobs without a job container to run
                  - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
                    value: "false"
              resources:
                  requests:
                      memory: 3Gi
                      cpu: 1
                  limits:
                      memory: 3Gi
                      cpu: 1

Additionally, I attempted to build a runner image using the latest version of the k8s hooks, v0.5.1, but unfortunately without success. As a workaround, I reverted to using dind, which seemed to work.

The logs from the runner indicated the following error:

[WORKER 2024-02-24 11:31:06Z ERR  StepsRunner] Caught exception from step: System.Excep
2024-02-24 17:01:06.839
[WORKER 2024-02-24 11:31:06Z ERR  StepsRunner]  ---> System.Exception: The hook script at '/home/runner/k8s/index.js' running command 'CleanupJob' did not execute successfully
2024-02-24 17:01:06.839
[WORKER 2024-02-24 11:31:06Z ERR  StepsRunner]    at GitHub.Runner.Worker.Container.ContainerHooks.ContainerHookManager.ExecuteHookScript[T](IExecutionContext context, HookInput input, ActionRunStage stage, String prependPath)
2024-02-24 17:01:06.839
[WORKER 2024-02-24 11:31:06Z ERR  StepsRunner]    --- End of inner exception stack trace ---
2024-02-24 17:01:06.839
[WORKER 2024-02-24 11:31:06Z ERR  StepsRunner]    at GitHub.Runner.Worker.Container.ContainerHooks.ContainerHookManager.ExecuteHookScript[T](IExecutionContext context, HookInput input, ActionRunStage stage, String prependPath)
2024-02-24 17:01:06.839
[WORKER 2024-02-24 11:31:06Z ERR  StepsRunner]    at GitHub.Runner.Worker.Container.ContainerHooks.ContainerHookManager.CleanupJobAsync(IExecutionContext context, List`1 containers)
2024-02-24 17:01:06.839
[WORKER 2024-02-24 11:31:06Z ERR  StepsRunner]    at GitHub.Runner.Worker.ContainerOperationProvider.StopContainersAsync(IExecutionContext executionContext, Object data)
2024-02-24 17:01:06.839
[WORKER 2024-02-24 11:31:06Z ERR  StepsRunner]    at GitHub.Runner.Worker.JobExtensionRunner.RunAsync()
2024-02-24 17:01:06.839
[WORKER 2024-02-24 11:31:06Z ERR  StepsRunner]    at GitHub.Runner.Worker.StepsRunner.RunStepAsync(IStep step, CancellationToken jobCancellationToken)
2024-02-24 17:01:06.839

Any insights or suggestions on resolving this issue would be greatly appreciated!

nikola-jokic commented 6 months ago

Hey, could you please provide a job that reproduces this issue? Could you also please turn on debug on the workflow so the output of the failure is shown?

theophileds commented 6 months ago

Hey @nikola-jokic

Thanks for your quick response!

Whenever I try to set up the following job, it fails:

name: arc-debug
permissions: write-all
on:
    pull_request:
    push:
        branches:
            - main

env:
    ACTIONS_STEP_DEBUG: true

jobs:
    tests:
        runs-on: arc-runner-set
        name: tests
        services:
            mysql:
                image: mysql:8.0
                ports:
                    - 3306:3306
                env:
                    MYSQL_DATABASE: test
                    MYSQL_USER: test
                    MYSQL_PASSWORD: test
                    MYSQL_ROOT_PASSWORD: test
                options: --health-cmd="mysqladmin ping" --health-interval=5s --health-timeout=5s --health-retries=10

        steps:
            -   name: Checkout
                uses: actions/checkout@v4

Despite enabling debug mode, the output doesn't provide much insight beyond the following error message:

Error: HttpError: HTTP request failed
Error: Process completed with exit code 1.
Error: Executing the custom container implementation failed. Please contact your self hosted runner administrator.

theophileds commented 6 months ago

I finally managed to get more debug output by checking the debug checkbox when re-running the job!

Here's the debug output with actions-runner:2.314.0

##[debug]Evaluating condition for step: 'Stop containers'
##[debug]Evaluating: always()
##[debug]Evaluating always:
##[debug]=> true
##[debug]Result: true
##[debug]Starting: Stop containers
Run '/home/runner/k8s/index.js'
##[debug]/home/runner/externals/node16/bin/node /home/runner/k8s/index.js
Error: HttpError: HTTP request failed
Error: Process completed with exit code 1.
Error: Executing the custom container implementation failed. Please contact your self hosted runner administrator.
##[debug]System.Exception: Executing the custom container implementation failed. Please contact your self hosted runner administrator.
##[debug] ---> System.Exception: The hook script at '/home/runner/k8s/index.js' running command 'CleanupJob' did not execute successfully
##[debug]   at GitHub.Runner.Worker.Container.ContainerHooks.ContainerHookManager.ExecuteHookScript[T](IExecutionContext context, HookInput input, ActionRunStage stage, String prependPath)
##[debug]   --- End of inner exception stack trace ---
##[debug]   at GitHub.Runner.Worker.Container.ContainerHooks.ContainerHookManager.ExecuteHookScript[T](IExecutionContext context, HookInput input, ActionRunStage stage, String prependPath)
##[debug]   at GitHub.Runner.Worker.Container.ContainerHooks.ContainerHookManager.CleanupJobAsync(IExecutionContext context, List`1 containers)
##[debug]   at GitHub.Runner.Worker.ContainerOperationProvider.StopContainersAsync(IExecutionContext executionContext, Object data)
##[debug]   at GitHub.Runner.Worker.JobExtensionRunner.RunAsync()
##[debug]   at GitHub.Runner.Worker.StepsRunner.RunStepAsync(IStep step, CancellationToken jobCancellationToken)
##[debug]Finishing: Stop containers

panbanda commented 5 months ago

Unfortunately, I'm seeing it again as well, so I've had to revert to dind for now. Same log output as yours on any job configuration.

rinoabraham commented 5 months ago

@nikola-jokic I am getting the same error below when I try to hook in and run Kaniko in the workflow to build Dockerfiles on the runners. Below is the workflow error.

Could you please suggest what is wrong here, and whether we are missing anything?

2024-03-22T11:47:29.9068612Z Current runner version: '2.314.1'
2024-03-22T11:47:29.9079451Z Runner name: 'arc-run-5dlvn-runner-bjn42'
2024-03-22T11:47:29.9080441Z Runner group name: 'Default'
2024-03-22T11:47:29.9081580Z Machine name: 'arc-run-5dlvn-runner-bjn42'
2024-03-22T11:47:29.9086893Z ##[group]GITHUB_TOKEN Permissions
2024-03-22T11:47:29.9089313Z Contents: read
2024-03-22T11:47:29.9089918Z Metadata: read
2024-03-22T11:47:29.9090487Z Packages: write
2024-03-22T11:47:29.9091016Z ##[endgroup]
2024-03-22T11:47:29.9095012Z Secret source: Actions
2024-03-22T11:47:29.9095903Z Prepare workflow directory
2024-03-22T11:47:30.2661024Z Prepare all required actions
2024-03-22T11:47:30.2875268Z Complete job name: build
2024-03-22T11:47:30.4927254Z ##[group]Run '/home/runner/k8s/index.js'
2024-03-22T11:47:30.4947714Z shell: /home/runner/externals/node16/bin/node {0}
2024-03-22T11:47:30.4948705Z ##[endgroup]
2024-03-22T11:47:31.2058537Z ##[error]HttpError: HTTP request failed
2024-03-22T11:47:31.2081577Z ##[error]Process completed with exit code 1.
2024-03-22T11:47:31.2137395Z ##[error]Executing the custom container implementation failed. Please contact your self hosted runner administrator.
2024-03-22T11:47:31.2259430Z ##[group]Run '/home/runner/k8s/index.js'
2024-03-22T11:47:31.2261657Z shell: /home/runner/externals/node16/bin/node {0}
2024-03-22T11:47:31.2262326Z ##[endgroup]
2024-03-22T11:47:31.6693876Z ##[error]HttpError: HTTP request failed
2024-03-22T11:47:31.6773413Z ##[error]Process completed with exit code 1.
2024-03-22T11:47:31.6797815Z ##[error]Executing the custom container implementation failed. Please contact your self hosted runner administrator.
2024-03-22T11:47:31.7011999Z Cleaning up orphan processes

My workflow file is as follows:

name: 🧪 Test building with Kaniko
on:
  workflow_dispatch:
jobs:
  build:
    runs-on: arc-run # our new runner set
    container:
      image: gcr.io/kaniko-project/executor:v19.1-debug # the kaniko image
    permissions:
      contents: read # read the repository
      packages: write # push to GHCR, omit if not pushing to GitHub's container registry
    steps:
      - name: Build and push container test
        run: |
          # Write config file, change to your destination registry
          AUTH=$(echo -n ${{ github.actor }}:${{ secrets.GITHUB_TOKEN }} | base64)
          echo "{\"auths\": {\"ghcr.io\": {\"auth\": \"${AUTH}\"}}}" > /kaniko/.docker/config.json
          # Configure git
          export GIT_USERNAME="abcduser***"
          export GIT_PASSWORD="${{ secrets.PAT_TOKEN }}" # works for GHEC or GHES container registry
          # Build and push (sub in your image, of course)
          /kaniko/executor --dockerfile="./Dockerfile" --context=/workspace --skip-tls-verify \
            --context="${{ github.repositoryUrl }}#${{ github.ref }}#${{ github.sha }}" \
            --destination="ghcr.io/abncduser/kaniko-build:latest" \
            --push-retry 5 \
            --image-name-with-digest-file /workspace/image-digest.txt

Below is my values.yaml file

maxRunners: 3
minRunners: 1
template:
  spec:
    initContainers: # needed to set permissions to use the PVC
      - name: kube-init
        image: ghcr.io/actions/actions-runner:latest
        command: ["sudo", "chown", "-R", "runner:runner", "/home/runner/_work"]
        volumeMounts:
          - name: work
            mountPath: /home/runner/_work
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
        env:
          - name: ACTIONS_RUNNER_CONTAINER_HOOKS
            value: /home/runner/k8s/index.js
          - name: ACTIONS_RUNNER_POD_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
          - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
            value: "true" # requires a job container; set "false" to allow non-container steps
        volumeMounts:
          - name: work
            mountPath: /home/runner/_work
containerMode:
  type: "kubernetes"
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteOnce"]
    storageClassName: "k8s-mode"
    resources:
      requests:
        storage: 1Gi