actions / actions-runner-controller

Kubernetes controller for GitHub Actions self-hosted runners
Apache License 2.0

Lost communication with the server due to the GitHub API returning a BadRequest or Forbidden error #3519

Closed Thiry1 closed 6 months ago

Thiry1 commented 6 months ago

Checks

Controller Version

0.8.3

Deployment Method

Helm

To Reproduce

The issue occurs randomly, so a specific reproduction method has not been identified.

Describe the bug

The following error is displayed in the GitHub Actions execution log:

The self-hosted runner: runner-bgkcw-runner-rh47x lost communication with the server. Verify the machine is running and has a healthy network connection. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.

When checking the runner's pod logs, I found the following two errors:

[RUNNER 2024-05-15 07:40:21Z ERR  GitHubActionsService] GET request to https://pipelinesghubeus10.actions.githubusercontent.com/XXX/_apis/distributedtask/pools/1/messages?sessionId=34d4403f-0a18-48ad-a541-f773c0e60b88&status=Online&runnerVersion=2.316.1&os=Linux&architecture=X64&disableUpdate=true failed. HTTP Status: Forbidden
[RUNNER 2024-05-15 07:40:29Z ERR  GitHubActionsService] POST request to https://pipelinesghubeus10.actions.githubusercontent.com/XXX/_apis/oauth2/token failed. HTTP Status: BadRequest

These errors occur simultaneously in jobs running on different nodes.
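To check whether the failures really line up across pods, the runner error lines can be parsed and bucketed by time. The following is a minimal sketch, assuming the log lines follow the same format as the two quoted above; the helper names `parse_runner_errors` and `group_by_minute` are hypothetical and not part of ARC or the runner.

```python
import re
from collections import defaultdict
from datetime import datetime, timezone

# Matches runner error lines shaped like the ones quoted in this issue.
LOG_RE = re.compile(
    r"\[RUNNER (?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})Z\s+ERR\s+"
    r"GitHubActionsService\]\s+(?P<method>[A-Z]+) request to (?P<url>\S+) "
    r"failed\. HTTP Status: (?P<status>\w+)"
)


def parse_runner_errors(lines):
    """Return one dict per matching error line; other lines are skipped."""
    events = []
    for line in lines:
        m = LOG_RE.search(line)
        if m is None:
            continue
        ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S")
        events.append({
            "time": ts.replace(tzinfo=timezone.utc),  # the "Z" suffix means UTC
            "method": m["method"],
            "url": m["url"],
            "status": m["status"],
        })
    return events


def group_by_minute(events_by_pod):
    """Map each UTC minute to the set of pod names that logged an error in it."""
    buckets = defaultdict(set)
    for pod, events in events_by_pod.items():
        for event in events:
            buckets[event["time"].replace(second=0)].add(pod)
    return dict(buckets)
```

Feeding each pod's log (for example, saved from `kubectl logs`) through `parse_runner_errors` and then `group_by_minute` shows which minutes contain errors from more than one pod. That would support the observation that the failures are simultaneous rather than node-specific.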

Describe the expected behavior

The job completes without errors.

Additional Context

ghaRunnerScaleSetValues:
  runnerScaleSetName: "XXX"
  githubConfigUrl: "https://github.com/XXX"
  githubConfigSecret: "github-apps-secret"
  maxRunners: 8
  minRunners: 1
  containerMode:
    type: "kubernetes"
    kubernetesModeWorkVolumeClaim:
      accessModes:
        - "ReadWriteOnce"
      storageClassName: "gp3"
      resources:
        requests:
          storage: "10Gi"
    kubernetesModeServiceAccount:
      annotations: null
  template:
    spec:
      initContainers:
        - name: "kube-init"
          image: "XXX"
          command: ["sudo", "chown", "-R", "1001:1001", "/home/runner/_work"]
          volumeMounts:
            - name: "work"
              mountPath: "/home/runner/_work"
      containers:
        - name: "runner"
          image: "XXX"
          command: ["/bin/bash", "-c"]
          args: ["sudo nohup containerd & sudo nohup buildkitd & /home/runner/run.sh"]
          resources:
            requests:
              cpu: 5
              memory: "16Gi"
            limits:
              cpu: 8
              memory: "32Gi"
          env:
            - name: "ACTIONS_RUNNER_CONTAINER_HOOKS"
              value: "/home/runner/k8s/index.js"
            - name: "ACTIONS_RUNNER_POD_NAME"
              valueFrom:
                fieldRef:
                  fieldPath: "metadata.name"
            - name: "ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER"
              value: "false"
          volumeMounts:
            - name: "work"
              mountPath: "/home/runner/_work"
          securityContext:
            privileged: true
      volumes:
        - name: "work"
          ephemeral:
            volumeClaimTemplate:
              spec:
                accessModes:
                  - "ReadWriteOnce"
                storageClassName: "gp3"
                resources:
                  requests:
                    storage: "10Gi"

Controller Logs

https://gist.github.com/Thiry1/ab048d389e8801946dd85ff4b221bffd

Runner Pod Logs

https://gist.github.com/Thiry1/6ab73735da6d1202dd0000243b04c953

github-actions[bot] commented 6 months ago

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

cb-krishnapatel commented 6 months ago

Hi, I'm also facing a similar issue, with this error: Http response code: NotFound from 'POST https://api.github.com/actions/runner-registration'. Because of this, the runner shut down altogether. Can someone help us out here? Also, a speculation: I'm using a GitHub App for authentication, and the generated token expires within a few hours. Could this be the reason?
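On the token-expiry speculation: GitHub App installation access tokens expire one hour after they are issued, and the API response that creates one includes an `expires_at` timestamp. Whether expiry is related to the errors in this thread is unconfirmed. For custom tooling that holds such a token, a minimal sketch of an expiry check could look like the following. The function name `needs_refresh` and the five-minute margin are assumptions for illustration, not something ARC exposes.

```python
from datetime import datetime, timedelta, timezone


def needs_refresh(expires_at, now=None, margin=timedelta(minutes=5)):
    """Return True if a token expiring at `expires_at` should be renewed.

    `expires_at` is an ISO 8601 UTC string such as "2024-05-15T08:00:00Z".
    `margin` is an assumed safety window; tune it for your own tooling.
    """
    expiry = datetime.strptime(expires_at, "%Y-%m-%dT%H:%M:%SZ")
    expiry = expiry.replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    return now + margin >= expiry
```

Calling `needs_refresh(token_expires_at)` before each API request and requesting a new installation token when it returns True avoids sending requests with a token that is about to lapse.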

ps78674 commented 6 months ago

We have the same problem. After the last step in the job, the runner sends an HTTP DELETE to the GitHub API and (sometimes) gets a 403 status:

[RUNNER 2024-05-16 11:51:53Z ERR GitHubActionsService] DELETE request to https://pipelinesghubeus22.actions.githubusercontent.com/XXX/_apis/distributedtask/pools/1/sessions/55495043-e955-4249-88c6-6b68672da670 failed. HTTP Status: Forbidden

nikola-jokic commented 6 months ago

Hey everyone, are you still experiencing the problem? Maybe it was a temporary error. @ps78674 the error you posted is definitely a temporary error, and is fixed in 0.9.2 :relaxed:

Thiry1 commented 6 months ago

@nikola-jokic

Yes, the error has persisted from the time I created this issue until today. Unfortunately, I migrated to CodeBuild a few hours ago to circumvent this problem. Therefore, it is difficult for me to provide additional information on this issue.

Since many other people do not seem to be experiencing this problem, it might be an issue specific to my environment. Though it may not be relevant, let me explain my configuration: I set up a cluster on AWS EKS, used Karpenter for auto-scaling, and ran ARC runners on EC2 spot instances. Since there are no logs indicating forced termination of spot instances by AWS, it does not seem to be a problem with the spot instances.

Another possible factor might be that I am using ARC for an organization that has a GitHub Enterprise Cloud contract.

joaoluiznaufel commented 6 months ago

Check whether the runners have a sidecar or an init container. This can have side effects when the runner tries to communicate with GitHub. Init containers can be set up by some operators, for example Dynatrace. :)
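One way to perform that check is to compare the containers actually present on a runner pod with the ones declared in the Helm values. The sketch below assumes the pod has been exported with `kubectl get pod <pod-name> -o json` and loaded into a dict; the function name `unexpected_containers` is hypothetical, and the expected names ("runner", "kube-init") are taken from the configuration posted in this issue.

```python
import json


def unexpected_containers(pod, expected_containers=("runner",), expected_init=()):
    """List container names on a pod that were not declared by the user.

    `pod` is the parsed JSON of a single pod object. Anything returned here
    was likely injected by an admission webhook or operator.
    """
    spec = pod.get("spec", {})
    containers = [c["name"] for c in spec.get("containers") or []]
    init_containers = [c["name"] for c in spec.get("initContainers") or []]
    return {
        "containers": [n for n in containers if n not in expected_containers],
        "initContainers": [n for n in init_containers if n not in expected_init],
    }


if __name__ == "__main__":
    import sys

    # Usage: kubectl get pod <pod-name> -o json | python check_pod.py
    pod = json.load(sys.stdin)
    print(unexpected_containers(pod, expected_init=("kube-init",)))
```

An empty result for both keys means the pod runs only what the chart values declare. Any other name is worth investigating as a possible source of interference with the runner's network traffic.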

nikola-jokic commented 6 months ago

Since it seems like other components that are not related to ARC are causing this issue, and the forbidden error is fixed in the latest release, I will close this issue. Please submit another issue if you find the root cause or confirm that ARC is causing this problem.