actions / actions-runner-controller

Kubernetes controller for GitHub Actions self-hosted runners

ARC pod deletion issue #2561

Open · victoruzunovjuro opened this issue 1 year ago

victoruzunovjuro commented 1 year ago

Controller Version

0.27.3

Helm Chart Version

0.23.2

CertManager Version

v1.11.1

Deployment Method

Helm

cert-manager installation

Yes.

Resource Definitions

---
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: github-runners-spot-large
  namespace: actions-runner-system
spec:
  template:
    metadata:
      annotations:
        app.kubernetes.io/name: github-runners
    spec:
      organization: <snip>
      labels:
        - large
      env:
        - name: ACTIONS_RUNNER_PRINT_LOG_TO_STDOUT
          value: "true"
        - name: DISABLE_RUNNER_UPDATE
          value: "true"
        - name: RUNNER_GRACEFUL_STOP_TIMEOUT
          value: "80"
      terminationGracePeriodSeconds: 110
      imagePullPolicy: IfNotPresent
      image: summerwind/actions-runner:ubuntu-20.04
      tolerations:
        - key: "github.com/actions-runner-controller"
          operator: "Exists"
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                - key: "github.com/actions-runner-controller"
                  operator: In
                  values:
                    - "spot"
      # Mount secrets so docker commands work
      containers:
        - name: runner
          volumeMounts:
            - name: docker-secret
              mountPath: "/home/runner/.docker_ro/"
          lifecycle:
            postStart:
              exec:
                command:
                  - "/bin/sh"
                  - "-c"
                  - "mkdir -p /home/runner/.docker && cp /home/runner/.docker_ro/config.json /home/runner/.docker/config.json"
        - name: docker
          volumeMounts:
            - name: docker-secret
              mountPath: "/home/runner/.docker/"
      volumes:
        - name: docker-secret
          secret:
            secretName: regcred
            items:
              - key: .dockerconfigjson
                path: config.json
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: runners
  namespace: actions-runner-system
spec:
  scaleDownDelaySecondsAfterScaleOut: 120
  scaleTargetRef:
    kind: RunnerDeployment
    name: github-runners-spot-large
  minReplicas: 2
  maxReplicas: 50
  scaleUpTriggers:
    - githubEvent:
        workflowJob: {}
      duration: "30m"
  scheduledOverrides:
    # Override minReplicas to 0 between 21:00-10:00 on Weekdays
    - startTime: "2023-01-02T20:00:00+00:00"
      endTime: "2023-01-03T07:00:00+00:00"
      recurrenceRule:
        frequency: Daily
      minReplicas: 0

    # Override the replicas to 0 on Weekends
    - startTime: "2023-01-07T07:00:00+00:00"
      endTime: "2023-01-08T20:00:00+00:00"
      recurrenceRule:
        frequency: Weekly
      minReplicas: 0

To Reproduce

I have not been able to reproduce it.

Describe the bug

We are using ARC to provision runners on Bottlerocket spot instances in EKS, with the webhook-based autoscaling approach.

We have noticed that ARC will sporadically delete pods. Steps we've taken to troubleshoot this:

  1. Our first assumption was a spot instance interruption, but that does not seem to be the case. We track spot interruptions closely: there are 2-3 a month, while this "ARC pod deletion issue" happens daily.
  2. It is not resource pressure either, as we've seen it happen on hosts using less than 30% of their memory.
  3. Looking at the logs on the GitHub side, there was no job-completion webhook event sent to ARC's webhook server.
  4. Tracing the Kubernetes API audit logs, we see that while the job is running, ARC sends a delete request to the API server (see the sketch below).
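
For context, this is roughly how the audit trail can be filtered to confirm who issued the delete (a sketch only: it assumes the audit events have been exported as newline-delimited JSON to a local file, audit-events.json, which is a hypothetical name):

```shell
# Sketch: filter API-server audit events for deletions of Runner resources
# issued by ARC's service account. On EKS these events typically have to be
# exported first from the kube-apiserver-audit log streams in CloudWatch.
jq -c 'select(.verb == "delete"
              and .objectRef.resource == "runners"
              and .user.username == "system:serviceaccount:actions-runner-system:actions-runner-controller")' \
  audit-events.json
```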

Below are the logs for one case: github-runners-spot-large-sw7cb-lx6sc.

Here is a Kubernetes API audit event at 2023-05-03T12:04:29.433+03:00 showing the delete issued by ARC:

{
  "kind": "Event",
  "apiVersion": "audit.k8s.io/v1",
  "level": "Metadata",
  "auditID": "002876c0-ba06-4e1e-ada1-707a02af4fc6",
  "stage": "ResponseComplete",
  "requestURI": "/apis/actions.summerwind.dev/v1alpha1/namespaces/actions-runner-system/runners/github-runners-spot-large-sw7cb-lx6sc",
  "verb": "delete",
  "user": {
    "username": "system:serviceaccount:actions-runner-system:actions-runner-controller",
    "uid": "03f0e603-7564-44eb-9ac9-44d983533e01",
    "groups": [
      "system:serviceaccounts",
      "system:serviceaccounts:actions-runner-system",
      "system:authenticated"
    ],
    "extra": {
      "authentication.kubernetes.io/pod-name": [
        "actions-runner-controller-5797557479-chm7n"
      ],
      "authentication.kubernetes.io/pod-uid": [
        "03d8bf75-ab3f-470b-b8b6-d83681d79e53"
      ]
    }
  },
  "sourceIPs": [
    "10.0.50.116"
  ],
  "userAgent": "manager/v0.0.0 (linux/amd64) kubernetes/$Format",
  "objectRef": {
    "resource": "runners",
    "namespace": "actions-runner-system",
    "name": "github-runners-spot-large-sw7cb-lx6sc",
    "apiGroup": "actions.summerwind.dev",
    "apiVersion": "v1alpha1"
  },
  "responseStatus": {
    "metadata": {},
    "code": 200
  },
  "requestReceivedTimestamp": "2023-05-03T09:04:29.374365Z",
  "stageTimestamp": "2023-05-03T09:04:29.383214Z",
  "annotations": {
    "authorization.k8s.io/decision": "allow",
    "authorization.k8s.io/reason": "RBAC: allowed by ClusterRoleBinding \"actions-runner-controller-manager\" of ClusterRole \"actions-runner-controller-manager\" to ServiceAccount \"actions-runner-controller/actions-runner-system\""
  }
}

Describe the expected behavior

ARC is expected not to delete pods while they are running a job.

Whole Controller Logs

This is a snippet of the controller/webhook server logs at the time the pod was deleted:

2023-05-03T09:04:29Z INFO runnerpod Runner pod is marked as already unregistered. {"runnerpod": "actions-runner-system/github-runners-spot-large-sw7cb-z6bp7"}
pod=actions-runner-controller-5797557479-chm7n
node=ip-10-0-60-36.eu-west-1.compute.internal

2023-05-03T09:04:29Z INFO runnerpod Runner pod is annotated to wait for completion, and the runner container is not restarting {"runnerpod": "actions-runner-system/github-runners-spot-large-sw7cb-lx6sc"}
pod=actions-runner-controller-5797557479-chm7n
node=ip-10-0-60-36.eu-west-1.compute.internal

2023-05-03T09:04:29Z INFO runnerpod Runner pod has been stopped with a successful status. {"runnerpod": "actions-runner-system/github-runners-spot-large-sw7cb-z6bp7"}
pod=actions-runner-controller-5797557479-chm7n
node=ip-10-0-60-36.eu-west-1.compute.internal

2023-05-03T09:04:29Z DEBUG runnerpod Failed to unregister runner before deleting the pod. {"runnerpod": "actions-runner-system/github-runners-spot-large-sw7cb-lx6sc", "error": "failed to remove runner: DELETE https://api.github.com/orgs/joteam/actions/runners/56231: 422 Bad request - Runner \"github-runners-spot-large-sw7cb-lx6sc\" is still running a job\" []"}
pod=actions-runner-controller-5797557479-chm7n
node=ip-10-0-60-36.eu-west-1.compute.internal

2023-05-03T09:04:29Z INFO runner Removed finalizer {"runner": "actions-runner-system/github-runners-spot-large-sw7cb-z6bp7"}
pod=actions-runner-controller-5797557479-chm7n
node=ip-10-0-60-36.eu-west-1.compute.internal

2023-05-03T09:04:29Z INFO runner Removed finalizer {"runner": "actions-runner-system/github-runners-spot-large-sw7cb-lx6sc"}
pod=actions-runner-controller-5797557479-chm7n
node=ip-10-0-60-36.eu-west-1.compute.internal

2023-05-03T09:04:29Z INFO runnerpod This runner pod seems to have been deleted directly, bypassing the parent Runner resource. Marking the runner for deletion to not let it recreate this pod. {"runnerpod": "actions-runner-system/github-runners-spot-large-sw7cb-lx6sc"}
pod=actions-runner-controller-5797557479-chm7n
node=ip-10-0-60-36.eu-west-1.compute.internal

2023-05-03T09:04:28Z INFO runnerpod Runner pod is annotated to wait for completion, and the runner container is not restarting {"runnerpod": "actions-runner-system/github-runners-spot-large-sw7cb-cpq2r"}
pod=actions-runner-controller-5797557479-chm7n
node=ip-10-0-60-36.eu-west-1.compute.internal
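
Note that the audit event shown earlier is a delete on the Runner custom resource (objectRef.resource: "runners") rather than on the pod itself. If one of these deletions is caught in flight, the resource's deletion state can be inspected with something like the following (a sketch, using the runner name from this incident):

```shell
# Sketch: check whether the Runner custom resource has been marked for
# deletion and which finalizers are still holding it.
kubectl -n actions-runner-system get runner github-runners-spot-large-sw7cb-lx6sc \
  -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}{"\n"}'
```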

Whole Runner Pod Logs

Pod logs + runner logs at the time of deletion of the pod:

2023-05-03 12:04:30 
[RUNNER 2023-05-03 09:04:30Z INFO HostContext] No proxy settings were found based on environmental variables (http_proxy/https_proxy/HTTP_PROXY/HTTPS_PROXY)
pod=github-runners-spot-large-sw7cb-lx6sc
node=ip-10-0-12-36.eu-west-1.compute.internal

2023-05-03 12:04:29 
[WORKER 2023-05-03 09:04:29Z INFO JobServerQueue] Try to append *** batches web console lines for record '9983e0***b-b632-59ce-2f72-2e0ec0267c05', success rate: ***/***.
pod=github-runners-spot-large-sw7cb-lx6sc
node=ip-10-0-12-36.eu-west-1.compute.internal

2023-05-03 12:04:29 
2023-05-03T09:04:29Z INFO runnerpod Runner pod is annotated to wait for completion, and the runner container is not restarting {"runnerpod": "actions-runner-system/github-runners-spot-large-sw7cb-lx6sc"}
pod=actions-runner-controller-5797557479-chm7n
node=ip-10-0-60-36.eu-west-1.compute.internal

2023-05-03 12:04:29 
2023-05-03T09:04:29Z DEBUG runnerpod Failed to unregister runner before deleting the pod. {"runnerpod": "actions-runner-system/github-runners-spot-large-sw7cb-lx6sc", "error": "failed to remove runner: DELETE https://api.github.com/orgs/jteam/actions/runners/56231: 422 Bad request - Runner \"github-runners-spot-large-sw7cb-lx6sc\" is still running a job\" []"}
pod=actions-runner-controller-5797557479-chm7n
node=ip-10-0-60-36.eu-west-1.compute.internal

2023-05-03 12:04:29 
2023-05-03 09:04:29.500 NOTICE --- Observed that the runner has been registered.
pod=github-runners-spot-large-sw7cb-lx6sc
node=ip-10-0-12-36.eu-west-1.compute.internal

2023-05-03 12:04:29 
2023-05-03 09:04:29.497 NOTICE --- Waiting for the runner to register first.
pod=github-runners-spot-large-sw7cb-lx6sc
node=ip-10-0-12-36.eu-west-1.compute.internal

2023-05-03 12:04:29 
/runner /
pod=github-runners-spot-large-sw7cb-lx6sc
node=ip-10-0-12-36.eu-west-1.compute.internal

2023-05-03 12:04:29 
5f21db279153 mongo:4.4 "docker-entrypoint.s…" 6 minutes ago Up 5 minutes 0.0.0.0:27017->27017/tcp, :::27017->27017/tcp d9284e3fa27c4a209aaa94e54a0a6df5_mongo44_010171
pod=github-runners-spot-large-sw7cb-lx6sc
node=ip-10-0-12-36.eu-west-1.compute.internal

2023-05-03 12:04:29 
d1a0abbb7ffa redis:6.2-alpine "docker-entrypoint.s…" 5 minutes ago Up 5 minutes (healthy) 0.0.0.0:6379->6379/tcp, :::6379->6379/tcp 8998acee1d7b4c5e8df7e825a433b108_redis62alpine_d1eb3a
pod=github-runners-spot-large-sw7cb-lx6sc
node=ip-10-0-12-36.eu-west-1.compute.internal

2023-05-03 12:04:29 
82c60b470442 bitnami/redis-sentinel:latest "/opt/bitnami/script…" 5 minutes ago Up 5 minutes (healthy) 0.0.0.0:26379->26379/tcp, :::26379->26379/tcp a530dfb127b8430686b56455a2d602a8_bitnamiredissentinellatest_aceb88
pod=github-runners-spot-large-sw7cb-lx6sc
node=ip-10-0-12-36.eu-west-1.compute.internal

2023-05-03 12:04:29 
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
pod=github-runners-spot-large-sw7cb-lx6sc
node=ip-10-0-12-36.eu-west-1.compute.internal

2023-05-03 12:04:29 
Prestop hook started
pod=github-runners-spot-large-sw7cb-lx6sc
node=ip-10-0-12-36.eu-west-1.compute.internal

2023-05-03 12:04:29 
[WORKER 2023-05-03 09:04:29Z INFO JobServerQueue] Try to append *** batches web console lines for record '9983e0***b-b632-59ce-2f72-2e0ec0267c05', success rate: ***/***.
pod=github-runners-spot-large-sw7cb-lx6sc
node=ip-10-0-12-36.eu-west-1.compute.internal

2023-05-03 12:04:29 
2023-05-03 09:04:29.409 NOTICE --- Ensuring dockerd is still running.
pod=github-runners-spot-large-sw7cb-lx6sc
node=ip-10-0-12-36.eu-west-1.compute.internal

2023-05-03 12:04:29 
2023-05-03T09:04:29Z INFO runner Removed finalizer {"runner": "actions-runner-system/github-runners-spot-large-sw7cb-lx6sc"}
pod=actions-runner-controller-5797557479-chm7n
node=ip-10-0-60-36.eu-west-1.compute.internal

2023-05-03 12:04:29 
2023-05-03 09:04:29.402 NOTICE --- Note that if this takes more time than terminationGracePeriodSeconds, the runner will be forcefully terminated by Kubernetes, which may result in the in-progress workflow job, if any, to fail.
pod=github-runners-spot-large-sw7cb-lx6sc
node=ip-10-0-12-36.eu-west-1.compute.internal

2023-05-03 12:04:29 
2023-05-03 09:04:29.397 NOTICE --- Executing actions-runner-controller's SIGTERM handler.
pod=github-runners-spot-large-sw7cb-lx6sc
node=ip-10-0-12-36.eu-west-1.compute.internal

2023-05-03 12:04:29 
2023-05-03T09:04:29Z INFO runnerpod This runner pod seems to have been deleted directly, bypassing the parent Runner resource. Marking the runner for deletion to not let it recreate this pod. {"runnerpod": "actions-runner-system/github-runners-spot-large-sw7cb-lx6sc"}
pod=actions-runner-controller-5797557479-chm7n
node=ip-10-0-60-36.eu-west-1.compute.internal

2023-05-03 12:04:28 
[WORKER 2023-05-03 09:04:28Z INFO JobServerQueue] Try to append *** batches web console lines for record '9983e0***b-b632-59ce-2f72-2e0ec0267c05', success rate: ***/***.
pod=github-runners-spot-large-sw7cb-lx6sc
node=ip-10-0-12-36.eu-west-1.compute.internal

2023-05-03 12:04:28 
[WORKER 2023-05-03 09:04:28Z INFO JobServerQueue] Try to append *** batches web console lines for record '9983e0***b-b632-59ce-2f72-2e0ec0267c05', success rate: ***/***.
pod=github-runners-spot-large-sw7cb-lx6sc
node=ip-10-0-12-36.eu-west-1.compute.internal

2023-05-03 12:04:27 
[WORKER 2023-05-03 09:04:27Z INFO JobServerQueue] Try to append *** batches web console lines for record '9983e0***b-b632-59ce-2f72-2e0ec0267c05', success rate: ***/***.
pod=github-runners-spot-large-sw7cb-lx6sc
node=ip-10-0-12-36.eu-west-1.compute.internal

2023-05-03 12:04:25 
[WORKER 2023-05-03 09:04:25Z INFO JobServerQueue] Try to append *** batches web console lines for record '9983e0***b-b632-59ce-2f72-2e0ec0267c05', success rate: ***/***.
pod=github-runners-spot-large-sw7cb-lx6sc
node=ip-10-0-12-36.eu-west-1.compute.internal

2023-05-03 12:04:24 
[WORKER 2023-05-03 09:04:24Z INFO JobServerQueue] Try to append *** batches web console lines for record '9983e0***b-b632-59ce-2f72-2e0ec0267c05', success rate: ***/***.
pod=github-runners-spot-large-sw7cb-lx6sc
node=ip-10-0-12-36.eu-west-1.compute.internal

2023-05-03 12:04:23 
[WORKER 2023-05-03 09:04:23Z INFO JobServerQueue] Try to append *** batches web console lines for record '9983e0***b-b632-59ce-2f72-2e0ec0267c05', success rate: ***/***.
pod=github-runners-spot-large-sw7cb-lx6sc
node=ip-10-0-12-36.eu-west-1.compute.internal

2023-05-03 12:04:23 
[WORKER 2023-05-03 09:04:23Z INFO JobServerQueue] Try to append *** batches web console lines for record '9983e0***b-b632-59ce-2f72-2e0ec0267c05', success rate: ***/***.
pod=github-runners-spot-large-sw7cb-lx6sc
node=ip-10-0-12-36.eu-west-1.compute.internal

2023-05-03 12:04:22 
[WORKER 2023-05-03 09:04:22Z INFO JobServerQueue] Try to append *** batches web console lines for record '9983e0***b-b632-59ce-2f72-2e0ec0267c05', success rate: ***/***.
pod=github-runners-spot-large-sw7cb-lx6sc
node=ip-10-0-12-36.eu-west-1.compute.internal

2023-05-03 12:04:22 
[WORKER 2023-05-03 09:04:22Z INFO JobServerQueue] Try to append *** batches web console lines for record '9983e0***b-b632-59ce-2f72-2e0ec0267c05', success rate: ***/***.
pod=github-runners-spot-large-sw7cb-lx6sc
node=ip-10-0-12-36.eu-west-1.compute.internal

2023-05-03 12:04:21 
[WORKER 2023-05-03 09:04:21Z INFO JobServerQueue] Try to append 2 batches web console lines for record '9983e0***b-b632-59ce-2f72-2e0ec0267c05', success rate: 2/2.
pod=github-runners-spot-large-sw7cb-lx6sc
node=ip-10-0-12-36.eu-west-1.compute.internal

2023-05-03 12:04:21 
[WORKER 2023-05-03 09:04:21Z INFO JobServerQueue] Try to append *** batches web console lines for record '9983e0***b-b632-59ce-2f72-2e0ec0267c05', success rate: ***/***.
pod=github-runners-spot-large-sw7cb-lx6sc
node=ip-10-0-12-36.eu-west-1.compute.internal

2023-05-03 12:04:21 
[WORKER 2023-05-03 09:04:21Z INFO HostContext] Well known directory 'Work': '/runner/_work'
pod=github-runners-spot-large-sw7cb-lx6sc
node=ip-10-0-12-36.eu-west-1.compute.internal

2023-05-03 12:04:21 
[WORKER 2023-05-03 09:04:21Z INFO HostContext] Well known directory 'Root': '/runner'
pod=github-runners-spot-large-sw7cb-lx6sc
node=ip-10-0-12-36.eu-west-1.compute.internal

2023-05-03 12:04:21 
[WORKER 2023-05-03 09:04:21Z INFO HostContext] Well known directory 'Bin': '/runner/bin'
pod=github-runners-spot-large-sw7cb-lx6sc
node=ip-10-0-12-36.eu-west-1.compute.internal

2023-05-03 12:04:20 
[WORKER 2023-05-03 09:04:20Z INFO JobServerQueue] Try to append *** batches web console lines for record '9983e0***b-b632-59ce-2f72-2e0ec0267c05', success rate: ***/***.
pod=github-runners-spot-large-sw7cb-lx6sc
node=ip-10-0-12-36.eu-west-1.compute.internal

2023-05-03 12:04:20 
[WORKER 2023-05-03 09:04:20Z INFO JobServerQueue] Try to append *** batches web console lines for record '9983e0***b-b632-59ce-2f72-2e0ec0267c05', success rate: ***/***.
pod=github-runners-spot-large-sw7cb-lx6sc
node=ip-10-0-12-36.eu-west-1.compute.internal


Additional Context

No response
github-actions[bot] commented 1 year ago

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

mar-pan commented 1 year ago

I'm facing the same issue :/

hontarenko commented 1 year ago

I'm also facing the same problem

yuvalshi0 commented 1 year ago

same here

MPV commented 1 year ago

Related to this upstream issue?

S-G0D commented 12 months ago

I am facing this exact problem. There is no telling exactly when it happens. I have seen it happen on long-running jobs (~1 hour). Any update on a fix, please? It is related to this issue: https://github.com/actions/actions-runner-controller/issues/2882

victoruzunovjuro commented 11 months ago

Related to this upstream issue?

* [The linux runner does not gracefully shutdown on SIGINT runner#2582](https://github.com/actions/runner/issues/2582)

This doesn't seem entirely relevant. We are talking about the controller sending an API call to Kubernetes to kill the runner pod for whatever reason.

Loganathan-Murugesan commented 10 months ago

I am facing a similar issue. Exactly after 1 hour the runner gets terminated even though the job is still running, and I get the error below.
The self-hosted runner: xxxxxx lost communication with the server. Verify the machine is running and has a healthy network connection. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.

AmitBenAmi commented 10 months ago

I am facing the same issue. The runner pod died randomly, making the UI hang for 10 minutes and then failing the workflow.

Related to this upstream issue?

That issue and its possible fix only add a graceful shutdown: instead of hanging for 10 minutes, the runner gets enough time to exit gracefully and the run shows as canceled. I updated my runner to add this and manually deleted the pod, and the job was indeed canceled rather than hanging, so it isn't related to the root cause, as mentioned by @victoruzunovjuro.

I also tried to make sure it isn't being scaled down by the cluster-autoscaler, and verified it wasn't a spot instance termination, but unfortunately I can't check the audit logs to confirm the pod really was terminated by the actions runner controller itself (I do not have the audit logs).
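
For reference, the checks described above can look roughly like this (a sketch: the cluster-autoscaler deployment name and namespace are assumptions about a typical setup, not taken from this thread):

```shell
# Sketch: rule out cluster-autoscaler scale-down and node termination as the
# reason a runner pod vanished.
kubectl get events -A --field-selector reason=ScaleDown
kubectl -n kube-system logs deployment/cluster-autoscaler --since=24h | grep -i "removing node"
kubectl get nodes   # confirm the node the runner pod was scheduled on still exists
```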

Mahmoudkassry commented 4 months ago

This started to happen for me after upgrading the runner image to 2.316.0. Rolling back to 2.314.1 might fix your issue until they release a fix.

zakariais commented 3 weeks ago

@victoruzunovjuro does the lifecycle hook work? I'm trying to make the preStop hook work, but it doesn't work at all. The runner just gets killed and there is no lifecycle hook execution; I even tried with sidecar containers, but it's the same issue. Is it related to SIGTERM not being received?
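
A rough way to check whether a preStop hook fires at all (a sketch; <runner-pod> is a placeholder): the pod logs earlier in this issue do include a "Prestop hook started" line, and hook failures surface as FailedPreStopHook events on the pod.

```shell
# Sketch: confirm a preStop hook is actually defined on the runner container,
# then look for kubelet events reporting that it failed. <runner-pod> is a placeholder.
kubectl -n actions-runner-system get pod <runner-pod> -o yaml | grep -B2 -A8 'preStop:'
kubectl -n actions-runner-system get events \
  --field-selector involvedObject.name=<runner-pod>,reason=FailedPreStopHook
```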