actions / actions-runner-controller

Kubernetes controller for GitHub Actions self-hosted runners
Apache License 2.0

SummerWind Webhook autoscaler kills running jobs when `runs-on` is ambiguous #2798

Open Nuru opened 1 year ago

Nuru commented 1 year ago

One consequence of this bug is that runners can be arbitrarily killed while running jobs. It happens in variations on a pattern like this:

End result: 3 jobs killed, 3 unwanted runners running idle.

Checks

Controller Version

0.27.4 (still happens with 0.27.6)

Helm Chart Version

0.23.3 (still happens with 0.23.7)

CertManager Version

1.10.2

Deployment Method

Helm

cert-manager installation

I'm certain cert-manager is working properly; we use it for other things.

Resource Definitions

HorizontalRunnerAutoscaler

```yaml
apiVersion: v1
items:
- apiVersion: actions.summerwind.dev/v1alpha1
  kind: HorizontalRunnerAutoscaler
  metadata:
    annotations:
      meta.helm.sh/release-name: infra-runner-amd64-large
      meta.helm.sh/release-namespace: actions-runner-system
    creationTimestamp: "2023-04-21T20:22:36Z"
    generation: 1429
    labels:
      app.kubernetes.io/managed-by: Helm
      k8slens-edit-resource-version: v1alpha1
    name: infra-runner-amd64-large
    namespace: actions-runner-system
    resourceVersion: "98605291"
    uid: ea9a0690-e332-4970-8262-a2bf8df39b68
  spec:
    maxReplicas: 8
    minReplicas: 0
    scaleDownDelaySecondsAfterScaleOut: 300
    scaleTargetRef:
      name: infra-runner-amd64-large
    scaleUpTriggers:
    - amount: 1
      duration: 3h60m
      githubEvent:
        workflowJob: {}
  status:
    desiredReplicas: 0
    lastSuccessfulScaleOutTime: "2023-08-05T23:10:58Z"
- apiVersion: actions.summerwind.dev/v1alpha1
  kind: HorizontalRunnerAutoscaler
  metadata:
    annotations:
      meta.helm.sh/release-name: infra-runner-amd64-medium
      meta.helm.sh/release-namespace: actions-runner-system
    creationTimestamp: "2023-04-19T23:54:18Z"
    generation: 1174
    labels:
      app.kubernetes.io/managed-by: Helm
      k8slens-edit-resource-version: v1alpha1
    name: infra-runner-amd64-medium
    namespace: actions-runner-system
    resourceVersion: "98607727"
    uid: f7aa1d23-c8ff-422d-8eb9-7df1eb66f454
  spec:
    maxReplicas: 8
    minReplicas: 0
    scaleDownDelaySecondsAfterScaleOut: 300
    scaleTargetRef:
      name: infra-runner-amd64-medium
    scaleUpTriggers:
    - amount: 1
      duration: 60m
      githubEvent:
        workflowJob: {}
  status:
    desiredReplicas: 0
    lastSuccessfulScaleOutTime: "2023-08-06T01:00:36Z"
- apiVersion: actions.summerwind.dev/v1alpha1
  kind: HorizontalRunnerAutoscaler
  metadata:
    annotations:
      meta.helm.sh/release-name: infra-runner-amd64-small
      meta.helm.sh/release-namespace: actions-runner-system
    creationTimestamp: "2023-04-19T23:54:18Z"
    generation: 6427
    labels:
      app.kubernetes.io/managed-by: Helm
    name: infra-runner-amd64-small
    namespace: actions-runner-system
    resourceVersion: "98602624"
    uid: 37cd150c-7509-46a4-a575-0084417fd9cc
  spec:
    capacityReservations:
    - effectiveTime: "2023-08-06T00:21:14Z"
      expirationTime: "2023-08-06T00:51:14Z"
      replicas: 1
    maxReplicas: 5
    minReplicas: 1
    scaleDownDelaySecondsAfterScaleOut: 300
    scaleTargetRef:
      name: infra-runner-amd64-small
    scaleUpTriggers:
    - amount: 1
      duration: 30m
      githubEvent:
        workflowJob: {}
  status:
    desiredReplicas: 1
    lastSuccessfulScaleOutTime: "2023-08-06T00:21:14Z"
- apiVersion: actions.summerwind.dev/v1alpha1
  kind: HorizontalRunnerAutoscaler
  metadata:
    annotations:
      meta.helm.sh/release-name: infra-runner-arm64
      meta.helm.sh/release-namespace: actions-runner-system
    creationTimestamp: "2023-03-28T03:21:51Z"
    generation: 4437
    labels:
      app.kubernetes.io/managed-by: Helm
    name: infra-runner-arm64
    namespace: actions-runner-system
    resourceVersion: "98588411"
    uid: 3e524b3a-5b26-4059-a22d-ec713ff309d2
  spec:
    maxReplicas: 128
    minReplicas: 0
    scaleDownDelaySecondsAfterScaleOut: 300
    scaleTargetRef:
      name: infra-runner-arm64
    scaleUpTriggers:
    - amount: 1
      duration: 45m
      githubEvent:
        workflowJob: {}
  status:
    desiredReplicas: 0
    lastSuccessfulScaleOutTime: "2023-08-06T00:03:44Z"
kind: List
metadata:
  resourceVersion: ""
```
RunnerDeployments

```yaml
apiVersion: v1
items:
- apiVersion: actions.summerwind.dev/v1alpha1
  kind: RunnerDeployment
  metadata:
    annotations:
      meta.helm.sh/release-name: infra-runner-amd64-large
      meta.helm.sh/release-namespace: actions-runner-system
    creationTimestamp: "2023-04-21T20:22:36Z"
    generation: 1372
    labels:
      app.kubernetes.io/managed-by: Helm
    name: infra-runner-amd64-large
    namespace: actions-runner-system
    resourceVersion: "98605451"
    uid: 3a94f97e-4fe6-4d3a-b73d-77dd649fa813
  spec:
    effectiveTime: "2023-08-05T08:03:29Z"
    replicas: 0
    template:
      metadata:
        annotations:
          karpenter.sh/do-not-evict: "true"
      spec:
        dockerdWithinRunnerContainer: true
        env:
        - name: RUNNER_GRACEFUL_STOP_TIMEOUT
          value: "90"
        group: amd64-large
        image: ghcr.io/actions-runner-controller/actions-runner-controller/actions-runner-dind:v2.307.1-ubuntu-20.04
        imagePullPolicy: IfNotPresent
        labels:
        - self-hosted
        - Linux
        - linux
        - Ubuntu
        - ubuntu
        - X64
        - x64
        - x86_64
        - amd64
        - AMD64
        - large
        nodeSelector:
          kubernetes.io/arch: amd64
          kubernetes.io/os: linux
        organization: my-organization
        resources:
          limits:
            cpu: 6000m
            memory: 7680Mi
          requests:
            cpu: 4000m
            memory: 7680Mi
        serviceAccountName: actions-runner
        terminationGracePeriodSeconds: 100
        volumeMounts:
        - mountPath: /home/runner/work/shared
          name: shared-volume
        volumes:
        - name: shared-volume
          persistentVolumeClaim:
            claimName: infra-runner-amd64-large
  status:
    availableReplicas: 0
    desiredReplicas: 0
    readyReplicas: 0
    replicas: 0
    updatedReplicas: 0
- apiVersion: actions.summerwind.dev/v1alpha1
  kind: RunnerDeployment
  metadata:
    annotations:
      meta.helm.sh/release-name: infra-runner-amd64-medium
      meta.helm.sh/release-namespace: actions-runner-system
    creationTimestamp: "2023-04-19T23:54:18Z"
    generation: 1064
    labels:
      app.kubernetes.io/managed-by: Helm
    name: infra-runner-amd64-medium
    namespace: actions-runner-system
    resourceVersion: "98607782"
    uid: ebef8bc4-0e74-437c-b139-9b1ac083815e
  spec:
    effectiveTime: "2023-08-06T01:00:36Z"
    replicas: 0
    template:
      metadata:
        annotations:
          karpenter.sh/do-not-evict: "true"
      spec:
        dockerdWithinRunnerContainer: true
        env:
        - name: RUNNER_GRACEFUL_STOP_TIMEOUT
          value: "90"
        image: ghcr.io/actions-runner-controller/actions-runner-controller/actions-runner-dind:v2.307.1-ubuntu-20.04
        imagePullPolicy: IfNotPresent
        labels:
        - self-hosted
        - Linux
        - linux
        - Ubuntu
        - ubuntu
        - X64
        - x64
        - x86_64
        - amd64
        - AMD64
        - core-auto
        - medium
        nodeSelector:
          kubernetes.io/arch: amd64
          kubernetes.io/os: linux
        organization: my-organization
        resources:
          limits:
            cpu: 3000m
            memory: 3072Mi
          requests:
            cpu: 1500m
            memory: 1536Mi
        serviceAccountName: actions-runner
        terminationGracePeriodSeconds: 100
        volumeMounts:
        - mountPath: /home/runner/work/shared
          name: shared-volume
        volumes:
        - name: shared-volume
          persistentVolumeClaim:
            claimName: infra-runner-amd64-medium
  status:
    availableReplicas: 0
    desiredReplicas: 0
    readyReplicas: 0
    replicas: 0
    updatedReplicas: 0
- apiVersion: actions.summerwind.dev/v1alpha1
  kind: RunnerDeployment
  metadata:
    annotations:
      meta.helm.sh/release-name: infra-runner-amd64-small
      meta.helm.sh/release-namespace: actions-runner-system
    creationTimestamp: "2023-04-19T23:54:18Z"
    generation: 4566
    labels:
      app.kubernetes.io/managed-by: Helm
    name: infra-runner-amd64-small
    namespace: actions-runner-system
    resourceVersion: "98605914"
    uid: 07b9d677-11e5-488f-aa57-ddd41e3891a9
  spec:
    effectiveTime: "2023-08-06T00:21:14Z"
    replicas: 1
    template:
      spec:
        dockerdWithinRunnerContainer: true
        env:
        - name: RUNNER_GRACEFUL_STOP_TIMEOUT
          value: "90"
        image: ghcr.io/actions-runner-controller/actions-runner-controller/actions-runner-dind:v2.307.1-ubuntu-20.04
        imagePullPolicy: IfNotPresent
        labels:
        - self-hosted
        - Linux
        - linux
        - Ubuntu
        - ubuntu
        - X64
        - x64
        - x86_64
        - amd64
        - AMD64
        - core-auto
        - common
        - default
        - small
        nodeSelector:
          kubernetes.io/arch: amd64
          kubernetes.io/os: linux
        organization: my-organization
        resources:
          limits:
            cpu: 1000m
            memory: 1024Mi
          requests:
            cpu: 500m
            memory: 256Mi
        serviceAccountName: actions-runner
        terminationGracePeriodSeconds: 100
        volumeMounts:
        - mountPath: /home/runner/work/shared
          name: shared-volume
        volumes:
        - name: shared-volume
          persistentVolumeClaim:
            claimName: infra-runner-amd64-small
  status:
    availableReplicas: 1
    desiredReplicas: 1
    readyReplicas: 1
    replicas: 1
    updatedReplicas: 1
- apiVersion: actions.summerwind.dev/v1alpha1
  kind: RunnerDeployment
  metadata:
    annotations:
      meta.helm.sh/release-name: infra-runner-arm64
      meta.helm.sh/release-namespace: actions-runner-system
    creationTimestamp: "2023-03-28T03:21:51Z"
    generation: 3836
    labels:
      app.kubernetes.io/managed-by: Helm
    name: infra-runner-arm64
    namespace: actions-runner-system
    resourceVersion: "98604888"
    uid: 3eb2f283-ddd1-4a03-b244-4ff65031c03f
  spec:
    effectiveTime: "2023-08-06T00:03:44Z"
    replicas: 0
    template:
      metadata:
        annotations:
          karpenter.sh/do-not-evict: "true"
      spec:
        dockerdWithinRunnerContainer: true
        env:
        - name: RUNNER_GRACEFUL_STOP_TIMEOUT
          value: "90"
        group: armEnabled
        image: ghcr.io/actions-runner-controller/actions-runner-controller/actions-runner-dind:v2.307.1-ubuntu-20.04
        imagePullPolicy: IfNotPresent
        labels:
        - self-hosted
        - Linux
        - linux
        - Ubuntu
        - ubuntu
        - arm64
        - ARM64
        - aarch64
        - core-auto
        - small
        - medium
        - large
        - packages
        nodeSelector:
          kubernetes.io/arch: arm64
          kubernetes.io/os: linux
        organization: my-organization
        resources:
          limits:
            cpu: 2000m
            memory: 2048Mi
          requests:
            cpu: 250m
            memory: 512Mi
        serviceAccountName: actions-runner
        terminationGracePeriodSeconds: 100
        tolerations:
        - effect: NoSchedule
          key: kubernetes.io/arch
          operator: Equal
          value: arm64
        volumeMounts:
        - mountPath: /home/runner/work/shared
          name: shared-volume
        volumes:
        - name: shared-volume
          persistentVolumeClaim:
            claimName: infra-runner-arm64
  status:
    availableReplicas: 0
    desiredReplicas: 0
    readyReplicas: 0
    replicas: 0
    updatedReplicas: 0
kind: List
metadata:
  resourceVersion: ""
```

To Reproduce

Describe the bug

When a job's `runs-on` spec matches multiple runner deployments and the HRAs are using webhook-based autoscaling, the HRA will unpredictably pick one of the deployments in the default group to scale up or down, although the job may in fact be picked up by any matching deployment.

If there is only one RunnerDeployment in the default group, then I expect (but have not tested) that it will be that deployment that is consistently scaled up and down, but again, it will not necessarily be that deployment that actually gets assigned the job.
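To illustrate the ambiguity, here is a minimal sketch of label matching using the labels from the two default-group RunnerDeployments above. It assumes matching is a case-insensitive subset test on labels; `matching_deployments` is a name made up for this example and is not ARC's or GitHub's actual code.

```python
# Labels copied from the two default-group RunnerDeployments in this report.
COMMON = {"self-hosted", "Linux", "linux", "Ubuntu", "ubuntu",
          "X64", "x64", "x86_64", "amd64", "AMD64", "core-auto"}
DEFAULT_GROUP_DEPLOYMENTS = {
    "infra-runner-amd64-medium": COMMON | {"medium"},
    "infra-runner-amd64-small": COMMON | {"common", "default", "small"},
}


def matching_deployments(runs_on, deployments):
    """Return, sorted by name, the deployments whose labels cover every requested label.

    Assumption: matching is a case-insensitive subset test.
    """
    wanted = {label.lower() for label in runs_on}
    return sorted(
        name
        for name, labels in deployments.items()
        if wanted <= {label.lower() for label in labels}
    )


# The job in the controller logs requested only ["self-hosted"]: both match.
print(matching_deployments(["self-hosted"], DEFAULT_GROUP_DEPLOYMENTS))
# Adding a distinguishing label removes the ambiguity.
print(matching_deployments(["self-hosted", "small"], DEFAULT_GROUP_DEPLOYMENTS))
```

With only `self-hosted` requested, both deployments qualify, which is the situation in which the scaled deployment and the deployment that runs the job can differ.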

Describe the expected behavior

If a job matches multiple runner deployments, the HRA should, at a minimum, consistently pick the same deployment to scale up and scale down. This way, if no deployments have idle runners, autoscaling should work acceptably, as jobs would get picked up by the deployment being scaled up, and that deployment would be scaled down when jobs complete.
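A minimal sketch of what "consistently pick the same deployment" could look like: choose among the matching HRAs with a rule that does not depend on listing order. The function name and the tie-break (lexicographically first name) are hypothetical and are not how ARC currently selects a target.

```python
def pick_scale_target(candidate_hras):
    """Pick one HRA from the candidates using a stable rule.

    Hypothetical tie-break: the lexicographically first name. Any rule works
    as long as it ignores the order in which the candidates are listed.
    """
    return min(candidate_hras) if candidate_hras else None


# The same two candidates, listed in different orders for the two events.
on_queued = pick_scale_target(["infra-runner-amd64-small", "infra-runner-amd64-medium"])
on_completed = pick_scale_target(["infra-runner-amd64-medium", "infra-runner-amd64-small"])
print(on_queued, on_completed)
```

Because both events resolve to the same HRA, a scale-up for a queued job is always undone on the HRA that received it.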

Ideally, when a job is completed, the HRA would match the job ID to a specific Pod and capacity reservation, delete the capacity reservation and scale down the deployment it is in.
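The ideal behavior could be sketched roughly as follows. This is only an illustration under stated assumptions: `ReservationTracker` and `deployment_for_runner` are hypothetical names, the runner name in the usage is made up, and the sketch assumes that runner names start with the name of the deployment that owns them and that the runner name is known when the job completes.

```python
class ReservationTracker:
    """Remember which HRA received the capacity reservation for each job ID."""

    def __init__(self):
        self._hra_by_job = {}

    def on_queued(self, job_id, hra_name):
        # Record where the reservation for this job was added.
        self._hra_by_job[job_id] = hra_name

    def on_completed(self, job_id):
        # Return the HRA whose reservation should be removed, or None if unknown.
        return self._hra_by_job.pop(job_id, None)


def deployment_for_runner(runner_name, deployment_names):
    """Guess the owning deployment from a runner name.

    Assumption: runner names start with "<deployment name>-". The longest
    matching prefix wins so that similar names do not collide.
    """
    owners = [d for d in deployment_names if runner_name.startswith(d + "-")]
    return max(owners, key=len) if owners else None


tracker = ReservationTracker()
tracker.on_queued(15649994127, "infra-runner-amd64-medium")
reserved_on = tracker.on_completed(15649994127)
ran_on = deployment_for_runner(
    "infra-runner-amd64-small-abcde-12345",  # hypothetical runner name
    ["infra-runner-amd64-small", "infra-runner-amd64-medium"],
)
print(reserved_on, ran_on)
```

Keeping the two answers separate makes the mismatch visible: the reservation to delete belongs to one HRA, while the pod to remove belongs to the deployment that actually ran the job.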

Whole Controller Logs

Note that the job ran on infra-runner-amd64-small but the HRA scaled infra-runner-amd64-medium.

```shell
2023-08-06T00:58:03Z INFO -github-webhook-secret-token and GITHUB_WEBHOOK_SECRET_TOKEN are missing or empty. Create one following https://docs.github.com/en/developers/webhooks-and-events/securing-your-webhooks and specify it via the flag or the envvar
2023-08-06T00:58:03Z INFO -watch-namespace is %q. Only HorizontalRunnerAutoscalers in %q are watched, cached, and considered as scale targets. {"actions-runner-system": "actions-runner-system"}
2023-08-06T00:58:04Z INFO controller-runtime.metrics Metrics server is starting to listen {"addr": "127.0.0.1:8080"}
2023-08-06T00:58:04Z INFO starting webhook server
2023-08-06T00:58:04Z INFO Starting server {"path": "/metrics", "kind": "metrics", "addr": "127.0.0.1:8080"}
2023-08-06T00:58:04Z INFO Starting EventSource {"controller": "webhookbasedautoscaler", "controllerGroup": "actions.summerwind.dev", "controllerKind": "HorizontalRunnerAutoscaler", "source": "kind source: *v1alpha1.HorizontalRunnerAutoscaler"}
2023-08-06T00:58:04Z INFO Starting Controller {"controller": "webhookbasedautoscaler", "controllerGroup": "actions.summerwind.dev", "controllerKind": "HorizontalRunnerAutoscaler"}
2023-08-06T00:58:04Z INFO Starting workers {"controller": "webhookbasedautoscaler", "controllerGroup": "actions.summerwind.dev", "controllerKind": "HorizontalRunnerAutoscaler", "worker count": 1}
2023-08-06T01:00:32Z DEBUG controllers.webhookbasedautoscaler Found 0 HRAs by key {"key": "my-organization/action-test"}
2023-08-06T01:00:32Z DEBUG controllers.webhookbasedautoscaler Found some runner groups are managed by ARC {"event": "workflow_job", "hookID": "386548876", "delivery": "a47ff4b0-33f4-11ee-9c98-cc4bfd7f2014", "workflowJob.status": "queued", "workflowJob.labels": ["self-hosted"], "repository.name": "action-test", "repository.owner.login": "my-organization", "repository.owner.type": "Organization", "enterprise.slug": "", "action": "queued", "workflowJob.runID": 5773730087, "workflowJob.ID": 15649994127, "groups": "RunnerGroup{Scope:Organization, Kind:Default, Name:}, RunnerGroup{Scope:Organization, Kind:Default, Name:}, RunnerGroup{Scope:Organization, Kind:Custom, Name:armEnabled}, RunnerGroup{Scope:Organization, Kind:Custom, Name:amd64-large}"}
2023-08-06T01:00:33Z DEBUG controllers.webhookbasedautoscaler Searching in runner groups {"event": "workflow_job", "hookID": "386548876", "delivery": "a47ff4b0-33f4-11ee-9c98-cc4bfd7f2014", "workflowJob.status": "queued", "workflowJob.labels": ["self-hosted"], "repository.name": "action-test", "repository.owner.login": "my-organization", "repository.owner.type": "Organization", "enterprise.slug": "", "action": "queued", "workflowJob.runID": 5773730087, "workflowJob.ID": 15649994127, "groups": "RunnerGroup{Scope:Organization, Kind:Default, Name:}, RunnerGroup{Scope:Organization, Kind:Custom, Name:armEnabled}, RunnerGroup{Scope:Organization, Kind:Custom, Name:amd64-large}"}
2023-08-06T01:00:33Z DEBUG controllers.webhookbasedautoscaler groups {"event": "workflow_job", "hookID": "386548876", "delivery": "a47ff4b0-33f4-11ee-9c98-cc4bfd7f2014", "workflowJob.status": "queued", "workflowJob.labels": ["self-hosted"], "repository.name": "action-test", "repository.owner.login": "my-organization", "repository.owner.type": "Organization", "enterprise.slug": "", "action": "queued", "workflowJob.runID": 5773730087, "workflowJob.ID": 15649994127, "groups": "RunnerGroup{Scope:Organization, Kind:Default, Name:}, RunnerGroup{Scope:Organization, Kind:Custom, Name:armEnabled}, RunnerGroup{Scope:Organization, Kind:Custom, Name:amd64-large}"}
2023-08-06T01:00:33Z DEBUG controllers.webhookbasedautoscaler Found 2 HRAs by key {"key": "my-organization"}
2023-08-06T01:00:33Z DEBUG controllers.webhookbasedautoscaler job scale up target found {"event": "workflow_job", "hookID": "386548876", "delivery": "a47ff4b0-33f4-11ee-9c98-cc4bfd7f2014", "workflowJob.status": "queued", "workflowJob.labels": ["self-hosted"], "repository.name": "action-test", "repository.owner.login": "my-organization", "repository.owner.type": "Organization", "enterprise.slug": "", "action": "queued", "workflowJob.runID": 5773730087, "workflowJob.ID": 15649994127, "enterprise": "", "organization": "my-organization", "repository": "action-test", "key": "my-organization"}
2023-08-06T01:00:33Z INFO controllers.webhookbasedautoscaler scaled infra-runner-amd64-medium by 1 {"event": "workflow_job", "hookID": "386548876", "delivery": "a47ff4b0-33f4-11ee-9c98-cc4bfd7f2014", "workflowJob.status": "queued", "workflowJob.labels": ["self-hosted"], "repository.name": "action-test", "repository.owner.login": "my-organization", "repository.owner.type": "Organization", "enterprise.slug": "", "action": "queued", "workflowJob.runID": 5773730087, "workflowJob.ID": 15649994127}
2023-08-06T01:00:33Z INFO controllers.webhookbasedautoscaler Starting batch worker
2023-08-06T01:00:36Z DEBUG controllers.webhookbasedautoscaler Patching hra infra-runner-amd64-medium for capacityReservations update {"before": 0, "expired": -1, "added": 1, "completed": 0, "after": 1}
2023-08-06T01:00:40Z DEBUG controllers.webhookbasedautoscaler Found 0 HRAs by key {"key": "my-organization/action-test"}
2023-08-06T01:00:40Z DEBUG controllers.webhookbasedautoscaler Found some runner groups are managed by ARC {"event": "workflow_job", "hookID": "386548876", "delivery": "a942fb00-33f4-11ee-9d84-8c604b1056fa", "workflowJob.status": "completed", "workflowJob.labels": ["self-hosted"], "repository.name": "action-test", "repository.owner.login": "my-organization", "repository.owner.type": "Organization", "enterprise.slug": "", "action": "completed", "workflowJob.runID": 5773730087, "workflowJob.ID": 15649994127, "groups": "RunnerGroup{Scope:Organization, Kind:Default, Name:}, RunnerGroup{Scope:Organization, Kind:Default, Name:}, RunnerGroup{Scope:Organization, Kind:Custom, Name:amd64-large}, RunnerGroup{Scope:Organization, Kind:Custom, Name:armEnabled}"}
2023-08-06T01:00:40Z DEBUG controllers.webhookbasedautoscaler Searching in runner groups {"event": "workflow_job", "hookID": "386548876", "delivery": "a942fb00-33f4-11ee-9d84-8c604b1056fa", "workflowJob.status": "completed", "workflowJob.labels": ["self-hosted"], "repository.name": "action-test", "repository.owner.login": "my-organization", "repository.owner.type": "Organization", "enterprise.slug": "", "action": "completed", "workflowJob.runID": 5773730087, "workflowJob.ID": 15649994127, "groups": "RunnerGroup{Scope:Organization, Kind:Default, Name:}, RunnerGroup{Scope:Organization, Kind:Custom, Name:armEnabled}, RunnerGroup{Scope:Organization, Kind:Custom, Name:amd64-large}"}
2023-08-06T01:00:40Z DEBUG controllers.webhookbasedautoscaler groups {"event": "workflow_job", "hookID": "386548876", "delivery": "a942fb00-33f4-11ee-9d84-8c604b1056fa", "workflowJob.status": "completed", "workflowJob.labels": ["self-hosted"], "repository.name": "action-test", "repository.owner.login": "my-organization", "repository.owner.type": "Organization", "enterprise.slug": "", "action": "completed", "workflowJob.runID": 5773730087, "workflowJob.ID": 15649994127, "groups": "RunnerGroup{Scope:Organization, Kind:Default, Name:}, RunnerGroup{Scope:Organization, Kind:Custom, Name:armEnabled}, RunnerGroup{Scope:Organization, Kind:Custom, Name:amd64-large}"}
2023-08-06T01:00:40Z DEBUG controllers.webhookbasedautoscaler Found 2 HRAs by key {"key": "my-organization"}
2023-08-06T01:00:40Z DEBUG controllers.webhookbasedautoscaler job scale up target found {"event": "workflow_job", "hookID": "386548876", "delivery": "a942fb00-33f4-11ee-9d84-8c604b1056fa", "workflowJob.status": "completed", "workflowJob.labels": ["self-hosted"], "repository.name": "action-test", "repository.owner.login": "my-organization", "repository.owner.type": "Organization", "enterprise.slug": "", "action": "completed", "workflowJob.runID": 5773730087, "workflowJob.ID": 15649994127, "enterprise": "", "organization": "my-organization", "repository": "action-test", "key": "my-organization"}
2023-08-06T01:00:40Z INFO controllers.webhookbasedautoscaler scaled infra-runner-amd64-medium by -1 {"event": "workflow_job", "hookID": "386548876", "delivery": "a942fb00-33f4-11ee-9d84-8c604b1056fa", "workflowJob.status": "completed", "workflowJob.labels": ["self-hosted"], "repository.name": "action-test", "repository.owner.login": "my-organization", "repository.owner.type": "Organization", "enterprise.slug": "", "action": "completed", "workflowJob.runID": 5773730087, "workflowJob.ID": 15649994127}
2023-08-06T01:00:42Z DEBUG controllers.webhookbasedautoscaler Patching hra infra-runner-amd64-medium for capacityReservations update {"before": 1, "expired": 1, "added": 0, "completed": -1, "after": 0}
```
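To check which HRA the webhook autoscaler scaled for a given job, the `scaled <name> by <n>` entries in logs like the ones above can be pulled out per job ID. A minimal sketch, assuming the log format shown in this report; `scale_events` is a helper written for this example, and the sample lines are abridged to the fields the pattern uses.

```python
import re

# Matches "scaled <hra> by <delta>" and the first workflowJob.ID that follows it.
SCALED_RE = re.compile(
    r'scaled (?P<hra>\S+) by (?P<delta>-?\d+) .*?"workflowJob\.ID": (?P<job>\d+)',
    re.DOTALL,
)


def scale_events(log_text):
    """Return (job_id, hra_name, delta) for every scaling entry in the log text."""
    return [
        (int(m["job"]), m["hra"], int(m["delta"]))
        for m in SCALED_RE.finditer(log_text)
    ]


# Two entries from the logs in this report, abridged to the relevant fields.
SAMPLE = (
    '2023-08-06T01:00:33Z INFO controllers.webhookbasedautoscaler '
    'scaled infra-runner-amd64-medium by 1 '
    '{"event": "workflow_job", "action": "queued", "workflowJob.ID": 15649994127}\n'
    '2023-08-06T01:00:40Z INFO controllers.webhookbasedautoscaler '
    'scaled infra-runner-amd64-medium by -1 '
    '{"event": "workflow_job", "action": "completed", "workflowJob.ID": 15649994127}\n'
)

events = scale_events(SAMPLE)
for job_id, hra, delta in events:
    print(job_id, hra, delta)
```

Comparing the HRA names in this output with the deployment that actually ran the job is how the mismatch noted above (job on infra-runner-amd64-small, scaling on infra-runner-amd64-medium) shows up.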

Whole Runner Pod Logs

No relevant pod logs

Nuru commented 2 months ago

@nikola-jokic @mumoshu Note that this is an issue for the Summerwind controller. It still applies to summerwind/actions-runner-controller:v0.27.6.