actions / actions-runner-controller

Kubernetes controller for GitHub Actions self-hosted runners
Apache License 2.0

Autoscaler now creating a runnerset-per-pod/replica after 0.22.3 #1368

Closed · graemechristie closed this issue 2 years ago

graemechristie commented 2 years ago

Describe the bug
I've just installed 0.22.3, and we now seem to be getting multiple runnersets created (each with a short hash appended to the name), each with exactly one replica, rather than a single runnerset with many replicas.

To Reproduce
Steps to reproduce the behavior:

  1. Install actions-runner-controller v0.22.3 and deploy a RunnerSet with replicas > 1 (a minimal sketch follows below).
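
A minimal RunnerSet along these lines should reproduce it. This is a sketch assuming the fields shown in the full manifest further below; the name, organization, and image are placeholders:

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerSet
metadata:
  name: example-runnerset           # placeholder name
spec:
  replicas: 3                       # more than one replica
  organization: example-org         # placeholder organization
  ephemeral: false
  selector:
    matchLabels:
      app: example-runnerset
  serviceName: example-runnerset
  template:
    metadata:
      labels:
        app: example-runnerset
    spec:
      containers:
        - name: runner
          image: example.azurecr.io/runner:latest   # placeholder image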

Expected behavior
A single runnerset (ideally without a hash appended to the name) is created, which scales from min replicas to max replicas.

Screenshots

(Screenshots omitted; they showed the multiple hash-suffixed runnersets described above, each with a single replica.)

Additional context

I am guessing the change was introduced somewhere as part of this fairly major commit ...

https://github.com/actions-runner-controller/actions-runner-controller/commit/14a878bfae5fff0da168829951538204617ee663

Here's the RunnerSet resource for the selfhosted-autoscaling-syd-runnerset pictured above:

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerSet
metadata:
  annotations:
    meta.helm.sh/release-name: selfhosted-autoscaling
    meta.helm.sh/release-namespace: actions-runner-system
  creationTimestamp: '2022-01-05T12:08:26Z'
  generation: 9
  labels:
    app.kubernetes.io/managed-by: Helm
  managedFields:
    - apiVersion: actions.summerwind.dev/v1alpha1
      fieldsType: FieldsV1
      fieldsV1:
        f:spec:
          f:replicas: {}
        f:status:
          .: {}
          f:availableReplicas: {}
          f:desiredReplicas: {}
          f:readyReplicas: {}
          f:replicas: {}
          f:updatedReplicas: {}
      manager: manager
      operation: Update
      time: '2022-01-07T04:31:31Z'
    - apiVersion: actions.summerwind.dev/v1alpha1
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:annotations:
            .: {}
            f:meta.helm.sh/release-name: {}
            f:meta.helm.sh/release-namespace: {}
          f:labels:
            .: {}
            f:app.kubernetes.io/managed-by: {}
        f:spec:
          .: {}
          f:dockerdWithinRunnerContainer: {}
          f:ephemeral: {}
          f:image: {}
          f:labels: {}
          f:organization: {}
          f:selector:
            .: {}
            f:matchLabels:
              .: {}
              f:app: {}
          f:serviceName: {}
          f:template:
            .: {}
            f:metadata:
              .: {}
              f:annotations:
                .: {}
                f:io.kubernetes.cri-o.userns-mode: {}
              f:labels:
                .: {}
                f:app: {}
            f:spec:
              .: {}
              f:containers: {}
              f:nodeSelector:
                .: {}
                f:agentpool: {}
              f:runtimeClassName: {}
          f:workDir: {}
      manager: helm
      operation: Update
      time: '2022-01-07T04:54:49Z'
  name: selfhosted-autoscaling-syd-runnerset
  namespace: actions-runner-system
  resourceVersion: '85748453'
  uid: ac676a3a-2781-4adb-a4a1-077b6b0466d2
  selfLink: >-
    /apis/actions.summerwind.dev/v1alpha1/namespaces/actions-runner-system/runnersets/selfhosted-autoscaling-syd-runnerset
status:
  availableReplicas: 6
  desiredReplicas: 6
  readyReplicas: 6
  replicas: 6
  updatedReplicas: 6
spec:
  dockerdWithinRunnerContainer: true
  ephemeral: false
  image: bglmelnpegensp790.azurecr.io/selfhostedrunners/generic:v2.4
  labels:
    - self-hosted
    - Linux
    - X64
    - generic
  organization: Bunnings-Technology-Delivery
  replicas: 6
  selector:
    matchLabels:
      app: selfhosted-autoscaling-syd-runnerset
  serviceName: selfhosted-autoscaling-syd
  template:
    metadata:
      annotations:
        io.kubernetes.cri-o.userns-mode: auto:size=65536
      labels:
        app: selfhosted-autoscaling-syd-runnerset
    spec:
      containers:
        - env:
            - name: GH_RUNNER_LABELS
              value: generic
          imagePullPolicy: IfNotPresent
          name: runner
          resources:
            limits:
              cpu: 1000m
              memory: 2048Mi
            requests:
              cpu: 250m
              memory: 512Mi
      nodeSelector:
        agentpool: btd1
      runtimeClassName: sysbox-runc
  workDir: /tmp/_work
mumoshu commented 2 years ago

@graemechristie Hey. It's expected behavior since 0.22.0! Please refer to https://github.com/actions-runner-controller/actions-runner-controller/blob/master/docs/releasenotes/0.22.md

graemechristie commented 2 years ago

I don't understand; how could this be expected behaviour? Or am I misunderstanding: we now have one StatefulSet per pod? The whole advantage of using StatefulSets is that you get consistent pod naming for replicas, allowing you to predictably re-bind PVCs and other resources. What possible reason could there be for moving to a StatefulSet per replica?
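
For context, this is the classic StatefulSet pattern being referred to: a single StatefulSet whose pods get stable ordinal names, so each replica predictably re-binds the same PVC across restarts. A minimal sketch, with illustrative names:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: runner
spec:
  replicas: 3                       # pods: runner-0, runner-1, runner-2
  serviceName: runner
  selector:
    matchLabels:
      app: runner
  template:
    metadata:
      labels:
        app: runner
    spec:
      containers:
        - name: runner
          image: example/runner:latest     # placeholder image
          volumeMounts:
            - name: work
              mountPath: /work
  volumeClaimTemplates:
    - metadata:
        name: work                  # PVCs: work-runner-0, work-runner-1, work-runner-2
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi

Because pod runner-0 always re-binds PVC work-runner-0, state survives pod restarts; that is the naming guarantee being discussed.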

mumoshu commented 2 years ago

Then try https://github.com/actions-runner-controller/actions-runner-controller/pull/1167

we now have one StatefulSet per pod?

Yes.

The whole advantage of using StatefulSets is that you get consistent pod naming for replicas, allowing you to predictably re-bind PVCs and other resources

Yeah, but on the other hand, that resulted in runners racing with GitHub Actions: sometimes workflow jobs were cancelled prematurely. We should never reuse the same runner pod name (which corresponds to the name of the Actions runner registered with GitHub Actions), hence statefulset-per-runner.

We're re-implementing the ability to rebind PVs via https://github.com/actions-runner-controller/actions-runner-controller/pull/1340

graemechristie commented 2 years ago

I'm lost for words; this is insane. In what universe would anyone want a StatefulSet which only ever has one replica? You're effectively not using StatefulSets for anything that would be considered their original purpose; they are just an odd wrapper for a container.

You may as well just remove the entire RunnerSet implementation; I can't see what it achieves.

mumoshu commented 2 years ago

@graemechristie I disagree. My two cents is that you can't treat StatefulSet as a universal runtime for every stateful application. We use it to ease the implementation because it does help us create PVCs and PVs from templates. That's much better than reimplementing everything StatefulSet provides.

See other examples of stateful applications running on K8s. Vitess, for example, manages pods by itself. Not every stateful app can leverage StatefulSet as-is.

ARC can safely leverage StatefulSet because volumeClaimTemplates and the statefulset controller's functionality of creating PVCs and PVs are exactly what we want. It's just that the GitHub Actions API doesn't give us a way to prevent the runner from being terminated prematurely when a runner pod is restarted and registered with the same name, hence we need StatefulSet-per-Runner to avoid name reuse.
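
Concretely, StatefulSet-per-Runner means each runner gets its own single-replica StatefulSet with a freshly generated name, so the pod name (and hence the registered runner name) is never reused, while volumeClaimTemplates still stamps out PVCs and PVs. A rough sketch of one such object; the hash suffix, image, and mount path here are hypothetical:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: example-runnerset-x7kq2     # hypothetical hash-suffixed name, one per runner
spec:
  replicas: 1                       # always exactly one replica per StatefulSet
  serviceName: example-runnerset
  selector:
    matchLabels:
      app: example-runnerset
  template:
    metadata:
      labels:
        app: example-runnerset
    spec:
      containers:
        - name: runner
          image: example/runner:latest     # placeholder image
          volumeMounts:
            - name: work
              mountPath: /runner/_work     # hypothetical work directory
  volumeClaimTemplates:
    - metadata:
        name: work                  # PVC: work-example-runnerset-x7kq2-0
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi

As described above, each new runner comes with a fresh suffix, so a restarted runner never registers under a previously used name, avoiding the race in which jobs got cancelled prematurely.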

Did you actually read #1340 and the 0.22.0 release notes? Without doing so, you can't get the full context. If you had read them, I don't think you could so easily say something like that.

RunnerSet does provide PVs shared across runner instances, especially after #1340. Why do we have to remove it? Do you really understand the use case for RunnerSet and how it differs from RunnerDeployment? I really don't understand what you're saying.

mumoshu commented 2 years ago

I wish we were in a universe where stateful runners could safely be deployed as a single StatefulSet. It turned out that isn't true. But is that my fault...? I don't think it's fair to call the outcome of someone's weeks of effort, which does solve real-world problems, "insane". Also, I don't want to be forced to stick with something that looks legit but doesn't actually solve the problem.