clepsag opened this issue 1 year ago
We had a similar problem. The cluster was set up zone-redundant, but the storage class used LRS disks, which can only attach to nodes in the same zone. When a new pod came up on a node in another zone, the available PV was not attached and a new one was provisioned; with 3 zones the PVs quickly piled up.
We now schedule the runners in one zone only.
Check the attach-detach controller events for additional information.
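If pinning runners to a single zone isn't desirable, another option on AKS (assuming the Azure Disk CSI driver is in use) is a zone-redundant disk SKU, so that a freed PV can attach to nodes in any zone. A minimal sketch, with the class name and SKU chosen here as assumptions:

```yaml
# Hypothetical StorageClass using a zone-redundant disk SKU so the PV is not
# pinned to one zone; name and parameters are illustrative only.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: var-lib-docker-zrs
provisioner: disk.csi.azure.com
parameters:
  skuName: StandardSSD_ZRS   # Premium_ZRS is another zone-redundant option
volumeBindingMode: WaitForFirstConsumer
```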
Similar for us. In our case there were already 115 "Available" PVs with the name var-lib-docker. I noticed that new PVs keep getting created even though at most 30 pods request a PV via the claim below. The volume claim template on our RunnerSet resource looks like this:
```yaml
volumeClaimTemplates:
- metadata:
    name: var-lib-docker
  spec:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 30Gi
    storageClassName: var-lib-docker
```
PVs keep accumulating in this way. Any idea why this would happen?
I assume this is related to this issue. The problem is that the StatefulSet isn't being scaled; instead, additional StatefulSets are added to the RunnerSet. So when the old StatefulSets are deleted (again, not scaled down), their PVs persist.
For us this seems to happen on Azure AKS when a newly scaled-up pod cannot be scheduled immediately and has to wait for Azure node autoscaling. After AKS adds a new node, the pod is scheduled, but the PVC events then show the following:
```
Normal WaitForFirstConsumer 4m7s persistentvolume-controller waiting for first consumer to be created before binding
Normal WaitForPodScheduled 2m51s (x7 over 4m7s) persistentvolume-controller waiting for pod poc-runnerset-sj9gd-0 to be scheduled
Normal ExternalProvisioning 2m45s persistentvolume-controller waiting for a volume to be created, either by external provisioner "disk.csi.azure.com" or manually created by system administrator
Warning ProvisioningFailed 2m44s (x2 over 2m45s) disk.csi.azure.com_csi-azuredisk-controller-665c6f77c7-wwwqx_2d20d2a4-432d-4c9b-9359-c1fb1961164a failed to provision volume with StorageClass "arc-var-lib-docker": error generating accessibility requirements: no topology key found on CSINode aks-defaultnp-56690459-vmss000008
Normal Provisioning 2m42s (x3 over 2m45s) disk.csi.azure.com_csi-azuredisk-controller-665c6f77c7-wwwqx_2d20d2a4-432d-4c9b-9359-c1fb1961164a External provisioner is provisioning volume for claim "actions-runner-system/var-lib-docker-a-poc-runnerset-sj9gd-0"
Normal ProvisioningSucceeded 2m39s disk.csi.azure.com_csi-azuredisk-controller-665c6f77c7-wwwqx_2d20d2a4-432d-4c9b-9359-c1fb1961164a Successfully provisioned volume pvc-aa32de10-bbcc-41f9-9dda-ae16ce6075d1
```
Even though there are plenty of free PVs, every time right after a node scale-up the first PVC fails to bind to an existing free PV and a new one is created. When the next runner pod is scheduled on this new node, it does manage to bind one of the existing PVs, so it looks like a race condition between the ARC controller pod and the CSI provisioner.
Every time after a node scale-up we end up with one extra PV.
@mhuijgen Did you ever figure out any sort of solution for this? We're running into the exact same issue using EKS with autoscaling and the EBS CSI driver for dynamic PV provisioning.
@benmccown No, unfortunately not. The same thing also occurs occasionally even without node scaling in our tests, making this feature unusable for us at the moment; node scale-up just makes the issue appear more often. It seems to be a race condition between the runner controller trying to link the new PVC to an existing volume and the cluster's dynamic provisioner creating a new PV.
@mhuijgen Thanks for the response. For you (and anyone else who runs into this issue), I think I've come up with the best workaround available at the moment, which is basically to abandon dynamic storage provisioning entirely and use static PVs. I'll provide details on our workaround, but first a brief summary of our setup and use case, in case the ARC maintainers read this.
We are using cluster-autoscaler to manage our EKS autoscaling groups. We have a dedicated node group for our actions runners (I'll use the term CI runners). We use node labels, node taints, and resource requests so that only GitHub CI pods run on the CI runner node group, so each CI pod runs in a 1:1 relationship with nodes (one pod per node). We have 0 set as the minimum autoscaling size for this node group. We're using a RunnerSet in combination with a HorizontalRunnerAutoscaler to deploy the CI pods with ARC. The final piece of the puzzle is that our CI image is rather heavy at 5GB+.
We regularly scale down to zero nodes in periods of inactivity, but we might have a burst of activity where several workflows are kicked off and thus several CI pods are created and scheduled. Cluster autoscaler then scales out our ASG and joins new nodes to the cluster to run the CI workloads. Without any sort of image cache we waste ~2m30s on every single CI job pulling our container image into the dind (docker-in-docker) container within the CI pod. We could set ephemeral: false in our RunnerSet/RunnerDeployment, but that still doesn't solve the autoscaling problem if we scale down and back up. So we really need image caching to work in order to use autoscaling effectively. We accomplish this by mounting ReadWriteOnce PVs (EBS volumes) at /var/lib/docker so that each PV can only be mounted once (since sharing /var/lib/docker is bad practice); a rough sketch of this wiring is shown below.
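For illustration, a RunnerSet wired up this way might look roughly like the following (a minimal sketch assuming the legacy actions.summerwind.dev RunnerSet API; the names, sizes, and the exact container the cache mounts into are assumptions and may differ from the poster's actual manifest):

```yaml
# Hypothetical RunnerSet that keeps the docker layer cache on a per-pod
# ReadWriteOnce PV; field names follow the actions.summerwind.dev/v1alpha1 API.
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerSet
metadata:
  name: ci-runners
spec:
  organization: example-org            # placeholder
  ephemeral: true
  replicas: 3
  dockerdWithinRunnerContainer: true   # dockerd (and /var/lib/docker) lives in the runner container
  selector:
    matchLabels:
      app: ci-runners
  serviceName: ci-runners
  template:
    metadata:
      labels:
        app: ci-runners
    spec:
      containers:
      - name: runner
        volumeMounts:
        - name: var-lib-docker
          mountPath: /var/lib/docker
  volumeClaimTemplates:
  - metadata:
      name: var-lib-docker
    spec:
      accessModes:
      - ReadWriteOnce
      storageClassName: var-lib-docker
      resources:
        requests:
          storage: 30Gi
```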
The issue we're seeing has already been detailed well by @mhuijgen and is definitely some sort of race condition, as they said. The result (in my testing) is that in a few short days we had 20+ persistent volumes provisioned, and the number kept growing. Aside from the orphaned PVs (and resulting EBS volumes) costing us money, the major downside is that almost 100% of the time when a new CI job is scheduled, a pod is created, and a worker node is created for it (by cluster autoscaler during scale-out), a new PV is created instead of an existing one being reused, which completely defeats the point of an image cache and any of its performance benefits.
The workaround for us is to provision static PVs using Terraform instead of letting the EBS CSI controller manage dynamic volume provisioning.
We're using Terraform to deploy/manage our EKS cluster, EKS node groups, and associated resources (Helm charts and raw k8s resources too). I set up a basic for loop in Terraform that provisions N EBS volumes and static PVs, where N is the maximum autoscaling group size for my CI runners node group. Right now this value is set to 10 (the minimum autoscaling size is 0), so 10 EBS volumes are provisioned and 10 persistent volumes are provisioned and tied to the respective EBS volumes, with a storage class of "manual-github-arc-runners" or something to that effect. There will always be an EBS volume and associated PV for every node, since they're tied to the same max_autoscaling_size var in our infrastructure as code.
This way the CSI controller isn't trying to create dynamic PVs at all and the volumes are always reused. So the race condition is eliminated by removing 1 of the 2 parties participating in the "race".
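For reference, one of these statically provisioned PVs might look roughly like this (a sketch assuming the EBS CSI driver's static-provisioning format; the volume ID, zone, and names are placeholders):

```yaml
# Hypothetical pre-created PV pointing at an existing EBS volume.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: github-arc-runner-cache-0
spec:
  capacity:
    storage: 30Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: manual-github-arc-runners
  csi:
    driver: ebs.csi.aws.com
    fsType: ext4
    volumeHandle: vol-0123456789abcdef0   # ID of the Terraform-created EBS volume (placeholder)
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.ebs.csi.aws.com/zone
          operator: In
          values:
          - us-east-1a                    # the single AZ the node group runs in (placeholder)
```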
The downsides here are that EBS volumes are availability-zone specific, so I have to put the node group in a single subnet and availability zone. You're also paying for the maximum number of EBS volumes, which is a downside I guess, except that the bug we're running into means you'd eventually end up with WAY more volumes than your max autoscaling size anyway.
I'll probably set up a GH workflow that runs 10 parallel jobs daily to ensure the container image is pulled down and up to date on the PVs.
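A warm-up workflow along those lines could look roughly like this (hypothetical; the schedule, runner labels, matrix size, and image are all placeholders):

```yaml
# Hypothetical daily cache warm-up: run one job per static PV slot so every
# cache volume pulls the current CI image.
name: warm-runner-image-cache
on:
  schedule:
  - cron: "0 5 * * *"
jobs:
  warm:
    runs-on: [self-hosted, ci-runners]   # placeholder runner labels
    strategy:
      matrix:
        slot: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    steps:
    - run: docker pull ghcr.io/example-org/ci-image:latest   # placeholder image
```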
Hope this helps someone in the future.
I think I am hitting the same bug. As far as I can tell, it began after my transition from the built-in EBS provisioner to the EBS CSI provisioner.
For example, using dynamically allocated PV/PVC with a StorageClass that looks like this works correctly (PVs don't build up forever):
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2
parameters:
  fsType: ext4
  type: gp2
provisioner: kubernetes.io/aws-ebs
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: false
```
However, a dynamically allocated PV/PVC with a StorageClass that looks like this builds up PVs:
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
parameters:
  csi.storage.k8s.io/fstype: xfs
  encrypted: "true"
  type: gp3
provisioner: ebs.csi.aws.com
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: false
```
I think this issue is related or a duplicate: https://github.com/actions/actions-runner-controller/issues/2266
Here's some info I found.
Like @mhuijgen, I noticed peculiar warnings in our events:
error generating accessibility requirements: no topology key found on CSINode ip-10-1-38-149.eu-west-1.compute.internal
Each of these warnings coincided with the creation of a new volume. Checking the CSINode resource in Kubernetes revealed that the topology keys were set, though:
```yaml
apiVersion: storage.k8s.io/v1
kind: CSINode
# [...]
spec:
  drivers:
  # [...]
  - # [...]
    name: ebs.csi.aws.com
    topologyKeys:
    - topology.ebs.csi.aws.com/zone
```
So I came to the conclusion there is indeed a race condition: somehow, if the CSI node doesn't have the topology keys set at the moment a volume is requested, then a new volume is created, even though there could be plenty available. This explains why this issue only happens with pods scheduled on fresh nodes.
So I've put in place a workaround. In short, it consists of:

- a label and a taint applied to every new runner node when it joins the cluster;
- a DaemonSet targeting this label, whose purpose is to wait for the topology keys on the CSINode to be set, at which point it removes the label and taint, thus allowing runners to schedule and preventing itself from scheduling on the node again.

So far, it's been working great. Our EBS volumes are consistently being reused.
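A minimal sketch of what such a DaemonSet could look like (the label and taint keys, image, and ServiceAccount names are hypothetical, and the ServiceAccount needs RBAC permissions to get csinodes and to label/taint nodes):

```yaml
# Hypothetical DaemonSet that holds runner scheduling until the node's CSINode
# object reports its topology keys, then removes the startup label and taint.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: wait-for-csinode-topology
spec:
  selector:
    matchLabels:
      app: wait-for-csinode-topology
  template:
    metadata:
      labels:
        app: wait-for-csinode-topology
    spec:
      serviceAccountName: wait-for-csinode-topology   # needs get csinodes, patch nodes
      nodeSelector:
        example.com/csi-topology-pending: "true"      # hypothetical startup label
      tolerations:
      - key: example.com/csi-topology-pending         # hypothetical startup taint
        operator: Exists
        effect: NoSchedule
      containers:
      - name: wait
        image: bitnami/kubectl:latest
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        command:
        - /bin/sh
        - -c
        - |
          # Wait until this node's CSINode reports a zone topology key,
          # then drop the startup label and taint so runner pods can schedule.
          until kubectl get csinode "$NODE_NAME" -o jsonpath='{.spec.drivers[*].topologyKeys}' | grep -q zone; do
            sleep 2
          done
          kubectl label node "$NODE_NAME" example.com/csi-topology-pending-
          kubectl taint node "$NODE_NAME" example.com/csi-topology-pending-
          # Removing the label makes the DaemonSet controller evict this pod.
          sleep infinity
```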
I don't know the exact root cause here, but I'm pretty sure it's not ARC's fault. As a matter of fact, it seems someone is able to reproduce the issue here with simple StatefulSets.
Controller Version
0.22.0
Helm Chart Version
v0.7.2
CertManager Version
v1.6.1
Deployment Method
Helm
Describe the bug
I configured volumeClaimTemplates on the RunnerSet with 5 replicas. The volumeClaimTemplates contain 2 persistent volume claims: one for docker and another for gradle. The runners are configured with ephemeral: true.
At the start of the RunnerSet deployment, 10 PVs (5 × 2: one for docker and one for gradle per runner) are created and bound to the runner pods. When a workflow assigned to a runner completes, the runner pod is deleted and a brand-new pod is created and listens for jobs. Unfortunately, the newly created pod does not attach the recently freed, available PVs (from the deleted runner pod); instead it creates a new set of PVs and attaches those.
Over time these redundant PVs accumulate and the cluster runs out of disk space.
Describe the expected behavior
PVs created by the RunnerSet deployment should be reused efficiently.
When a workflow is run and completed on a runner, the newly created replacement pod should attach the recently freed, available PVs from the deleted runner pod.
Additional Context
NA