actions / actions-runner-controller

Kubernetes controller for GitHub Actions self-hosted runners
Apache License 2.0

RunnerSet does not always re-use `Available` PV #3221

Open chaosun-abnormalsecurity opened 8 months ago

chaosun-abnormalsecurity commented 8 months ago

Checks

Controller Version

0.27.0

Helm Chart Version

0.22.0

CertManager Version

No response

Deployment Method

Helm

cert-manager installation

Checks

Resource Definitions

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerSet
metadata:
  name: gha-runner
  namespace: cicd--ci
spec:
  dockerEnabled: true
  ephemeral: true
  group: Default
  labels:
  - ci
  replicas: 3
  repository: <REPOSITORY>
  selector:
    matchLabels:
      app: ci
  serviceName: gha-runner
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
        kubectl.kubernetes.io/default-logs-container: runner
      labels:
        app: ci
    spec:
      containers:
      - env:
        - name: DISABLE_RUNNER_UPDATE
          value: "true"
        - name: RUNNER_ALLOW_RUNASROOT
          value: "1"
        - name: RUNNER_GRACEFUL_STOP_TIMEOUT
          value: "110"
        - name: STARTUP_DELAY_IN_SECONDS
          value: "30"
        name: runner
        resources:
          limits:
            cpu: "1.8"
            memory: 7Gi
          requests:
            cpu: "1.5"
            memory: 6Gi
      - name: docker
        volumeMounts:
        - mountPath: /var/lib/docker
          name: docker
      securityContext:
        fsGroup: 1001
      serviceAccountName: gha-runner
  volumeClaimTemplates:
  - metadata:
      name: docker
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 200Gi

To Reproduce

1. Deploy the RunnerSet above and let it run jobs normally
2. Observe that the number of `Available` PVs grows quickly over time (we reached 4.5k in 1 month)

Describe the bug

  1. We noticed that the number of `Available` PVs grows quickly over time and reached 4.5k in 1 month. This indicates that the RunnerSet is not re-using PVs properly.
  2. We also noticed that some PVs are indeed being re-used, e.g. a Runner that was created 10m ago is using a PV that is 18d old. But the majority of Runners just spin up new volumes.
  3. We use a custom runner image built on top of docker.io/summerwind/actions-runner. The only difference is that we installed a few additional libraries and binaries (e.g. kubectl, helm, aws cli). We are not using a custom entrypoint.
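
For anyone who wants to check the same numbers on their own cluster, here is a minimal sketch of how the per-phase PV counts above could be gathered. It assumes the output of `kubectl get pv -o json` has been loaded into a dict; the function name `count_pv_phases` and the sample data (including the `gp3` storage class) are hypothetical and only for illustration.

```python
from collections import Counter


def count_pv_phases(pv_list: dict) -> Counter:
    """Count PersistentVolumes per (storageClassName, phase).

    `pv_list` is the parsed JSON of `kubectl get pv -o json`.
    """
    counts = Counter()
    for pv in pv_list.get("items", []):
        storage_class = pv.get("spec", {}).get("storageClassName", "")
        phase = pv.get("status", {}).get("phase", "Unknown")
        counts[(storage_class, phase)] += 1
    return counts


# Hypothetical sample shaped like `kubectl get pv -o json` output.
sample = {
    "items": [
        {"spec": {"storageClassName": "gp3"}, "status": {"phase": "Available"}},
        {"spec": {"storageClassName": "gp3"}, "status": {"phase": "Available"}},
        {"spec": {"storageClassName": "gp3"}, "status": {"phase": "Bound"}},
    ]
}

for (storage_class, phase), n in sorted(count_pv_phases(sample).items()):
    print(f"{storage_class}\t{phase}\t{n}")
```

With real data, `sample` would be replaced by `json.load(sys.stdin)` and the script fed from `kubectl get pv -o json`. A steadily rising `Available` count alongside a roughly constant `Bound` count would match the behavior described here.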

Describe the expected behavior

As described in the doc and discussion, ARC should maintain a pool of persistent volumes to be re-used by Runners, instead of provisioning new ones for most of the Runners.

Whole Controller Logs

https://gist.github.com/chaosun-abnormalsecurity/4d92b87f3807fcbaa279e1099200d20e

Whole Runner Pod Logs

https://gist.github.com/chaosun-abnormalsecurity/4879c98298f992698ee6824c9a2d4bb6

Additional Context

No response

github-actions[bot] commented 8 months ago

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

waveofmymind commented 8 months ago

I ran into this problem today. It creates a new PV even though an `Available` PV exists. I'm wondering whether a PV needs some time after being unbound before it becomes available again, or whether this is a bug.
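
One way to narrow this down is to separate PVs that are genuinely `Available` from those that are still `Released`. In Kubernetes, a volume whose claim was deleted stays in the `Released` phase, with `spec.claimRef` still set, and cannot be bound by a new claim until that reference is cleared. The sketch below only classifies volumes from the parsed output of `kubectl get pv -o json`; the function name `split_by_reusability` and the sample volume names are hypothetical, and it makes no claim about how ARC itself handles this.

```python
def split_by_reusability(pv_list: dict) -> dict:
    """Group PV names by whether a new claim could bind them right now."""
    result = {"available": [], "released": [], "other": []}
    for pv in pv_list.get("items", []):
        name = pv.get("metadata", {}).get("name", "")
        phase = pv.get("status", {}).get("phase", "")
        has_claim_ref = "claimRef" in pv.get("spec", {})
        if phase == "Available" and not has_claim_ref:
            # Free for any matching claim.
            result["available"].append(name)
        elif phase == "Released":
            # The old claim is gone but the volume has not been made available again.
            result["released"].append(name)
        else:
            # Bound, Failed, or Available but reserved for a specific claim.
            result["other"].append(name)
    return result


# Hypothetical sample shaped like `kubectl get pv -o json` output.
sample_pvs = {
    "items": [
        {"metadata": {"name": "pv-a"}, "spec": {}, "status": {"phase": "Available"}},
        {"metadata": {"name": "pv-b"}, "spec": {"claimRef": {"name": "docker-gha-runner-0"}},
         "status": {"phase": "Released"}},
        {"metadata": {"name": "pv-c"}, "spec": {"claimRef": {"name": "docker-gha-runner-1"}},
         "status": {"phase": "Bound"}},
    ]
}

print(split_by_reusability(sample_pvs))
```

If most idle volumes land in `available` and new PVs are still being provisioned, a delay in unbinding is unlikely to be the whole explanation.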

rdepres commented 8 months ago

I believe this issue is a duplicate of #2282.