actions / actions-runner-controller

Kubernetes controller for GitHub Actions self-hosted runners

PV build-up with reclaim policy set to Delete #2266

Open · harshaisgud opened this issue 1 year ago

harshaisgud commented 1 year ago

Checks

Controller Version

0.27.0

Helm Chart Version

0.22.0

CertManager Version

1.10.1

Deployment Method

Helm

cert-manager installation

Yes, I have installed cert-manager following the steps mentioned in the documentation.

Resource Definitions

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerSet
metadata:
  name: example-1
spec:
  replicas: 1
  organization: xyz
  labels: 
    - arc-1
    - linux
  selector:
    matchLabels:
      app: example
  serviceName: example
  template:
    metadata:
      labels:
        app: example
    spec:
      containers:
      - name: docker
        volumeMounts:
        - name: var-lib-docker
          mountPath: /var/lib/docker
  volumeClaimTemplates:
  - metadata:
      name: var-lib-docker
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 7Gi
      storageClassName: gh-ebs
      dataSource:
        name: ebs-volume-snapshot
        kind: VolumeSnapshot
        apiGroup: snapshot.storage.k8s.io
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gh-ebs
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete

To Reproduce

1. Install ARC.
2. Deploy a RunnerSet with a volumeClaimTemplate.
3. Run a couple of workflows.
4. Observe that PVs with reclaimPolicy Delete build up even though their PVCs are deleted (see the snippet below).
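
A quick way to observe the buildup, assuming kubectl is pointed at the affected cluster (this command is an illustration, not part of the original report):

# List PVs with their phase, reclaim policy, and the claim they were bound to.
# Leftover volumes show up as Available or Released even though reclaimPolicy is Delete.
kubectl get pv -o custom-columns=NAME:.metadata.name,STATUS:.status.phase,RECLAIM:.spec.persistentVolumeReclaimPolicy,CLAIM:.spec.claimRef.name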

Describe the bug

Dynamically provisioned Persistent Volumes that are in the Available state cannot be cleaned up by the EBS CSI driver; it fails with an error saying the volume is still attached to the node. Example log: delete "pvc-df682ae3-3b7b-4599-bdce-e9b17dda2a7a": volume deletion failed: persistentvolume pvc-df682ae3-3b7b-4599-bdce-e9b17dda2a7a is still attached to node ip-10-10-2-152.eu-central-1.compute.internal.

Describe the expected behavior

Dynamically provisioned Persistent Volumes with reclaimPolicy set to Delete should be deleted when the corresponding PVC is deleted.

Whole Controller Logs

2023-02-06T12:33:19Z    DEBUG   runnerpersistentvolume  Retrying sync until pvc gets released   {"pv": "/pvc-df682ae3-3b7b-4599-bdce-e9b17dda2a7a", "requeueAfter": "10s"}
2023-02-06T12:33:19Z    ERROR   Reconciler error    {"controller": "runnerpersistentvolumeclaim-controller", "controllerGroup": "", "controllerKind": "PersistentVolumeClaim", "PersistentVolumeClaim": {"name":"var-lib-docker-nitro-1-5d5sx-0","namespace":"actions-runner-system"}, "namespace": "actions-runner-system", "name": "var-lib-docker-example-1-5d5sx-0", "reconcileID": "7aeac10f-6998-430e-8c2a-adc94b385299", "error": "Operation cannot be fulfilled on persistentvolumes \"pvc-df682ae3-3b7b-4599-bdce-e9b17dda2a7a\": the object has been modified; please apply your changes to the latest version and try again"}
2023-02-06T12:33:19Z    INFO    runnerpersistentvolume  PV should be Available now  {"pv": "/pvc-df682ae3-3b7b-4599-bdce-e9b17dda2a7a"}
2023-02-06T14:29:08Z    DEBUG   runnerpersistentvolume  Retrying sync until pvc gets released   {"pv": "/pvc-df682ae3-3b7b-4599-bdce-e9b17dda2a7a", "requeueAfter": "10s"}
2023-02-06T14:29:08Z    INFO    runnerpersistentvolume  PV should be Available now  {"pv": "/pvc-df682ae3-3b7b-4599-bdce-e9b17dda2a7a"}
2023-02-06T14:32:22Z    DEBUG   runnerpersistentvolume  Retrying sync until pvc gets released   {"pv": "/pvc-df682ae3-3b7b-4599-bdce-e9b17dda2a7a", "requeueAfter": "10s"}
2023-02-06T14:32:22Z    INFO    runnerpersistentvolume  PV should be Available now  {"pv": "/pvc-df682ae3-3b7b-4599-bdce-e9b17dda2a7a"}

Whole Runner Pod Logs

Not applicable; the issue is not related to the runner pod logs.

Additional Context

I suspect the issue is caused by the pending [kubernetes.io/pv-protection] finalizer on the PV. Deleting the Persistent Volumes in Kubernetes does not delete the underlying AWS EBS volumes.
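
For reference, a quick way to confirm this, assuming the default kubectl context (the volume name is taken from the controller logs above; these commands are illustrative, not part of the original report):

# Show which finalizers are still set on the stuck PV.
kubectl get pv pvc-df682ae3-3b7b-4599-bdce-e9b17dda2a7a -o jsonpath='{.metadata.finalizers}'

# VolumeAttachment objects show whether the CSI driver still considers the volume attached to a node.
kubectl get volumeattachments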

github-actions[bot] commented 1 year ago

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

irasnyd commented 7 months ago

I am hitting the same bug. It began after I migrated from the built-in (in-tree) EBS provisioner to the EBS CSI driver.

For example, dynamically allocated PVs/PVCs with a StorageClass that looks like this work correctly (PVs don't build up forever):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2
parameters:
  fsType: ext4
  type: gp2
provisioner: kubernetes.io/aws-ebs
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: false

However, dynamically allocated PVs/PVCs with a StorageClass that looks like this build up PVs:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
parameters:
  csi.storage.k8s.io/fstype: xfs
  encrypted: "true"
  type: gp3
provisioner: ebs.csi.aws.com
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: false
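
To check which provisioner created a given leftover PV, a command like the following can help (illustrative; <stuck-pv-name> is a placeholder, not a name from this report):

# Dynamically provisioned PVs record their provisioner in the pv.kubernetes.io/provisioned-by annotation;
# CSI-provisioned volumes also carry the driver name under spec.csi.driver.
kubectl get pv <stuck-pv-name> -o yaml | grep -E 'provisioned-by|driver:'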

midnattsol commented 4 months ago

We are hitting the same bug.

We're currently testing a solution. If it keeps working well after a couple of days, I will open a PR.

For those who want to test it as well, I have a custom image for version v0.26.7 on Docker Hub; it is currently under testing.

jheidbrink commented 2 months ago

Is this related to https://github.com/kubernetes-sigs/aws-ebs-csi-driver/issues/1507?