backube / snapscheduler

Scheduled snapshots for Kubernetes persistent volumes
https://backube.github.io/snapscheduler/
GNU Affero General Public License v3.0

Retention policy removes last valid snapshot, leaving no possibility of recovery #688

Open mnacharov opened 2 weeks ago

mnacharov commented 2 weeks ago

Describe the bug VolumeSnapshot has a .status.readyToUse flag that indicates whether a snapshot is ready to be used to restore a volume. snapscheduler does not take this flag into account when deciding whether the maxCount retention limit has been reached. This can result in the loss of the last remaining opportunity for recovery.
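For context, the readiness check the retention logic would need to consult is roughly the following. This is a minimal sketch against the external-snapshotter client types; the client/v6 module version is an assumption and may not match what snapscheduler actually imports:

    package main

    import (
        "fmt"

        snapv1 "github.com/kubernetes-csi/external-snapshotter/client/v6/apis/volumesnapshot/v1"
    )

    // snapshotIsReady reports whether a VolumeSnapshot can currently be used to
    // restore a volume, i.e. .status.readyToUse is present and true.
    func snapshotIsReady(snap *snapv1.VolumeSnapshot) bool {
        return snap.Status != nil && snap.Status.ReadyToUse != nil && *snap.Status.ReadyToUse
    }

    func main() {
        // A freshly created snapshot has no status yet, so it is not ready.
        fmt.Println(snapshotIsReady(&snapv1.VolumeSnapshot{})) // prints: false
    }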

Steps to reproduce in GKE (v1.28.11 in my case) with snapscheduler (v3.4.0) installed:

  1. create PVC:
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: snapscheduler-test
      namespace: default
      labels:
        snapscheduler-test: "true"
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 1Gi
      storageClassName: standard-rwo
  2. run a pod with the new PVC in order to provision the volume:
    $ kubectl -n default run -it --rm snapscheduler-test --image=gcr.io/distroless/static-debian12 --overrides='{"spec": {"restartPolicy": "Never", "volumes": [{"name": "pvc", "persistentVolumeClaim":{"claimName": "snapscheduler-test"}}]}}' -- sh
  3. create SnapshotSchedule:
    apiVersion: snapscheduler.backube/v1
    kind: SnapshotSchedule
    metadata:
      name: snapscheduler-test
      namespace: default
    spec:
      claimSelector:
        matchLabels:
          snapscheduler-test: "true"
      retention:
        maxCount: 3
      schedule: "*/5 * * * *"
  4. wait 5-10 minutes and make sure that VolumeSnapshots are being created successfully:
    $ kubectl -n default get volumesnapshot
    NAME                                                 READYTOUSE   SOURCEPVC            SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS   SNAPSHOTCONTENT                                    CREATIONTIME   AGE
    snapscheduler-test-snapscheduler-test-202408301525   true         snapscheduler-test                           1Gi           p2p-csi         snapcontent-4f748e4d-80d8-4353-8819-a6efb2836821   87s            2m6s
  5. delete the compute disk in GCP (via the web UI or a gcloud command) -- a human error has happened:
    $ pv=$(kubectl -n default get pvc snapscheduler-test -ojsonpath='{.spec.volumeName}')
    $ zone=$(gcloud --project=$GCP_PROJECT compute disks list --filter="name=($pv)"|grep pvc|awk '{print $2}')
    $ gcloud --project=$GCP_PROJECT compute disks delete $pv --zone $zone
  6. after 10 minutes there are two VolumeSnapshots with readyToUse=false:
    $ kubectl -n default get volumesnapshot
    NAME                                                 READYTOUSE   SOURCEPVC            SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS   SNAPSHOTCONTENT                                    CREATIONTIME   AGE
    snapscheduler-test-snapscheduler-test-202408301525   true         snapscheduler-test                           1Gi           p2p-csi         snapcontent-4f748e4d-80d8-4353-8819-a6efb2836821   10m            11m
    snapscheduler-test-snapscheduler-test-202408301530   false        snapscheduler-test                                         p2p-csi         snapcontent-cec59c70-c186-44fd-99f8-9226192d7a6a                  6m38s
    snapscheduler-test-snapscheduler-test-202408301535   false        snapscheduler-test                                         p2p-csi         snapcontent-d81644f4-eb28-4da9-94b5-d57f1972aeb3                  98s
  7. after 15 minutes we don't have any valid snapshot anymore (due to the maxCount: 3 retention policy):
    $ kubectl -n default get volumesnapshot
    NAME                                                 READYTOUSE   SOURCEPVC            SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS   SNAPSHOTCONTENT                                    CREATIONTIME   AGE
    snapscheduler-test-snapscheduler-test-202408301530   false        snapscheduler-test                                         p2p-csi         snapcontent-cec59c70-c186-44fd-99f8-9226192d7a6a                  13m
    snapscheduler-test-snapscheduler-test-202408301535   false        snapscheduler-test                                         p2p-csi         snapcontent-d81644f4-eb28-4da9-94b5-d57f1972aeb3                  8m6s
    snapscheduler-test-snapscheduler-test-202408301540   false        snapscheduler-test                                         p2p-csi         snapcontent-b6113f79-3219-435d-8321-812ddc096154                  3m6s

Expected behavior ❗ the retention policy must not count VolumeSnapshots with .status.readyToUse == false. ❔ if possible, create a new snapshot only after the previous one has become ready.

Actual results The retention policy removes the last valid snapshot, leaving no possibility of recovery.

Additional context

JohnStrunk commented 1 week ago

I agree... that's not good. I'm happy to hear thoughts/suggestions on a good fix.

A few ideas:

  1. Only count readyToUse snapshots when implementing the cleanup policy. This runs the risk of creating an unbounded number of (unready) snapshots, potentially consuming all available space (or incurring excessive expense).
  2. Skip the next snapshot if the previous one is not ready. This will cause problems in environments where it takes a long time for a snapshot to become ready (e.g., AWS), causing SnapScheduler to miss intervals.
  3. If the policy determines that a snapshot should be deleted, delete unready snapshots (starting with the oldest) before ready ones (see the sketch below). This has the same problem as (2) in being unable to handle intervals that are shorter than the time it takes a snapshot to become ready.
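A minimal sketch of how idea (3) could order the deletion candidates so that ready snapshots are only removed once no unready ones remain. The snapInfo struct and pruneCandidates helper are hypothetical stand-ins for illustration, not snapscheduler's actual types or logic:

    package main

    import (
        "fmt"
        "sort"
        "time"
    )

    // snapInfo is a simplified stand-in for the VolumeSnapshot fields that matter
    // for retention: name, creation time, and the readyToUse status.
    type snapInfo struct {
        Name    string
        Created time.Time
        Ready   bool
    }

    // pruneCandidates returns the snapshots that would be deleted to bring the
    // list down to maxCount, preferring unready snapshots (oldest first) over
    // ready ones (oldest first).
    func pruneCandidates(snaps []snapInfo, maxCount int) []snapInfo {
        if len(snaps) <= maxCount {
            return nil
        }
        ordered := make([]snapInfo, len(snaps))
        copy(ordered, snaps)
        sort.Slice(ordered, func(i, j int) bool {
            // Unready snapshots sort ahead of ready ones as deletion candidates...
            if ordered[i].Ready != ordered[j].Ready {
                return !ordered[i].Ready
            }
            // ...and within each group, older snapshots are deleted first.
            return ordered[i].Created.Before(ordered[j].Created)
        })
        return ordered[:len(snaps)-maxCount]
    }

    func main() {
        now := time.Now()
        // Mirrors the reproduction above: one ready snapshot and three that never
        // became ready after the backing disk was deleted.
        snaps := []snapInfo{
            {Name: "snap-202408301525", Created: now.Add(-15 * time.Minute), Ready: true},
            {Name: "snap-202408301530", Created: now.Add(-10 * time.Minute), Ready: false},
            {Name: "snap-202408301535", Created: now.Add(-5 * time.Minute), Ready: false},
            {Name: "snap-202408301540", Created: now, Ready: false},
        }
        for _, s := range pruneCandidates(snaps, 3) {
            // Only the oldest unready snapshot is chosen; the ready one survives.
            fmt.Println("would delete:", s.Name)
        }
    }

With the data from step 7 above, this ordering would delete the unready 202408301530 snapshot and keep the ready one from 202408301525, though, as noted, it still cannot help when the schedule interval is shorter than the time a snapshot needs to become ready.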