FairwindsOps / gemini

Automated backups of PersistentVolumeClaims in Kubernetes using VolumeSnapshots
https://fairwinds.com
Apache License 2.0

Unable to trigger a PVC swap based on annotation #226

Closed MassimoVlacancich closed 3 months ago

MassimoVlacancich commented 7 months ago

What happened?

After installing Gemini on my cluster and following the provided instructions to back up a volume, I was unable to reinstate an older snapshot.


What did you expect to happen?

We let Gemini create a first snapshot. We then wrote some data by hand at the mount point of the volume being backed up (details below). We let Gemini create a second snapshot. We then ran the commands below to restore to the first snapshot, where the file should not be present.

kubectl scale all --all --replicas=0
kubectl annotate snapshotgroup/dev-postgres-backup --overwrite "gemini.fairwinds.com/restore=1711982369"
kubectl scale all --all --replicas=1

But despite this, when navigating to the mount point within the postgres pod that mounts the volume claim being backed up, we still see the file. In short, the restore doesn't seem to be working as expected; the same applies to data written within the DB, which ends up in the pgdata directory at the mount point.
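For completeness, the same sequence scoped explicitly to the dev namespace and to this Deployment would be something like the following (a sketch; it assumes dev-postgres is the only workload mounting the claim):

kubectl scale deployment/dev-postgres -n dev --replicas=0
kubectl annotate snapshotgroup/dev-postgres-backup -n dev --overwrite "gemini.fairwinds.com/restore=1711982369"
kubectl scale deployment/dev-postgres -n dev --replicas=1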

How can we reproduce this?

We are using k8s 1.25 and installed the latest version of Gemini with v2 CRDs. (FYI, I don't think the CRD for v1beta1 exists at https://raw.githubusercontent.com/FairwindsOps/gemini/main/pkg/types/snapshotgroup/v1beta1/crd-with-beta1.yaml; we instead installed the one at https://raw.githubusercontent.com/FairwindsOps/gemini/main/pkg/types/snapshotgroup/v1/crd-with-beta1.yaml. Note the /v1 vs /v1beta1 path difference.)
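For reference, this is roughly how we installed it; the CRD URL is the one above, and the helm commands are our best recollection of the standard Fairwinds chart install, so treat them as a sketch:

kubectl apply -f https://raw.githubusercontent.com/FairwindsOps/gemini/main/pkg/types/snapshotgroup/v1/crd-with-beta1.yaml
helm repo add fairwinds-stable https://charts.fairwinds.com/stable
helm install gemini fairwinds-stable/gemini --namespace gemini --create-namespace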

We are using a PVC to provide a mount point where our postgres-db can write its data, below is the config (simplified for convenience):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: dev-postgres
  namespace: dev
spec:
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
      - envFrom:
        - configMapRef:
            name: dev-postgres-config
        image: postgres:latest
        imagePullPolicy: IfNotPresent
        name: postgres
        ports:
        - containerPort: 5432
        volumeMounts:
        - mountPath: /var/lib/postgresql/data
          name: postgresdb-data-volume
      hostname: postgres
      volumes:
      - name: postgresdb-data-volume
        persistentVolumeClaim:
          claimName: dev-postgres-claim
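As mentioned under "What did you expect to happen?", we wrote a test file by hand at the mount point between snapshots. A sketch of that step (the file name is illustrative):

kubectl exec -n dev deploy/dev-postgres -- touch /var/lib/postgresql/data/test-file.txt
kubectl exec -n dev deploy/dev-postgres -- ls /var/lib/postgresql/data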

The persistent volume is provisioned dynamically by GCP once the volume claim manifest is applied and the DB mounts it:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  labels:
    app: postgres
  name: dev-postgres-claim
  namespace: dev
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi
  storageClassName: standard-rwo

We have specifically set this to standard-rwo (ReadWriteOnce) to ensure that the data on the volume isn't being modified by multiple nodes at the same time when a snapshot is taken.
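The PV listing below comes from kubectl get pv (no namespace flag needed, as PVs are cluster-scoped):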

NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS     CLAIM                            STORAGECLASS   REASON   AGE
pvc-4a079b5e-91f8-400f-97ca-99ea609b4f4e   2Gi        RWO            Delete           Bound      dev/dev-postgres-claim           standard-rwo            14d

Moreover, we defined our snapshot class as follows before adding the SnapshotGroup config:

apiVersion: snapshot.storage.k8s.io/v1
deletionPolicy: Delete
driver: pd.csi.storage.gke.io
kind: VolumeSnapshotClass
metadata:
  annotations:
    snapshot.storage.kubernetes.io/is-default-class: "true"
  name: dev-gcp-csi-snapshotclass
  namespace: dev
---
apiVersion: gemini.fairwinds.com/v1
kind: SnapshotGroup
metadata:
  name: dev-postgres-backup
  namespace: dev
spec:
  persistentVolumeClaim:
    claimName: dev-postgres-claim
  schedule:
  - every: 5 minutes
    keep: 3
  template:
    spec:
      volumeSnapshotClassName: dev-gcp-csi-snapshotclass

The above works as expected once in place, and we see the snapshots being created and in a Ready state.
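The listing below is from kubectl get volumesnapshot -n dev: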

NAME                             READYTOUSE   SOURCEPVC            SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS               SNAPSHOTCONTENT                                    CREATIONTIME   AGE
dev-postgres-backup-1711982369   true         dev-postgres-claim                           2Gi           dev-gcp-csi-snapshotclass   snapcontent-b857786d-aa1c-4e82-9b2b-50f801c8c6ef   17m            17m
dev-postgres-backup-1711982669   true         dev-postgres-claim                           2Gi           dev-gcp-csi-snapshotclass   snapcontent-16337f17-f2bc-41a8-b630-581b5f06f888   12m            12m
dev-postgres-backup-1711982969   true         dev-postgres-claim                           2Gi           dev-gcp-csi-snapshotclass   snapcontent-2c126506-c999-4dba-930c-009034405a4d   7m33s          7m37s
dev-postgres-backup-1711983269   true         dev-postgres-claim                           2Gi           dev-gcp-csi-snapshotclass   snapcontent-b656ab5a-ea09-4ba3-ad00-4dc6bb4e110f   2m33s          2m37s

As detailed above, we then wrote some data by hand (added a file in the mount location /var/lib/postgresql/data) between snapshots 1 and 2 above; 1711982369 does not have the file, while 1711982669 does.
We then ran the following commands to restore the first snapshot, which should not contain the file:

kubectl scale all --all --replicas=0
kubectl annotate snapshotgroup/dev-postgres-backup --overwrite "gemini.fairwinds.com/restore=1711982369"
kubectl scale all --all --replicas=1

But despite this, when navigating to /var/lib/postgresql/data within the postgres pod that mounts the volume claim being backed up, we still see the file. What I also find interesting is that the PVC still shows its age as 14d; I would expect it to be a brand-new PVC re-created from the snapshot.
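For what it's worth, here is how one could verify whether the swap happened; these are generic kubectl checks (a sketch, output formats will vary):

# Check whether the bound PV and creation time changed after the restore
kubectl get pvc dev-postgres-claim -n dev -o yaml | grep -E 'creationTimestamp|volumeName'
# Look for restore-related status or events on the SnapshotGroup and in the namespace
kubectl describe snapshotgroup dev-postgres-backup -n dev
kubectl get events -n dev --sort-by=.lastTimestamp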

When investigating the logs of the gemini-controller pod, I don't see any specific errors after the restart post-annotation, nor anything that points to the swap being successful:

I0401 14:25:59.640654       1 controller.go:179] Starting SnapshotGroup controller
I0401 14:25:59.640679       1 controller.go:181] Waiting for informer caches to sync
I0401 14:25:59.641048       1 reflector.go:287] Starting reflector *v1.SnapshotGroup (30s) from pkg/mod/k8s.io/client-go@v0.27.1/tools/cache/reflector.go:231
I0401 14:25:59.641071       1 reflector.go:323] Listing and watching *v1.SnapshotGroup from pkg/mod/k8s.io/client-go@v0.27.1/tools/cache/reflector.go:231
I0401 14:25:59.740942       1 shared_informer.go:341] caches populated
I0401 14:25:59.741060       1 controller.go:186] Starting workers
I0401 14:25:59.741136       1 controller.go:191] Started workers
I0401 14:25:59.741204       1 groups.go:38] dev/dev-postgres-backup: reconciling
I0401 14:25:59.750629       1 pvc.go:48] dev/dev-postgres-claim: PVC found
I0401 14:25:59.750661       1 groups.go:29] dev/dev-postgres-backup: updating PVC spec
W0401 14:25:59.761896       1 warnings.go:70] unknown field "spec.persistentVolumeClaim.spec.volumeMode"
W0401 14:25:59.761922       1 warnings.go:70] unknown field "spec.template.spec.source"
W0401 14:25:59.761928       1 warnings.go:70] unknown field "status"
I0401 14:25:59.770677       1 groups.go:53] dev/dev-postgres-backup: found 2 existing snapshots
I0401 14:25:59.770713       1 scheduler.go:58] Checking snapshot dev/dev-postgres-backup-1711981440
I0401 14:25:59.770723       1 scheduler.go:58] Checking snapshot dev/dev-postgres-backup-1711981140
I0401 14:25:59.770731       1 scheduler.go:91] need creation 5 minutes false
I0401 14:25:59.770739       1 groups.go:59] dev/dev-postgres-backup: going to create 0, delete 0 snapshots
I0401 14:25:59.770746       1 snapshots.go:204] Deleting 0 expired snapshots
I0401 14:25:59.770755       1 groups.go:65] dev/dev-postgres-backup: deleted 0 snapshots
I0401 14:25:59.770762       1 groups.go:71] dev/dev-postgres-backup: created 0 snapshots
I0401 14:25:59.770790       1 controller.go:144] dev/dev-postgres-backup: successfully performed backup
I0401 14:26:29.648307       1 reflector.go:376] pkg/mod/k8s.io/client-go@v0.27.1/tools/cache/reflector.go:231: forcing resync
I0401 14:26:29.648435       1 groups.go:38] dev/dev-postgres-backup: reconciling
I0401 14:26:29.656518       1 pvc.go:48] dev/dev-postgres-claim: PVC found
I0401 14:26:29.656643       1 groups.go:29] dev/dev-postgres-backup: updating PVC spec
W0401 14:26:29.665339       1 warnings.go:70] unknown field "spec.persistentVolumeClaim.spec.volumeMode"
W0401 14:26:29.665368       1 warnings.go:70] unknown field "spec.template.spec.source"
W0401 14:26:29.665373       1 warnings.go:70] unknown field "status"
I0401 14:26:29.673816       1 groups.go:53] dev/dev-postgres-backup: found 2 existing snapshots
I0401 14:26:29.673851       1 scheduler.go:58] Checking snapshot dev/dev-postgres-backup-1711981440
I0401 14:26:29.673861       1 scheduler.go:58] Checking snapshot dev/dev-postgres-backup-1711981140
I0401 14:26:29.673869       1 scheduler.go:91] need creation 5 minutes false
I0401 14:26:29.673876       1 groups.go:59] dev/dev-postgres-backup: going to create 0, delete 0 snapshots
I0401 14:26:29.673884       1 snapshots.go:204] Deleting 0 expired snapshots
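One additional check, in case it helps whoever picks this up: confirming the restore annotation actually landed on the SnapshotGroup (a sketch; output formatting may differ):

kubectl get snapshotgroup dev-postgres-backup -n dev -o jsonpath='{.metadata.annotations}'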

We would appreciate your input in resolving this; maybe it has to do with our cluster setup, or a config issue with PVs. I've requested access to the Slack channel and am waiting on approval :)

Thanks in advance, Massimo

Version

Version 2.0 - Kubernetes 1.25

Additional context

No response

MassimoVlacancich commented 6 months ago

Hi team, could I seek your help on the above please? Happy to provide more details if required :)

MassimoVlacancich commented 6 months ago

Hi all, just chasing again, we are keen to rely on Gemini :)

MassimoVlacancich commented 5 months ago

Hi all, chasing again, would appreciate some help on this one :)