backube / volsync

Asynchronous data replication for Kubernetes volumes
https://volsync.readthedocs.io
GNU Affero General Public License v3.0

Multi-AZ Volume Node Affinity Conflict #1329

Closed: DreamingRaven closed this issue 1 week ago

DreamingRaven commented 1 week ago

Describe the bug

There is a misalignment of the zones in which volumes are provisioned in multi-AZ clusters. This causes VolSync job pods to be unschedulable.

On my non-multi-AZ cluster, VolSync pods are scheduled without incident, since neither volume has an AZ restriction that prevents mounting. However, on my multi-AZ GKE cluster the two volumes for X-backup-cache and X-backup-src cause the X-backup job to stall: the pod cannot be scheduled and fails with 9 node(s) had volume node affinity conflict., because the volumes are in different zones and no node can satisfy the pod's requirements.

Steps to reproduce

Create a multi-AZ cluster in GKE, then create any ReplicationSource resource, e.g.:

apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  labels:
    argocd.argoproj.io/instance: ghost
  name: ghost-backup
  namespace: ghost
spec:
  restic:
    copyMethod: Clone
    pruneIntervalDays: 7
    repository: ghost-backup
    retain:
      daily: 5
      hourly: 6
      monthly: 2
      weekly: 4
      yearly: 1
  sourcePVC: ghost
  trigger:
    schedule: 0 * * * 1

This will then create the owned resources (the X-backup-cache and X-backup-src PVCs and the X-backup job). The pod will likely be unable to mount both volumes, and as such remains permanently unscheduled.

Expected behavior

I would expect both provisioned PVs to be allocated to the same availability zone as the PV being backed up.

apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/provisioned-by: pd.csi.storage.gke.io
    volume.kubernetes.io/provisioner-deletion-secret-name: ""
    volume.kubernetes.io/provisioner-deletion-secret-namespace: ""
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 50Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: ghost
    namespace: ghost
  csi:
    driver: pd.csi.storage.gke.io
    fsType: ext4
    volumeAttributes:
      storage.kubernetes.io/csiProvisionerIdentity: ***-pd.csi.storage.gke.io
    volumeHandle: projects/***/zones/europe-west2-c/disks/pvc-***
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.gke.io/zone
          operator: In
          values:
          - europe-west2-c
  persistentVolumeReclaimPolicy: Delete
  storageClassName: standard-rwo
  volumeMode: Filesystem
status:
  phase: Bound

Actual results

Since GKE assigns zones randomly unless specified (https://cloud.google.com/kubernetes-engine/docs/concepts/persistent-volumes#pd-zones), you will end up with the two VolSync-provisioned volumes landing in different zones.

When inspected, the two PVs created by VolSync look something like this:

apiVersion: v1
items:
- apiVersion: v1
  kind: PersistentVolume
  metadata:
    annotations:
      pv.kubernetes.io/provisioned-by: pd.csi.storage.gke.io
      volume.kubernetes.io/provisioner-deletion-secret-name: ""
      volume.kubernetes.io/provisioner-deletion-secret-namespace: ""
  spec:
    accessModes:
    - ReadWriteOnce
    capacity:
      storage: 50Gi
    claimRef:
      apiVersion: v1
      kind: PersistentVolumeClaim
      name: volsync-ghost-backup-src
      namespace: ghost
    csi:
      driver: pd.csi.storage.gke.io
      fsType: ext4
      volumeAttributes:
        storage.kubernetes.io/csiProvisionerIdentity: ***-pd.csi.storage.gke.io
      volumeHandle: projects/***/zones/europe-west2-c/disks/pvc-*** # <--- CONFLICT
    nodeAffinity:
      required:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.gke.io/zone
            operator: In
            values:
            - europe-west2-c # <--- CONFLICT
    persistentVolumeReclaimPolicy: Delete
    storageClassName: standard-rwo
    volumeMode: Filesystem
  status:
    phase: Bound
- apiVersion: v1
  kind: PersistentVolume
  metadata:
    annotations:
      pv.kubernetes.io/provisioned-by: pd.csi.storage.gke.io
      volume.kubernetes.io/provisioner-deletion-secret-name: ""
      volume.kubernetes.io/provisioner-deletion-secret-namespace: ""
  spec:
    accessModes:
    - ReadWriteOnce
    capacity:
      storage: 1Gi
    claimRef:
      apiVersion: v1
      kind: PersistentVolumeClaim
      name: volsync-ghost-backup-cache
      namespace: ghost
    csi:
      driver: pd.csi.storage.gke.io
      fsType: ext4
      volumeAttributes:
        storage.kubernetes.io/csiProvisionerIdentity: ***-pd.csi.storage.gke.io
      volumeHandle: projects/***/zones/europe-west2-a/disks/pvc-*** # <--- CONFLICT
    nodeAffinity:
      required:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.gke.io/zone
            operator: In
            values:
            - europe-west2-a # <--- CONFLICT
    persistentVolumeReclaimPolicy: Delete
    storageClassName: standard-rwo
    volumeMode: Filesystem
  status:
    phase: Bound
kind: List
metadata:
  resourceVersion: ""

Additional context

I can foresee a few ways to solve this issue, but it is currently unclear to me how one would force the zones to match, unless the use of multiple volumes is removed entirely.

tesshuflower commented 1 week ago

@DreamingRaven, it sounds to me like you're doing the correct thing with WaitForFirstConsumer.

I'm a bit surprised that this doesn't work, as both PVCs should be used for the first time by the pod created by the mover job.

It looks like you're using a copyMethod of Clone and have not specified a cacheStorageClassName so both the clone of your original source PVC and the cache volume should be using the default storageclass, which looks to be standard-rwo.
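As an illustration only (this exact workaround is not proposed in this thread): if one wanted to force the cache volume into a particular zone, a dedicated StorageClass restricted with allowedTopologies could be referenced from spec.restic.cacheStorageClassName. A rough sketch, reusing the zone key and PD CSI parameters that appear in the report above (the class name is hypothetical):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  # hypothetical name; would be referenced via spec.restic.cacheStorageClassName
  name: standard-rwo-europe-west2-c
provisioner: pd.csi.storage.gke.io
parameters:
  type: pd-balanced
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowedTopologies:
# restrict provisioning of volumes from this class to a single zone
- matchLabelExpressions:
  - key: topology.gke.io/zone
    values:
    - europe-west2-c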

One thing I'm not sure of is how the clone works in your environment: does it get provisioned immediately in the same availability zone as the source volume? According to the Google docs link you pointed to, there is this:

However, Pods or Deployments don't inherently recognize the zone of pre-existing persistent disks

This makes me wonder if the clone PVC is the issue here: it may get pre-provisioned even with WaitForFirstConsumer (this is just me guessing, however). Is it possible for you to test by creating a clone PVC yourself?

If this is the case, you could alternatively try a copyMethod of Snapshot to see if it makes a difference. In Snapshot mode, a VolumeSnapshot of your source PVC is taken first, and then a PVC is provisioned from it, which should use WaitForFirstConsumer.
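For reference, a minimal sketch of that change, reusing the ReplicationSource posted in the report with only copyMethod swapped:

apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: ghost-backup
  namespace: ghost
spec:
  sourcePVC: ghost
  restic:
    # Snapshot: take a VolumeSnapshot of the source PVC first,
    # then provision the mover's source PVC from that snapshot
    copyMethod: Snapshot
    repository: ghost-backup
    pruneIntervalDays: 7
    retain:
      daily: 5
      hourly: 6
      monthly: 2
      weekly: 4
      yearly: 1
  trigger:
    schedule: 0 * * * 1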

DreamingRaven commented 1 week ago

With Clone

@tesshuflower I have created 3 clone PVCs. Without a pod to mount them they do indeed wait for first consumer using the default storage class:

allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    components.gke.io/component-name: pdcsi
    components.gke.io/component-version: ***
    components.gke.io/layer: addon
    storageclass.kubernetes.io/is-default-class: "true"
  labels:
    addonmanager.kubernetes.io/mode: EnsureExists
    k8s-app: gcp-compute-persistent-disk-csi-driver
  name: standard-rwo
parameters:
  type: pd-balanced
provisioner: pd.csi.storage.gke.io
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer

NAME                                      STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
data-ghost-mysql-0                        Bound     pvc-e05137fb-2eec-4a05-b2e7-d69dd2219600   50Gi       RWO            standard-rwo   70d
ghost                                     Bound     pvc-175e420f-71f8-4067-92d1-df3cf2f11701   50Gi       RWO            standard-rwo   70d
ghost-clone-a                             Pending                                                                        standard-rwo   75s
ghost-clone-b                             Pending                                                                        standard-rwo   75s
ghost-clone-c                             Pending                                                                        standard-rwo   75s

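For reference, a clone PVC like the ones listed above can be created with a manifest along these lines (a sketch using the names from this report):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ghost-clone-a
  namespace: ghost
spec:
  storageClassName: standard-rwo
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
  dataSource:
    # CSI volume cloning: the new PVC is created from the existing "ghost" PVC
    kind: PersistentVolumeClaim
    name: ghost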

I then bound each volume individually to a different pod to test which availability zone they end up in, expecting them to land in the same zone as the source volume.
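A throwaway consumer pod along these lines is enough to trigger binding (the pod name and image are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: clone-consumer-a # hypothetical
  namespace: ghost
spec:
  containers:
  - name: sleep
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: ghost-clone-a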

This is indeed the case for all of the clones:

    csi:
      driver: pd.csi.storage.gke.io
      fsType: ext4
      volumeAttributes:
        storage.kubernetes.io/csiProvisionerIdentity: ***-pd.csi.storage.gke.io
      volumeHandle: projects/***/zones/europe-west2-c/disks/pvc-***

So the question I find myself asking is: why does the cache volume not also end up in the same zone, if both volumes are being provisioned for the same pod? I tested this by deleting the ReplicationSource and reinstating it, to see the order in which the volumes are provisioned. The cache volume is provisioned almost instantly; I suspect that because it is provisioned first, no consideration is given to the in-progress backup volume, which takes significantly longer to clone.

I will shortly try with snapshots, which I hope will inform the volume placement earlier in the chain!

DreamingRaven commented 1 week ago

With Snapshot

I changed .spec.restic.copyMethod on the ReplicationSource to Snapshot (as @tesshuflower recommended), which provisions the snapshot before the other resources. (I also added the annotation snapshot.storage.kubernetes.io/is-default-class: "true" to the GKE default VolumeSnapshotClass, as per the docs.) This led to a successful initial pod.
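For reference, marking a VolumeSnapshotClass as the default amounts to setting that annotation on it; a sketch of what the annotated class looks like (the name here is hypothetical, in practice the annotation is added to the existing GKE-provided class rather than creating a new one):

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  # hypothetical name; annotate the existing GKE default class instead of creating a new one
  name: pd-snapshot-class
  annotations:
    snapshot.storage.kubernetes.io/is-default-class: "true"
driver: pd.csi.storage.gke.io
deletionPolicy: Delete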

However, at this stage I was concerned that the next backup would fail, since the cache volume now already exists, so I reduced the cron to fire every 10 minutes to confirm. The next tick also completed successfully, although I have yet to verify the backup with a restore, which is the next operation I want to check.

Interestingly, however, the cache volume is still in europe-west2-a. I checked the volumes provisioned after the volume snapshot, and they too end up in europe-west2-a, the same zone as the cache volume. So the data is actually moving zones: it originated in europe-west2-c in the ghost volume, and the snapshot-created X-backup-src PV ends up in europe-west2-a, like so:

apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/provisioned-by: pd.csi.storage.gke.io
    volume.kubernetes.io/provisioner-deletion-secret-name: ""
    volume.kubernetes.io/provisioner-deletion-secret-namespace: ""
  finalizers:
  - kubernetes.io/pv-protection
  - external-attacher/pd-csi-storage-gke-io
  name: pvc-***
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 50Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: volsync-ghost-backup-src
    namespace: ghost
  csi:
    driver: pd.csi.storage.gke.io
    fsType: ext4
    volumeAttributes:
      storage.kubernetes.io/csiProvisionerIdentity: ***-pd.csi.storage.gke.io
    volumeHandle: projects/***/zones/europe-west2-a/disks/pvc-*** # <--- MOVED from europe-west2-c
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.gke.io/zone
          operator: In
          values:
          - europe-west2-a # <--- MOVED from europe-west2-c
  persistentVolumeReclaimPolicy: Delete
  storageClassName: standard-rwo
  volumeMode: Filesystem
status:
  phase: Bound

So this appears to work. I will restore from this data to confirm: because the volume is moving zones, I want to verify that the data inside it has moved too, since this zone migration is surprising behaviour.

DreamingRaven commented 1 week ago

OK, I can confirm that backups work from the zone-migrated volumes, although cloned volumes still do not work due to the node affinity conflict described above. Since this issue was geared towards solving the AZ problem rather than towards cloned volumes specifically, I would say it is resolved.

As an aside, I note that volumes being restored to do not get wiped on restore. This is my current restore configuration, as per the docs:

apiVersion: volsync.backube/v1alpha1
kind: ReplicationDestination
metadata:
  name: ghost-restore
spec:
  trigger:
    manual: restore-once
  restic:
    repository: ghost-backup
    destinationPVC: ghost
    copyMethod: Snapshot

Is there any option in VolSync to wipe the destination volume before restoring, or are there any established setups / patterns for doing so? @tesshuflower thanks for your help, it is much appreciated.

tesshuflower commented 1 week ago

@DreamingRaven thanks for the detailed information, this was an interesting one. Glad to hear that snapshots do seem to work for your use-case.

Right now, to get a fresh volume you would either provision a new, empty PVC yourself rather than re-using the existing one, or use something like the volume populator to get a new PVC.

There's a long discussion here about using the volume populator, in case the use-case mentioned is in any way similar to yours: https://github.com/backube/volsync/issues/627#issuecomment-1663933508
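For reference, a populator-provisioned PVC is an ordinary PVC whose dataSourceRef points at the ReplicationDestination; a rough sketch with an illustrative name, assuming the VolSync volume populator is available on the cluster:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ghost-restored # hypothetical
  namespace: ghost
spec:
  storageClassName: standard-rwo
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
  dataSourceRef:
    # populate the new PVC from the ReplicationDestination defined above
    apiGroup: volsync.backube
    kind: ReplicationDestination
    name: ghost-restore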

If your use-case is really about trying to synchronize data to a PVC on a remote cluster (i.e. a sync operation that you will run repeatedly at the destination), you could potentially look at using the rclone or rsync-tls movers.

DreamingRaven commented 1 week ago

OK, I will have a look. I am creating a staging environment in which I want to allow some drift; after a period of time it should be wiped and reset to the same state as production. Thanks for your time @tesshuflower, it sounds like the volume populator combined with a separate cron-driven deletion is exactly what I need! ArgoCD will then recreate the resource and re-pull the backup, returning the staging environment to a production-like state.