Open dhess opened 9 months ago
@dhess This is an interesting one. I'm not sure the workaround used for Direct
mode will work, as it relies on finding another active pod that's currently using the PVC and then scheduling on the same node.
In this case (if I understand correctly), a new PVC from snapshot is created, and the VolSync mover pod should then be the first consumer of this PVC. Normally I would have thought the pod should get scheduled automatically in the correct place, but maybe something else is going on.
Does ZFS-LocalPV use the csi topology feature? https://kubernetes-csi.github.io/docs/topology.html
One more question: When you create your original sourcePVC and then run your application pod, do you also need to manually configure that pod to run on a particular node that corresponds to where the PVC was provisioned?
Hi @tesshuflower, thanks for the quick response.
> Does ZFS-LocalPV use the csi topology feature? https://kubernetes-csi.github.io/docs/topology.html
I'm not familiar with CSI Topology, but from what I can tell, it seems it does:
I'm guessing this manifest for the openebs-zfs-localpv-controller also demonstrates that it's using CSI topology:
```yaml
- args:
  - --csi-address=$(ADDRESS)
  - --v=5
  - --feature-gates=Topology=true
  - --strict-topology
  - --leader-election
  - --enable-capacity=true
  - --extra-create-metadata=true
  - --default-fstype=ext4
  env:
  - name: ADDRESS
    value: /var/lib/csi/sockets/pluginproxy/csi.sock
  - name: NAMESPACE
    valueFrom:
      fieldRef:
        apiVersion: v1
        fieldPath: metadata.namespace
  - name: POD_NAME
    valueFrom:
      fieldRef:
        apiVersion: v1
        fieldPath: metadata.name
  image: registry.k8s.io/sig-storage/csi-provisioner:v3.5.0
  imagePullPolicy: IfNotPresent
  name: csi-provisioner
  resources: {}
  terminationMessagePath: /dev/termination-log
  terminationMessagePolicy: File
  volumeMounts:
  - mountPath: /var/lib/csi/sockets/pluginproxy/
    name: socket-dir
```
Are there any particular topology keys I should use for compatibility with VolSync? Is the ZFS-LocalPV Helm chart's default `"All"` value a valid key?
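For reference, topology constraints on a StorageClass usually look like the sketch below. This is a hypothetical example, not our actual config: the provisioner name, pool parameter, topology key (`openebs.io/nodename`), and node names are all assumptions to illustrate where a topology key would appear.

```yaml
# Hypothetical sketch: a ZFS-LocalPV StorageClass restricted to specific
# nodes via allowedTopologies. The topology key and node names are
# placeholders; check what your zfs-localpv install actually registers.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: zfspv-pool-0
provisioner: zfs.csi.openebs.io
parameters:
  poolname: zfspv-pool-0
volumeBindingMode: WaitForFirstConsumer
allowedTopologies:
- matchLabelExpressions:
  - key: openebs.io/nodename
    values:
    - node-a
    - node-b
```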
> One more question: When you create your original sourcePVC and then run your application pod, do you also need to manually configure that pod to run on a particular node that corresponds to where the PVC was provisioned?
I think you're referring to statically provisioned PVCs here? If so, I'm not using those, so I'm not sure. All of the PVCs I'm trying to use as source PVCs for VolSync are dynamically provisioned as part of a StatefulSet
or similar, and therefore Kubernetes creates the PVC on the same node where its pod will run.
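The dynamic-provisioning pattern I'm describing looks roughly like this; all names and sizes here are illustrative, not our real workload:

```yaml
# Illustrative only: a StatefulSet whose volumeClaimTemplates dynamically
# provision one ZFS-LocalPV PVC per replica. With WaitForFirstConsumer,
# each PVC is bound on whichever node its pod is scheduled to.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  serviceName: db
  replicas: 1
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
    spec:
      containers:
      - name: db
        image: postgres:16
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: zfspv-pool-0
      resources:
        requests:
          storage: 10Gi
```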
@dhess there's nothing specific in VolSync that you should need to do to ensure compatibility. I guess normally I'd expect that the first consumer (the volsync mover pod in this case) of a PVC should get automatically scheduled on a node where that pvc is accessible. It sounds like this is happening with your statefulset for example.
Maybe you could try something to help me understand - if you create a VolumeSnapshot for one of your source PVCs and then create a PVC from this snapshot (or do a clone instead of volumesnapshot+PVC if you're using copyMethod `Clone`) - can you then create a Job or Deployment that mounts this PVC without specifically needing to set affinity to schedule it on a particular node?
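A sketch of that experiment might look like the following; every name here (`source-pvc`, `zfspv-snapclass`, storage class, sizes) is a placeholder for your own objects:

```yaml
# Snapshot a source PVC, restore it into a new PVC, then mount that PVC
# from a Job with no node affinity, to see where the scheduler places it.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: source-pvc-snap
spec:
  volumeSnapshotClassName: zfspv-snapclass
  source:
    persistentVolumeClaimName: source-pvc
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-from-snap
spec:
  storageClassName: zfspv-pool-0
  dataSource:
    name: source-pvc-snap
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi
---
apiVersion: batch/v1
kind: Job
metadata:
  name: mount-test
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: probe
        image: busybox
        command: ["sh", "-c", "ls /data"]
        volumeMounts:
        - name: data
          mountPath: /data
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: pvc-from-snap
```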
Ahh, I see what you mean now. I'll try an experiment and get back to you.
👋 this issue also happens with CSI democratic-csi local-hostpath using the volsync volumepopulator.
https://github.com/democratic-csi/democratic-csi/issues/329
Seems to be a time-based race condition.
@danielsand I don't think this issue was specifically about the volumepopulator - would you be able to explain the scenario where you're hitting the issue?
So since I originally posted this issue, VolSync snapshots with ZFS-LocalPV have been working pretty reliably. However, we just ran into the issue (or at least a similar one) again, and I think it's possible that I misdiagnosed the original problem.
This time what happened is: while the `Clone` PVC was correctly created on the same node as the source ZFS-LocalPV PVC, the cache PVC was not; it was being created on one of the new worker nodes. Since ZFS-LocalPV volumes can't be mounted across the network, the ReplicationSource job was getting stuck on the remote ZFS-LocalPV cache PVC. The ReplicationSource job originally looked like this:
```yaml
---
apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: db-primer-service-0
spec:
  sourcePVC: db-primer-service-0
  trigger:
    # 1 backup per hour
    schedule: "30 * * * *"
  restic:
    cacheStorageClassName: zfspv-pool-0
    copyMethod: Clone
    pruneIntervalDays: 7
    repository: restic-config-db-primer-service-0
    retain:
      hourly: 24
      daily: 7
      weekly: 1
    volumeSnapshotClassName: zfspv-snapclass
```
where `zfspv-pool-0` is the same ZFS-LocalPV storage class as the source volume.
In the last few months we've also added support for Mayastor to our cluster, and those PVCs are not tied to a particular node, so when I changed the cache storage class to Mayastor, the backup job ran and completed successfully:
```yaml
---
apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: db-primer-service-0
spec:
  sourcePVC: db-primer-service-0
  trigger:
    # 1 backup per hour
    schedule: "30 * * * *"
  restic:
    cacheStorageClassName: mayastor-pool-0-repl-1
    copyMethod: Clone
    pruneIntervalDays: 7
    repository: restic-config-db-primer-service-0
    retain:
      hourly: 24
      daily: 7
      weekly: 1
    volumeSnapshotClassName: zfspv-snapclass
```
So I think that the problem here isn't with the source volume, but with the cache volume. I suspect that in order to reliably use a local PV storage class for cache volumes, there'll need to be some way to specify the topology of that volume.
What's still puzzling is that all of our other `cacheStorageClassName` values also specify a ZFS-LocalPV storage class, and this is the first stuck job I've seen in a while. Why this suddenly popped up again after adding some new nodes is curious. Maybe the scheduler is trying to balance the number of PVCs across the new nodes?
@dhess is your storageclass using a VolumeBindingMode of `WaitForFirstConsumer`? VolSync doesn't create the cache PVC until just before creating the job, so normally I think it should be figured out in the scheduling - unless you're using a VolumeBindingMode of `Immediate`, in which case the PVC could be bound to a node that isn't the same one as your PVC from snap.
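To make the distinction concrete, the binding mode lives on the StorageClass itself. A minimal sketch follows; the provisioner name and pool parameter are assumptions for illustration:

```yaml
# With WaitForFirstConsumer, PVC binding is delayed until a pod using the
# claim is scheduled, so the volume lands on that pod's node. With
# Immediate, the provisioner picks a node before any pod exists, which can
# strand a node-local volume away from its eventual consumer.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: zfspv-pool-0
provisioner: zfs.csi.openebs.io   # assumption: ZFS-LocalPV's CSI driver name
parameters:
  poolname: zfspv-pool-0
volumeBindingMode: WaitForFirstConsumer   # not Immediate
```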
> 👋 this issue also happens with CSI democratic-csi local-hostpath using the volsync volumepopulator. democratic-csi/democratic-csi#329 seems to be a time-based race condition.

> @danielsand I don't think this issue was specifically about the volumepopulator - would you be able to explain the scenario where you're hitting the issue?

The linked issue wasn't about the volumepopulator; democratic-csi local-hostpath + volume snapshots + volsync didn't work for some folks. It's just a reference to what's currently running on my end and what is working (CSI and volume snapshots work as they should).
The VolumePopulator is currently failing at random on my setup: the wrong node gets picked by the volume populator even though `WaitForFirstConsumer` is specified.
Will circle back when I push the topic again.
@danielsand I've created a separate issue https://github.com/backube/volsync/issues/1255 to track this. I believe both issues are about storage drivers that create volumesnapshots/pvcs that are constrained to specific nodes, but I think your issue is related to using the volumepopulator, and this one is not.
Hi, thanks for this great project! We just started using it with our Rook/Ceph volumes, and it's working great.
It doesn't work so well with OpenEBS ZFS LocalPV (ZFS-LocalPV) volumes, however. ZFS-LocalPV has first-class support for CSI snapshotting and cloning, but VolSync can't figure out that the ZFS-LocalPV snapshot of a PVC mounted on, e.g., `node-a`, can also only be consumed from `node-a`. `copyMethod: Direct` doesn't help here for in-use volumes, because they can't be remounted. (Actually, I seem to recall that ZFS-LocalPV does support simultaneous pod mounts with a bit of extra configuration, but I'd prefer to use snapshots for proper PiT backups, anyway.)

Would it be difficult to add first-class support to VolSync for node-local provisioners with snapshotting support, like ZFS-LocalPV? Unless I'm missing something, it seems like it should be possible: since `copyMethod: Direct` can determine which node a PVC is mounted on and ensure the sync is performed from that node, then naïvely, it seems that an additional configuration option could be added to tell VolSync to mount a snapshot and run the sync operation on the same node where the source PVC is mounted.
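For context on why the snapshot is node-bound in the first place: node-local provisioners pin each PersistentVolume to its node via `spec.nodeAffinity`, so any consumer of a restored snapshot inherits the same constraint. An illustrative (not verbatim) PV fragment, with example key and node name:

```yaml
# Illustrative fragment of a PV created by a node-local CSI provisioner.
# The nodeAffinity below (topology key and node name are examples) is what
# forces every consumer, including a VolSync mover pod, onto node-a.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pvc-0123
spec:
  capacity:
    storage: 10Gi
  accessModes: ["ReadWriteOnce"]
  csi:
    driver: zfs.csi.openebs.io
    volumeHandle: pvc-0123
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: openebs.io/nodename
          operator: In
          values:
          - node-a
```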