NetApp / trident

Storage orchestrator for containers
Apache License 2.0

ONTAP ephemeral snapshot remains in backend when doing PVC->PVC CSI clone #901

Open akalenyu opened 2 months ago

akalenyu commented 2 months ago

Describe the bug A CSI clone works by creating an ephemeral snapshot behind the scenes. This snapshot is not cleaned up afterwards, so repeated cloning can eventually hit the snapshot limit for FSx (1023 per volume): https://docs.aws.amazon.com/fsx/latest/ONTAPGuide/snapshots-ontap.html
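
For illustration, a PVC->PVC CSI clone of the kind described above is requested with a manifest roughly like the following. This is only a sketch: the PVC names, storage class, access mode and size are placeholders and were not part of the original report.

```yaml
# Sketch of a PVC->PVC CSI clone request (all names are hypothetical).
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: clone-of-golden          # placeholder name for the new PVC
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: sc-nas-svm1  # placeholder; must be the source PVC's storage class
  resources:
    requests:
      storage: 1Gi               # must be at least the size of the source PVC
  dataSource:
    kind: PersistentVolumeClaim  # cloning directly from a PVC, not from a VolumeSnapshot
    name: golden-pvc             # placeholder name of the source PVC
```

Each such request is what leaves one additional snapshot behind on the source volume.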

Environment Provide accurate information about the environment to help us reproduce the issue.

To Reproduce

Expected behavior Ephemeral snapshot removed

Additional context Snapshot remains:

FsxIdxxxxxxx::> vol snapshot show
                                                                 ---Blocks---
Vserver  Volume   Snapshot                                  Size Total% Used%
-------- -------- ------------------------------------- -------- ------ -----
svm1     svm1_root
                  hourly...
         trident_pvc_3df42396_be95_4f94_a7f9_f97bb39c5659
                  20240418T141852Z                         176KB     0%   36%
...

akalenyu commented 2 months ago

/cc @aglitke @clintonk This becomes an issue with OpenShift Virt, where we use the concept of golden images: most VMs are created from a single golden OS flavor image. It could also be a problem when mass cloning VM disks from a preset VM.

wonderland commented 2 months ago

This is in part related to how FSxN (or ONTAP in general) works. By default, the source volume and its clone stay connected, which makes the clone faster and more space-efficient. So while both PVCs are independent objects in K8s, they are not necessarily independent on the storage side. However, Trident gives you control over this.

I see two main options to deal with this:

Create a "golden snapshot"

With this approach, you do not rely on the golden image alone, which is still a read-write volume and could therefore change without you knowing. You also create a "golden snapshot" from that golden image PVC. That freezes the current PVC state, making it immutable, so you always clone off the exact same state. In addition, you have exactly one snapshot and clone off the snapshot rather than the mutable PVC (i.e. Trident won't create an additional snapshot in this case). You can add more snapshots in the future and use them for "versioning" the golden image: make the necessary changes to the PVC, then snapshot again. For each clone you can then use either the old snapshot, representing the previous golden image state, or the new snapshot, representing the modified state. This is therefore my preferred approach.

You create a golden snap like this:

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot 
metadata: 
  name: golden-snap1
spec: 
  volumeSnapshotClassName: ontap-snaps 
  source: 
    persistentVolumeClaimName: simple-pvc
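
The manifest above refers to a VolumeSnapshotClass named ontap-snaps that is not shown in this thread. A minimal sketch of what such a class might look like, assuming Trident's CSI driver name (csi.trident.netapp.io) and a Delete deletion policy (an assumption; choose the policy that fits your retention needs):

```yaml
# Sketch of the snapshot class referenced above; not taken from the original comment.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: ontap-snaps              # must match volumeSnapshotClassName in the VolumeSnapshot
driver: csi.trident.netapp.io    # Trident's CSI driver name
deletionPolicy: Delete           # assumption: delete the backend snapshot with the VolumeSnapshot
```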

Then create a clone from that golden snapshot like this:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: clone
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: sc-nas-svm1
  resources:
    requests:
      storage: 1Gi
  dataSource:
    kind: VolumeSnapshot
    name: golden-snap1
    apiGroup: snapshot.storage.k8s.io

Instruct Trident to "split" the clone

Trident will keep the relationship between source and clone intact by default, resulting in the extra snapshot you noted. However, you can change this behavior with either an annotation on the source PVC or by setting the respective option in your Trident backend configuration. When setting splitOnClone to true, Trident will fully decouple source and clone volume, leaving no extra snapshot behind. As an example, setting this annotation on your golden PVC would look like this:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test3
  annotations:
    trident.netapp.io/splitOnClone: "true"
spec:
  storageClassName: sc-nas-svm1
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
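
The backend-configuration variant mentioned above could look roughly like this. It is a sketch only, assuming the ontap-nas driver and that splitOnClone is set under the backend's defaults section; the backend name, namespace, management LIF and secret name are placeholders, so check the option against the documentation for your Trident version.

```yaml
# Sketch of a Trident backend with clone splitting enabled by default (names are hypothetical).
apiVersion: trident.netapp.io/v1
kind: TridentBackendConfig
metadata:
  name: backend-fsx-ontap-nas            # placeholder backend name
  namespace: trident                     # assumption: Trident is installed in this namespace
spec:
  version: 1
  storageDriverName: ontap-nas
  managementLIF: MANAGEMENT_LIF_ADDRESS  # placeholder; use your file system's management endpoint
  svm: svm1
  credentials:
    name: backend-fsx-secret             # placeholder secret holding the SVM credentials
  defaults:
    splitOnClone: "true"                 # split every clone from its parent on creation
```

With the backend default in place, no per-PVC annotation is needed, at the cost of every clone on that backend being split.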

aglitke commented 2 months ago

@wonderland Thanks for providing these details. The PR that @akalenyu referenced here changes CDI behavior for Trident-serviced storage classes to do just as you suggest. We agree that an immutable golden image is desirable on its own merits. Without your Trident-specific annotation on the cloned PVC, will we still encounter problems if the golden image snapshot is deleted? The golden images are updated periodically by an automated process, and we garbage-collect older versions to better manage our storage usage. In this case there could still be VMs that were cloned from a golden image snapshot that is a garbage collection candidate.

We would like to avoid using vendor-specific annotations on the PVCs we create on behalf of the user.