Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

[BUG] VolumeSnapshot takes very long to become `readyToUse` #4555

Open ricoberger opened 1 week ago

ricoberger commented 1 week ago

Describe the bug

When we create a new VolumeSnapshot in our clusters, it takes at least 8 minutes before the VolumeSnapshot's `readyToUse` field is set to `true`. When we check the snapshot in the Azure portal it shows up immediately, so it looks like only the update of the Kubernetes resource takes this long.

This causes problems with Velero (our backup solution), because it has to wait until the VolumeSnapshot becomes `readyToUse` before it proceeds with the next backup. With a lot of volumes, the backups then fail.

We have been seeing this issue since the beginning of September.
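
To cross-check the Azure-side snapshot that the portal shows, a small sketch like the following can resolve a VolumeSnapshot to its Azure resource ID and query its state. It assumes `kubectl` and `az` access, uses the `test-snapshot` example from the reproduction steps below, and `completionPercent` may not be reported for every snapshot type:

```bash
# Resolve a VolumeSnapshot to the backing Azure snapshot and check its state.
# "test-snapshot" is the example snapshot from the reproduction steps below.
CONTENT=$(kubectl get volumesnapshot test-snapshot \
  -o jsonpath='{.status.boundVolumeSnapshotContentName}')
HANDLE=$(kubectl get volumesnapshotcontent "$CONTENT" \
  -o jsonpath='{.status.snapshotHandle}')  # full Azure resource ID of the snapshot
az snapshot show --ids "$HANDLE" \
  --query '{provisioningState: provisioningState, completionPercent: completionPercent}'
```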

To Reproduce

  1. Create a new VolumeSnapshot:

```yaml
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: test-snapshot
spec:
  volumeSnapshotClassName: csi-azuredisk-vsc
  source:
    persistentVolumeClaimName: pgdata-alarm-postgres-db-0
```

  2. Watch the created VolumeSnapshot: `k get volumesnapshots.snapshot.storage.k8s.io -w`

```
NAME            READYTOUSE   SOURCEPVC                    SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS       SNAPSHOTCONTENT   CREATIONTIME   AGE
test-snapshot                pgdata-alarm-postgres-db-0                                         csi-azuredisk-vsc                                    0s
test-snapshot                pgdata-alarm-postgres-db-0                                         csi-azuredisk-vsc                                    0s
test-snapshot   false        pgdata-alarm-postgres-db-0                                         csi-azuredisk-vsc   snapcontent-ae19d3d9-aafd-42d8-bee6-6159fc2648cb                  0s
test-snapshot   true         pgdata-alarm-postgres-db-0                           1Gi           csi-azuredisk-vsc   snapcontent-ae19d3d9-aafd-42d8-bee6-6159fc2648cb   8m23s          8m24s
```

  3. Watch the created VolumeSnapshotContent: `k get volumesnapshotcontents.snapshot.storage.k8s.io snapcontent-ae19d3d9-aafd-42d8-bee6-6159fc2648cb -w`

```
NAME                                               READYTOUSE   RESTORESIZE   DELETIONPOLICY   DRIVER               VOLUMESNAPSHOTCLASS   VOLUMESNAPSHOT   VOLUMESNAPSHOTNAMESPACE   AGE
snapcontent-ae19d3d9-aafd-42d8-bee6-6159fc2648cb                              Retain           disk.csi.azure.com   csi-azuredisk-vsc     test-snapshot    alarm                     29s
snapcontent-ae19d3d9-aafd-42d8-bee6-6159fc2648cb   true         1073741824    Retain           disk.csi.azure.com   csi-azuredisk-vsc     test-snapshot    alarm                     8m24s
snapcontent-ae19d3d9-aafd-42d8-bee6-6159fc2648cb   true         1073741824    Retain           disk.csi.azure.com   csi-azuredisk-vsc     test-snapshot    alarm                     8m24s
```

  4. The VolumeSnapshot and VolumeSnapshotContent are only marked `readyToUse` after more than 8 minutes.
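
To put a number on the delay, the wait that Velero performs can be reproduced with `kubectl wait`; a minimal sketch, assuming the `test-snapshot` from step 1 in the current namespace and a kubectl version that supports JSONPath waits (1.23+):

```bash
# Time how long it takes the VolumeSnapshot to report readyToUse=true,
# which is essentially what Velero waits for before continuing.
time kubectl wait volumesnapshot/test-snapshot \
  --for=jsonpath='{.status.readyToUse}'=true --timeout=20m
```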

Expected behavior

The VolumeSnapshot is marked `readyToUse` much faster, so that Velero can complete the backups in time.

Environment (please complete the following information):

ricoberger commented 1 week ago

Maybe it really does take that long to create the snapshot. In the logs of the csi-azuredisk-controller we can see that it also takes a long time until the completionPercent field reaches 100% for a volume of only 1 GB:

```
I0920 12:56:04.909481       1 azuredisk.go:496] snapshot(snapshot-70caa786-efed-4f49-8737-95620c584619) under rg(prod-us1-nodepool) completionPercent: 0.000000
I0920 12:56:09.873960       1 azuredisk.go:496] snapshot(snapshot-70caa786-efed-4f49-8737-95620c584619) under rg(prod-us1-nodepool) completionPercent: 0.000000
I0920 12:56:14.882194       1 azuredisk.go:496] snapshot(snapshot-70caa786-efed-4f49-8737-95620c584619) under rg(prod-us1-nodepool) completionPercent: 0.000000
I0920 12:56:19.881564       1 azuredisk.go:496] snapshot(snapshot-70caa786-efed-4f49-8737-95620c584619) under rg(prod-us1-nodepool) completionPercent: 0.000000
I0920 12:56:24.900656       1 azuredisk.go:496] snapshot(snapshot-70caa786-efed-4f49-8737-95620c584619) under rg(prod-us1-nodepool) completionPercent: 0.000000
I0920 12:56:29.914746       1 azuredisk.go:496] snapshot(snapshot-70caa786-efed-4f49-8737-95620c584619) under rg(prod-us1-nodepool) completionPercent: 0.000000
I0920 12:56:34.881957       1 azuredisk.go:496] snapshot(snapshot-70caa786-efed-4f49-8737-95620c584619) under rg(prod-us1-nodepool) completionPercent: 0.000000
I0920 12:56:39.897430       1 azuredisk.go:496] snapshot(snapshot-70caa786-efed-4f49-8737-95620c584619) under rg(prod-us1-nodepool) completionPercent: 0.000000
I0920 12:56:44.895578       1 azuredisk.go:496] snapshot(snapshot-70caa786-efed-4f49-8737-95620c584619) under rg(prod-us1-nodepool) completionPercent: 0.000000
I0920 12:56:49.870110       1 azuredisk.go:496] snapshot(snapshot-70caa786-efed-4f49-8737-95620c584619) under rg(prod-us1-nodepool) completionPercent: 0.000000
I0920 12:56:54.882084       1 azuredisk.go:496] snapshot(snapshot-70caa786-efed-4f49-8737-95620c584619) under rg(prod-us1-nodepool) completionPercent: 0.000000
I0920 12:56:59.918468       1 azuredisk.go:496] snapshot(snapshot-70caa786-efed-4f49-8737-95620c584619) under rg(prod-us1-nodepool) completionPercent: 0.000000
I0920 13:02:04.910054       1 azuredisk.go:496] snapshot(snapshot-70caa786-efed-4f49-8737-95620c584619) under rg(prod-us1-nodepool) completionPercent: 0.000000
I0920 13:02:09.867724       1 azuredisk.go:496] snapshot(snapshot-70caa786-efed-4f49-8737-95620c584619) under rg(prod-us1-nodepool) completionPercent: 0.000000
I0920 13:02:14.937073       1 azuredisk.go:496] snapshot(snapshot-70caa786-efed-4f49-8737-95620c584619) under rg(prod-us1-nodepool) completionPercent: 0.000000
I0920 13:02:19.865503       1 azuredisk.go:493] snapshot(snapshot-70caa786-efed-4f49-8737-95620c584619) under rg(prod-us1-nodepool) complete
I0920 13:02:19.865534       1 controllerserver.go:1003] create snapshot(snapshot-70caa786-efed-4f49-8737-95620c584619) under rg(prod-us1-nodepool) region(eastus) successfully
I0920 13:02:19.920424       1 azure_metrics.go:115] "Observed Request Latency" latency_seconds=477.629872031 request="azuredisk_csi_driver_controller_create_snapshot" resource_group="prod-us1-nodepool" subscription_id="<SUBSCRIPTION-ID>" source="disk.csi.azure.com" source_resource_id="/subscriptions/<SUBSCRIPTION-ID>/resourceGroups/prod-us1-nodepool/providers/Microsoft.Compute/disks/pvc-6b10115b-4b15-4ec7-b44f-a1522e72947b" snapshot_name="snapshot-70caa786-efed-4f49-8737-95620c584619" result_code="succeeded"
I0920 13:02:19.920471       1 utils.go:84] GRPC response: {"snapshot":{"creation_time":{"nanos":573020200,"seconds":1726836862},"ready_to_use":true,"size_bytes":1073741824,"snapshot_id":"/subscriptions/<SUBSCRIPTION-ID>/resourceGroups/prod-us1-nodepool/providers/Microsoft.Compute/snapshots/snapshot-70caa786-efed-4f49-8737-95620c584619","source_volume_id":"/subscriptions/<SUBSCRIPTION-ID>/resourceGroups/prod-us1-nodepool/providers/Microsoft.Compute/disks/pvc-6b10115b-4b15-4ec7-b44f-a1522e72947b"}}
```
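
For comparison, the same completionPercent progression can be observed directly against the Azure API. A rough sketch with the az CLI, using the snapshot name and resource group from the log above (it assumes az is logged in to the right subscription and that completionPercent is reported for this snapshot type):

```bash
# Poll the Azure snapshot roughly the way the driver does above
# (every 5 seconds until completionPercent reaches 100).
RG=prod-us1-nodepool
SNAP=snapshot-70caa786-efed-4f49-8737-95620c584619
while :; do
  PCT=$(az snapshot show -g "$RG" -n "$SNAP" --query completionPercent -o tsv)
  echo "$(date -u +%H:%M:%S) completionPercent: ${PCT:-<none>}"
  case "$PCT" in 100*) break ;; esac
  sleep 5
done
```
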
ricoberger commented 1 week ago

The problem seems to be related to PremiumV2_LRS disks. When we test this with Premium_LRS or StandardSSD_LRS disks we don't see the problem. With these disks the snapshot takes roughly 3 seconds, no matter how large the disk is:

```
test-snapshot                                      redis-data-redis-node-0                                         csi-azuredisk-vsc                                                                     0s
test-snapshot                                      redis-data-redis-node-0                                         csi-azuredisk-vsc                                                                     0s
test-snapshot                         false        redis-data-redis-node-0                                         csi-azuredisk-vsc   snapcontent-f5be1583-7bc3-48c9-b089-1696116840fa                  0s
test-snapshot                         true         redis-data-redis-node-0                           1Gi           csi-azuredisk-vsc   snapcontent-f5be1583-7bc3-48c9-b089-1696116840fa   3s             3s
```

Our StorageClass for the PremiumV2_LRS disks looks as follows:

```yaml
---
allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: default-v2
parameters:
  cachingMode: None
  storageaccounttype: PremiumV2_LRS
  fstype: ext4
  DiskIOPSReadWrite: "3000"
  DiskMBpsReadWrite: "125"
provisioner: disk.csi.azure.com
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
```
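
The StorageClass used for the Premium_LRS comparison was not posted; a minimal class for such a test might look like the following. This is a sketch only: it mirrors default-v2 with a different SKU, drops the PremiumV2-only IOPS/MBps parameters, and the name default-premium is made up:

```bash
# Hypothetical StorageClass for the Premium_LRS comparison test (not from the issue).
kubectl apply -f - <<'EOF'
allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: default-premium
parameters:
  storageaccounttype: Premium_LRS
  fstype: ext4
provisioner: disk.csi.azure.com
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
EOF
```
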
umagnus commented 1 week ago

I tested this scenario. The reason is that snapshot creation for a PremiumV2_LRS disk really does take more time. Although the snapshot shows up in the portal, its completion percent is not 100 but 0, and only after about 8 minutes does it change to 100. The CSI driver waits for the snapshot to be completed before setting it to the ReadyToUse true state.

(Portal screenshots of the snapshot's completion percent were attached.)

monotek commented 1 week ago

That's bad, as Velero for example waits until a backup is finished before it starts the next one. We have to take 200+ PVC snapshots (at roughly 8 minutes each, that is well over 24 hours if taken sequentially), so we can't complete our backups in time (every 6 hours) anymore with PremiumV2_LRS :(

kaitimmer commented 1 week ago

I also think that being able to set this option: https://github.com/kubernetes-sigs/azuredisk-csi-driver/blob/ec99ab6db2aab7d2754ec88c2a29023e8a1fa475/pkg/azuredisk/azuredisk_option.go#L99

would most likely solve this, right? However, this option is not exposed anywhere and would need to be set on the CSI driver's DaemonSet, which always gets overwritten by the control plane. Is there a way to test this?
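
For reference, one way to see which arguments the managed driver components currently run with is to dump the container args. This is a sketch; the names csi-azuredisk-node / csi-azuredisk-controller and the kube-system namespace are the defaults of the AKS-managed driver, and any manual change is expected to be reverted by the control plane as noted above:

```bash
# Print the container args of the managed Azure Disk CSI driver node DaemonSet
# (and, if it is visible in your cluster, the controller Deployment).
kubectl -n kube-system get ds csi-azuredisk-node \
  -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{": "}{.args}{"\n"}{end}'
kubectl -n kube-system get deploy csi-azuredisk-controller \
  -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{": "}{.args}{"\n"}{end}' 2>/dev/null \
  || echo "csi-azuredisk-controller Deployment not found (it may run on the managed control plane)"
```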