Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

[BUG] CSI snapshot controller may break whole subscription if ShareSnapshotCountExceeded #4394

Open jkroepke opened 3 weeks ago

jkroepke commented 3 weeks ago

Describe the bug We are using Velero together with CSI snapshots. We also take CSI snapshots of Azure Files shares.

It turns out that once ShareSnapshotCountExceeded is hit, the CSI snapshot controller retries creating new snapshots indefinitely.

After some days (because Velero keeps creating more snapshot requests), the number of requests hits the Storage resource provider request limits (https://learn.microsoft.com/en-us/azure/azure-resource-manager/management/request-limits-and-throttling#storage-throttling), e.g. more than 800 requests per 5 minutes. This affects operations on all storage accounts in the subscription.
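To illustrate why the limit is reached only after some days, here is a rough back-of-the-envelope sketch. The 800 requests per 5 minutes figure is the one quoted above; the fixed retry interval is purely an assumption for illustration, since the controller's real retry timing is not visible to us.

```python
# Illustrative arithmetic only. The retry interval is an assumed value and is
# NOT taken from the CSI snapshot controller.
LIMIT_REQUESTS = 800      # limit quoted in this issue
WINDOW_SECONDS = 5 * 60   # "per 5 minutes"


def requests_per_window(stuck_snapshots: int, retry_interval_s: float) -> float:
    """Requests sent per window if every stuck snapshot is retried at a fixed interval."""
    return stuck_snapshots * WINDOW_SECONDS / retry_interval_s


def stuck_snapshots_to_hit_limit(retry_interval_s: float) -> int:
    """Smallest number of stuck snapshots whose retries exceed the limit."""
    return int(LIMIT_REQUESTS * retry_interval_s // WINDOW_SECONDS) + 1


if __name__ == "__main__":
    # Assuming one retry per minute per stuck snapshot:
    print(stuck_snapshots_to_hit_limit(60))
```

Under that assumed one-minute interval, each stuck snapshot contributes 5 requests per window, so the budget is exhausted once enough failed snapshot requests have piled up, which matches the delay we observed.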

To Reproduce Steps to reproduce the behavior:

  1. Set up more than 200 CSI snapshot requests against one PVC.
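A minimal sketch of how the manifests for this step could be generated. The PVC name, namespace and snapshot naming scheme are hypothetical placeholders, and `csi-azurefile-vsc` is only assumed as the VolumeSnapshotClass name. Be aware that applying the output to a real cluster may trigger the throttling described in this issue.

```python
def volume_snapshot_manifest(name, pvc_name, snapshot_class, namespace="default"):
    """Render one VolumeSnapshot manifest as YAML text."""
    return (
        "apiVersion: snapshot.storage.k8s.io/v1\n"
        "kind: VolumeSnapshot\n"
        "metadata:\n"
        f"  name: {name}\n"
        f"  namespace: {namespace}\n"
        "spec:\n"
        f"  volumeSnapshotClassName: {snapshot_class}\n"
        "  source:\n"
        f"    persistentVolumeClaimName: {pvc_name}\n"
    )


def repro_manifests(pvc_name, count=201, snapshot_class="csi-azurefile-vsc"):
    """Render `count` snapshot requests against the same PVC as one multi-document YAML."""
    docs = [
        volume_snapshot_manifest(f"{pvc_name}-snap-{i:03d}", pvc_name, snapshot_class)
        for i in range(count)
    ]
    return "---\n" + "---\n".join(docs)


if __name__ == "__main__":
    # "my-azurefile-pvc" is a placeholder; pipe the output to `kubectl apply -f -`.
    print(repro_manifests("my-azurefile-pvc"))
```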

Expected behavior The CSI snapshot controller shouldn't create such a huge number of requests. Since the controller runs inside the AKS control plane, we are unable to get logs from it.

The CSI snapshot controller should also recognize that 200 snapshots may already exist for one PVC and should correctly report the error in the Kubernetes events.
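To make the expected behavior concrete, here is a rough sketch of the retry policy I have in mind. It is not the controller's actual code: `SnapshotError`, `create_with_backoff`, the `report_event` callback and the `ServerBusy` code used in the usage example are hypothetical names for illustration. The idea is that ShareSnapshotCountExceeded is treated as terminal and reported once, while other errors are retried with capped exponential backoff.

```python
import time

# Error codes that retrying cannot fix (assumption: only the one from this issue).
TERMINAL_ERROR_CODES = {"ShareSnapshotCountExceeded"}


class SnapshotError(Exception):
    """Hypothetical error type carrying the storage error code."""

    def __init__(self, code, message=""):
        super().__init__(f"{code}: {message}")
        self.code = code


def create_with_backoff(create_snapshot, report_event, max_attempts=5,
                        base_delay_s=1.0, max_delay_s=300.0, sleep=time.sleep):
    """Call create_snapshot(); stop at once on terminal errors, back off on others."""
    for attempt in range(max_attempts):
        try:
            return create_snapshot()
        except SnapshotError as err:
            if err.code in TERMINAL_ERROR_CODES:
                # Surface the problem to the user instead of retrying forever.
                report_event(f"snapshot failed permanently: {err}")
                raise
            if attempt == max_attempts - 1:
                report_event(f"snapshot failed after {max_attempts} attempts: {err}")
                raise
            # Capped exponential backoff: 1s, 2s, 4s, ... up to max_delay_s.
            sleep(min(base_delay_s * 2 ** attempt, max_delay_s))
```

With something like this, a PVC that already has 200 snapshots would produce a single failed request and one event, rather than a continuous stream of requests against the storage resource provider.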

Screenshots [Screenshot 2024-07-10 at 13:22:55]


andyzhangx commented 3 weeks ago

Can you email me your AKS cluster FQDN? I could take a look.

andyzhangx commented 3 weeks ago

Could you add useDataPlaneAPI: "true" to the VolumeSnapshotClass? I think that would solve the problem.

---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-azurefile-vsc
driver: file.csi.azure.com
parameters:
  useDataPlaneAPI: "true"  # unlimited azure file api call
deletionPolicy: Delete

useDataPlaneAPI: specifies whether to use the data plane API for file share create/delete/resize. This could solve the SRP API throttling issue since the data plane API has almost no limit, but it fails when there is a firewall or VNet setting on the storage account.

jkroepke commented 3 weeks ago

The storage account is private and we have a private link to the VNet enabled. Is it expected to fail?

Edit: Reading https://github.com/kubernetes-sigs/azurefile-csi-driver/issues/1687, it seems so.

I guess VNet API Server Integration won't help here?

I hope that AKS will be added to the Azure trusted services in the future.

andyzhangx commented 3 weeks ago

That depends on the network setting of your storage account. If it's "Selected network ...", then useDataPlaneAPI won't work.
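As a rough way to check this up front, the sketch below inspects the parsed JSON output of `az storage account show`. It assumes that output exposes `publicNetworkAccess` and `networkRuleSet.defaultAction`, and that the "Selected network ..." setting corresponds to a default action of `Deny`; treating a disabled public endpoint as unusable as well is my own assumption. The function name is hypothetical.

```python
def data_plane_api_likely_usable(account: dict) -> bool:
    """Guess whether useDataPlaneAPI can work for this storage account.

    `account` is the parsed JSON from `az storage account show` (assumed field
    names: publicNetworkAccess, networkRuleSet.defaultAction).
    """
    # Assumption: with the public endpoint disabled, the data plane is unreachable.
    if account.get("publicNetworkAccess") == "Disabled":
        return False
    # Assumption: "Selected network ..." shows up as defaultAction == "Deny".
    rules = account.get("networkRuleSet") or {}
    return rules.get("defaultAction", "Allow") == "Allow"
```

If this returns False, the data plane API is probably not an option for that account; it is only a heuristic, not an authoritative check.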

jkroepke commented 3 weeks ago

> that depends on your network setting of your storage account, if it's "Selected network ...", then useDataPlaneAPI won't work.

Is there a documented IP range that I can configure?