Azure / AKS

Azure Kubernetes Service
[BUG] Dynamic volume provisioning with blob storage fails when using private endpoint #4242

Closed Laffs2k5 closed 2 weeks ago

Laffs2k5 commented 3 weeks ago

Describe the bug

We have been utilizing dynamic volume provisioning for our deployments in AKS for a while. This is realized using the Azure Blob Storage CSI driver for Kubernetes with a custom sc (StorageClass) and pvc (PersistentVolumeClaim) in accordance with the documentation found here: Create a persistent volume with Azure Blob storage in Azure Kubernetes Service (AKS).

This week it stopped working, failing with the following error message during volume provisioning with the sc:

failed to provision volume with StorageClass "sc-blob-storage-test-application-pr-203": rpc error: code = Internal desc = ensure storage account failed with create private DNS zone(privatelink.blob.core.windows.net) in resourceGroup(rg6-ss2-cm-net-dev): authenticated requests are not permitted for non TLS protected (https) endpoints

The issue occurs for all deployments with new PVCs. Deployments with already existing PVCs continue to run fine. The result is that we are prevented from deploying any new apps with storage backed by dynamic volume provisioning.

The first part of the description, ensure storage account failed with, is logged from blob-csi-driver/pkg/blob/controllerserver.go L386.

The next part, create private DNS zone(privatelink.blob.core.windows.net) in resourceGroup(rg6-ss2-cm-net-dev):, is logged from cloud-provider-azure/pkg/provider/azure_storageaccount.go L247. This code belongs to the out-of-tree cloud provider for Azure.

NOTE: code execution should never have gotten to line 247 in our case, as the private link DNS zone already exists and thus should have been retrieved on line 243 by az.privatednsclient.Get(). To understand the failure better I would very much like to know what is logged by line 244, klog.V(2).Infof("get private dns zone %s returned with %v", privateDNSZoneName, err.Error()). But I don't know where or how to enable this logging.

I've looked through the kube-system logs and identified that the csi-blob-node daemonset was updated to use the image mcr.microsoft.com/oss/kubernetes-csi/blob-csi:v1.23.4 in place of mcr.microsoft.com/oss/kubernetes-csi/blob-csi:v1.23.3 at 6AM on April 23rd. We have logs indicating that dynamic provisioning worked fine on April 22nd, and developers reported the first issue with dynamic provisioning on April 24th.

NOTE: we have not touched or modified the csi-blob-node daemonset ourselves; it is fully managed by AKS.

The cluster in question is on AKS 1.28.5, and the update to version 1.23.4 of the Blob Storage CSI driver corresponds well with the AKS Release 2024-04-11 which under Component Updates lists:

  • Upgraded Azure Blob CSI driver to v1.23.4 on AKS 1.28 and 1.29

If you are able to reproduce on your side, I would suggest downgrading the Blob Storage CSI driver to version 1.23.3 until the cause of the failure is understood and mitigated.

To Reproduce

  1. Have the private DNS zone privatelink.blob.core.windows.net already exist in the resource group of the AKS vnet.
  2. Assign the AKS identity Contributor role in the resource group hosting the AKS vnet.
  3. Have the blob storage CSI driver enabled: az aks update --enable-blob-driver
  4. Deploy sc + pvc + deployment to AKS (see examples of our declarations under Additional context below)

Expected behavior

  1. Storage account + blob is created
  2. Private DNS zone is updated with a record for the blob
  3. The AKS deployment provisions successfully

Screenshots

Not applicable.

Environment

Additional context

custom StorageClass

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
allowVolumeExpansion: true
metadata:
  labels:
  name: sc-blob-storage-test-application-pr-203
mountOptions:
  - '-o allow_other'
  - '--file-cache-timeout-in-seconds=120'
  - '--use-attr-cache=true'
  - '--cancel-list-on-mount-seconds=10'
  - '-o attr_timeout=120'
  - '-o entry_timeout=120'
  - '-o negative_timeout=120'
  - '--log-level=LOG_WARNING'
  - '--cache-size-mb=1000'
parameters:
  allowBlobPublicAccess: 'false'
  containerName: test-application-pr-203
  matchTags: 'true'
  networkEndpointType: privateEndpoint
  protocol: fuse2
  skuName: Standard_LRS
  tags: >-
    CreatedBy=Azure Kubernetes Service,ApplicationName=AKS application storage,Description=Storage account for ephemeral environments in AKS
provisioner: blob.csi.azure.com
reclaimPolicy: Delete
volumeBindingMode: Immediate
```

PersistentVolumeClaim

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    volume.beta.kubernetes.io/storage-class: sc-blob-storage-test-application-pr-203
  name: pvc-blob-storage-test-application-pr-203
  namespace: test-application-pr-203
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  storageClassName: sc-blob-storage-test-application-pr-203
```

Deployment (not a complete example, just the important parts)

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-application-pr-203-deployment
  namespace: test-application-pr-203
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test-application-pr-203-app
  template:
    metadata:
      labels:
        app: test-application-pr-203-app
        app.kubernetes.io/name: test-application-backend
        app.kubernetes.io/part-of: test-application
        app.kubernetes.io/version: pr-203-2024.04.25.53165
        azure.workload.identity/use: 'false'
    spec:
      containers:
        - env:
            - name: AZURE_PERSISTENT_STORAGE_MOUNT_PATH
              value: /mnt/azure-blob-storage
          image: /test-application-pr-203:pr-203-2024.04.25.53165
          name: test-application-pr-203-container
          volumeMounts:
            - mountPath: /mnt/azure-blob-storage
              name: blob-storage
              readOnly: false
      volumes:
        - name: blob-storage
          persistentVolumeClaim:
            claimName: pvc-blob-storage-test-application-pr-203
```

andyzhangx commented 3 weeks ago

pls email me your cluster info, we could upgrade your blob csi driver in backend to v1.24.1 which has the fix, thanks

ragatgen commented 3 weeks ago

This is also happening with azure file

Answer: Pod failing to schedule due to the error : file.csi.azure.com_csi-azurefile-controller-7786ddb966-bz9hn_4902be80-5e92-4246-82d3-94709310755e failed to provision volume with StorageClass 'trun-azurefile': rpc error: code = Internal desc = failed to ensure storage account: create private DNS zone(privatelink.file.core.windows.net) in resourceGroup(Test3SpInDevEnvRG): authenticated requests are not permitted for non TLS protected (https) endpoints

andyzhangx commented 3 weeks ago

pls upgrade to aks 1.29 which already has the fix. if you don't want to upgrade cluster version, pls email me your cluster info, I will help upgrade your azure file driver version in the backend, thanks.

Laffs2k5 commented 3 weeks ago

Ok thanks. I'll upgrade our clusters to 1.29 👍

microsoft-github-policy-service[bot] commented 2 weeks ago

Thanks for reaching out. I'm closing this issue as it was marked with "Answer Provided" and it hasn't had activity for 2 days.