Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

[BUG] PV cloning fails when source PV is bound to a pod running Windows containers #3468

Open · lippertmarkus opened this issue 1 year ago

lippertmarkus commented 1 year ago

Describe the bug
PV cloning fails when the source PV is bound to a pod running Windows containers. However, it works if the source pod is stopped.

To Reproduce
Steps to reproduce the behavior:

  1. Create source PVC and Windows Pod: kubectl apply -f pv-src.yml:
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: pvc-azuredisk
    spec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi
      storageClassName: managed-csi
    ---
    kind: Pod
    apiVersion: v1
    metadata:
      name: nginx-azuredisk
    spec:
      securityContext:
        windowsOptions:
          runAsUserName: "ContainerAdministrator"
      nodeSelector:
        kubernetes.io/os: windows
      containers:
        - image: mcr.microsoft.com/windows/nanoserver:ltsc2022
          name: nginx-azuredisk
          command:
            - "cmd"
            - "/c"
            - "ping -t 127.0.0.1"
          volumeMounts:
            - name: azuredisk01
              mountPath: "/mnt/azuredisk"
      volumes:
        - name: azuredisk01
          persistentVolumeClaim:
            claimName: pvc-azuredisk
  2. Create clone PVC and Windows pod: kubectl apply -f pv-clone.yml:
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: pvc-azuredisk-cloning
    spec:
      accessModes:
        - ReadWriteOnce
      storageClassName: managed-csi
      resources:
        requests:
          storage: 10Gi
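      # dataSource makes this PVC a clone of the existing pvc-azuredisk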
      dataSource:
        kind: PersistentVolumeClaim
        name: pvc-azuredisk
    ---
    kind: Pod
    apiVersion: v1
    metadata:
      name: nginx-restored-cloning
    spec:
      securityContext:
        windowsOptions:
          runAsUserName: "ContainerAdministrator"
      nodeSelector:
        kubernetes.io/os: windows
      containers:
        - image: mcr.microsoft.com/windows/nanoserver:ltsc2022
          name: nginx-azuredisk
          command:
            - "cmd"
            - "/c"
            - "ping -t 127.0.0.1"
          volumeMounts:
            - name: azuredisk-cloning
              mountPath: "/mnt/azuredisk"
      volumes:
        - name: azuredisk-cloning
          persistentVolumeClaim:
            claimName: pvc-azuredisk-cloning
  3. Pod nginx-restored-cloning fails to start with the following error (see the inspection commands after this list):

    MountVolume.MountDevice failed for volume "pvc-6637d05d-281e-49a3-8caf-f9f965b6f5a7" : rpc error: code = Internal desc = could not format 6(lun: 4), and mount it at \var\lib\kubelet\plugins\kubernetes.io\csi\disk.csi.azure.com\2ee8ddb68bd3bcc5801a813142445159de5daf816e5f3186d08c0b02cd657fd0\globalmount, failed with rpc error: code = Unknown desc = volume id empty
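If you are reproducing this, the failure surfaces in the clone pod's events. A quick sketch of how to inspect it, using the resource names from the manifests above:

    kubectl get pvc pvc-azuredisk-cloning         # check the clone PVC
    kubectl describe pod nginx-restored-cloning   # events contain the MountVolume.MountDevice error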

However, if the source pod is stopped before creating the clone, it works.
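For completeness, a minimal sketch of that workaround, assuming the pv-src.yml and pv-clone.yml manifests above (this is just the ordering that currently works, not a fix):

    kubectl delete pod nginx-azuredisk          # stop the source pod so the disk is detached
    kubectl apply -f pv-clone.yml               # clone PVC and pod now mount successfully
    kubectl wait --for=condition=Ready pod/nginx-restored-cloning --timeout=5m
    kubectl apply -f pv-src.yml                 # recreate the source pod once the clone is up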

Expected behavior
As it does with Linux, volume cloning should also work while the source Windows pod is still running.

Screenshots

Events:
  Type     Reason                  Age                  From                     Message
  ----     ------                  ----                 ----                     -------
  Normal   Scheduled               3m34s                default-scheduler        Successfully assigned default/nginx-restored-cloning to akswin2000001
  Normal   SuccessfulAttachVolume  3m22s                attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-6637d05d-281e-49a3-8caf-f9f965b6f5a7"
  Warning  FailedMount             91s                  kubelet                  Unable to attach or mount volumes: unmounted volumes=[azuredisk-cloning], unattached volumes=[azuredisk-cloning kube-api-access-tkf8d]: timed out waiting for the condition
  Warning  FailedMount             15s (x9 over 3m16s)  kubelet                  MountVolume.MountDevice failed for volume "pvc-6637d05d-281e-49a3-8caf-f9f965b6f5a7" : rpc error: code = Internal desc = could not format 6(lun: 4), and mount it at \var\lib\kubelet\plugins\kubernetes.io\csi\disk.csi.azure.com\2ee8ddb68bd3bcc5801a813142445159de5daf816e5f3186d08c0b02cd657fd0\globalmount, failed with rpc error: code = Unknown desc = volume id empty

Environment (please complete the following information):

Additional context
The same issue occurs with Volume Snapshots.
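For reference, a minimal VolumeSnapshot manifest that hits the same symptom while the source Windows pod is running. The snapshot name is illustrative, and csi-azuredisk-vsc is assumed to be the cluster's built-in VolumeSnapshotClass:

    apiVersion: snapshot.storage.k8s.io/v1
    kind: VolumeSnapshot
    metadata:
      name: azuredisk-snapshot                    # illustrative name
    spec:
      volumeSnapshotClassName: csi-azuredisk-vsc  # assumed built-in class on AKS
      source:
        persistentVolumeClaimName: pvc-azuredisk  # source PVC from the manifests above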

andyzhangx commented 1 year ago

@lippertmarkus it's related to this upstream csi-proxy issue: https://github.com/kubernetes-csi/csi-proxy/issues/287#issuecomment-1408522100. If you mount that disk PV on another node, is it recognized?

lippertmarkus commented 1 year ago

You're right, it works when the clone pod lands on a different node than the source pod.
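Until the driver fix lands, one possible stop-gap (a sketch, not something from this thread) is to keep the clone pod off the node running the source pod, for example with pod anti-affinity. This assumes an extra label app: azuredisk-src added to the source pod, which the manifests above do not carry:

    # fragment to merge into the clone pod's spec
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: azuredisk-src                # assumed label on the source pod
            topologyKey: kubernetes.io/hostname   # never co-schedule on the same node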

ghost commented 1 year ago

Action required from @Azure/aks-pm

ghost commented 1 year ago

Issue needing attention of @Azure/aks-leads

andyzhangx commented 1 year ago

would be fixed by https://github.com/kubernetes-sigs/azuredisk-csi-driver/pull/1909

andyzhangx commented 1 year ago

Update: this issue was fixed in the upstream Azure Disk CSI driver v1.28.2. On AKS it will be fixed only on AKS 1.27, with the v1.28.2 rollout in the AKS 0813 or 0820 release; we will try to catch the 0813 release.
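A quick way to check whether the fixed driver has reached a given cluster is to look at the image tags of the CSI node pods. A sketch, assuming the standard app=csi-azuredisk-node label used by the driver's DaemonSet in kube-system:

    kubectl get pods -n kube-system -l app=csi-azuredisk-node \
      -o jsonpath='{range .items[*]}{.spec.containers[*].image}{"\n"}{end}' | grep azuredisk-csi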