ceph / ceph-csi

CSI driver for Ceph
Apache License 2.0

rbd plugin fails to unmap device on worker nodes #3873

Closed sbskas closed 1 year ago

sbskas commented 1 year ago

Describe the bug

The ceph-csi node plugin fails to unmap the rbd device from worker nodes where the volume is unstaged.

Environment details

Steps to reproduce

Steps to reproduce the behavior:

  1. Setup details: Create a PV and use it for a while.
  2. Deployment to trigger the issue: On pod deletion, the filesystem is not unmounted and the rbd device cannot be unmapped.
  3. See error.

Actual results

The VolumeAttachment from the previous node is still present, the PV is still mapped, and the filesystem is still mounted. The filesystem can be unmounted manually without any issue, but the plugin never unmounts it.

mount |grep /dev/rbd0
/dev/rbd0 on /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-5ac0ed17-290c-4675-acac-9de4a01320ae/globalmount/0001-0024-c1d912c9-562f-40a3-85fd-4c7f3a97251f-0000000000000002-a010aab6-9632-11ea-a13c-fea68445fd19 type xfs (rw,relatime,nouuid,attr2,discard,inode64,sunit=8192,swidth=8192,noquota,_netdev)

The new pod is unable to attach the volume and is thus unable to start.
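The stale attachment can be confirmed from the API before digging into the node. A small sketch (the PV name is the one from this report; substitute your own):

```shell
# List any VolumeAttachment still referencing the released PV.
# The stale entry is what blocks the new pod from attaching.
PV=pvc-5ac0ed17-290c-4675-acac-9de4a01320ae
kubectl get volumeattachment | grep "$PV"
```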

Expected behavior

The PVC filesystem should be unmounted, the rbd device should be unmapped, and the previous VolumeAttachment should be deleted.

Logs

Please find an example from production. Here an Elasticsearch PV is being deleted after cluster decommissioning.

apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/provisioned-by: rbd.csi.ceph.com
  creationTimestamp: "2020-05-14T22:31:17Z"
  finalizers:
  - kubernetes.io/pv-protection
  - external-provisioner.volume.kubernetes.io/finalizer
  name: pvc-5ac0ed17-290c-4675-acac-9de4a01320ae
  resourceVersion: "1327277445"
  uid: 618ec7d5-b0ed-4998-9ab4-45b90690d6ea
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 5Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: storage-elasticsearch-master-0
    namespace: elk-cluster
    resourceVersion: "303589553"
    uid: 5ac0ed17-290c-4675-acac-9de4a01320ae
  csi:
    driver: rbd.csi.ceph.com
    fsType: xfs
    nodeStageSecretRef:
      name: csi-rbd-publish-secret
      namespace: kube-system
    volumeAttributes:
      clusterID: c1d912c9-562f-40a3-85fd-4c7f3a97251f
      imageFeatures: layering
      imageFormat: "2"
      pool: block
      storage.kubernetes.io/csiProvisionerIdentity: 1589487732757-8081-rbd.csi.ceph.com
    volumeHandle: 0001-0024-c1d912c9-562f-40a3-85fd-4c7f3a97251f-0000000000000002-a010aab6-9632-11ea-a13c-fea68445fd19
  mountOptions:
  - discard
  persistentVolumeReclaimPolicy: Delete
  storageClassName: csi-rbd
  volumeMode: Filesystem
status:
  phase: Released

Describe PV:

Name:            pvc-5ac0ed17-290c-4675-acac-9de4a01320ae
Labels:          <none>
Annotations:     pv.kubernetes.io/provisioned-by: rbd.csi.ceph.com
Finalizers:      [kubernetes.io/pv-protection external-provisioner.volume.kubernetes.io/finalizer]
StorageClass:    csi-rbd
Status:          Released
Claim:           elk-cluster/storage-elasticsearch-master-0
Reclaim Policy:  Delete
Access Modes:    RWO
VolumeMode:      Filesystem
Capacity:        5Gi
Node Affinity:   <none>
Message:         
Source:
    Type:              CSI (a Container Storage Interface (CSI) volume source)
    Driver:            rbd.csi.ceph.com
    FSType:            xfs
    VolumeHandle:      0001-0024-c1d912c9-562f-40a3-85fd-4c7f3a97251f-0000000000000002-a010aab6-9632-11ea-a13c-fea68445fd19
    ReadOnly:          false
    VolumeAttributes:      clusterID=c1d912c9-562f-40a3-85fd-4c7f3a97251f
                           imageFeatures=layering
                           imageFormat=2
                           pool=block
                           storage.kubernetes.io/csiProvisionerIdentity=1589487732757-8081-rbd.csi.ceph.com
Events:
  Type     Reason              Age                  From                                                                                                      Message
  ----     ------              ----                 ----                                                                                                      -------
  Warning  VolumeFailedDelete  48s (x3742 over 9d)  rbd.csi.ceph.com_ceph-csi-ceph-csi-rbd-provisioner-7696f45484-rkz7b_4a91dc2c-80ea-45ce-8062-efa978792205  rpc error: code = Internal desc = rbd csi-vol-a010aab6-9632-11ea-a13c-fea68445fd19 is still being used

By looking at the volumeAttachment:

Name:         csi-7c566884d9a1a1485f0edee0576e49f089dad7fd4c1d468cf0a2ed9cdc974f60
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  storage.k8s.io/v1
Kind:         VolumeAttachment
Metadata:
  Creation Timestamp:  2022-09-20T18:22:20Z
  Managed Fields:
    API Version:  storage.k8s.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:attached:
    Manager:      Go-http-client
    Operation:    Update
    Subresource:  status
    Time:         2022-09-20T18:22:20Z
    API Version:  storage.k8s.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        f:attacher:
        f:nodeName:
        f:source:
          f:persistentVolumeName:
    Manager:         kube-controller-manager
    Operation:       Update
    Time:            2022-09-20T18:22:20Z
  Resource Version:  1103890615
  UID:               64ed3d23-f606-4564-93f8-0a74886ba7ee
Spec:
  Attacher:   rbd.csi.ceph.com
  Node Name:  node13-pic-pr
  Source:
    Persistent Volume Name:  pvc-5ac0ed17-290c-4675-acac-9de4a01320ae
Status:
  Attached:  true
Events:      <none>

Looking at the rbdplugin logs on node13 indeed shows the problem:

csi-rbdplugin I0601 07:51:21.602840    8497 utils.go:195] ID: 1914438 GRPC call: /csi.v1.Identity/Probe
csi-rbdplugin I0601 07:51:21.602894    8497 utils.go:206] ID: 1914438 GRPC request: {}
csi-rbdplugin I0601 07:51:21.602919    8497 utils.go:212] ID: 1914438 GRPC response: {}
csi-rbdplugin I0601 07:52:02.566016    8497 utils.go:195] ID: 1914439 GRPC call: /csi.v1.Node/NodeGetCapabilities
csi-rbdplugin I0601 07:52:02.566062    8497 utils.go:206] ID: 1914439 GRPC request: {}
csi-rbdplugin I0601 07:52:02.566154    8497 utils.go:212] ID: 1914439 GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}},{"Type":{"Rpc":{"type":2}}},{"Type":{"Rpc":{"type":3}}},{"Type":{"Rpc":{"type":5}}}]}
csi-rbdplugin I0601 07:52:02.566699    8497 utils.go:195] ID: 1914440 GRPC call: /csi.v1.Node/NodeGetVolumeStats
csi-rbdplugin I0601 07:52:02.566751    8497 utils.go:206] ID: 1914440 GRPC request: {"volume_id":"0001-0024-c1d912c9-562f-40a3-85fd-4c7f3a97251f-0000000000000002-ab402f70-9624-11ea-a13c-fea68445fd19","volume_path":"/var/lib/kubelet/pods/ea78a831-b1f5-4864-ac94-debb37de6106/volumes/kubernetes.io~csi/pvc-c946184c-8748-4e69-96ce-7f85b6175dd6/mount"}
csi-rbdplugin I0601 07:52:02.566843    8497 utils.go:212] ID: 1914440 GRPC response: {"usage":[{"available":34961813504,"total":53660876800,"unit":1,"used":18699063296},{"available":26212216,"total":26214400,"unit":2,"used":2184}]}
csi-rbdplugin I0601 07:52:21.648195    8497 utils.go:195] ID: 1914441 GRPC call: /csi.v1.Identity/Probe
csi-rbdplugin I0601 07:52:21.648294    8497 utils.go:206] ID: 1914441 GRPC request: {}
csi-rbdplugin I0601 07:52:21.648333    8497 utils.go:212] ID: 1914441 GRPC response: {}
csi-rbdplugin I0601 07:52:25.505730    8497 utils.go:195] ID: 1914442 GRPC call: /csi.v1.Node/NodeGetCapabilities
csi-rbdplugin I0601 07:52:25.505772    8497 utils.go:206] ID: 1914442 GRPC request: {}
csi-rbdplugin I0601 07:52:25.505850    8497 utils.go:212] ID: 1914442 GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}},{"Type":{"Rpc":{"type":2}}},{"Type":{"Rpc":{"type":3}}},{"Type":{"Rpc":{"type":5}}}]}
csi-rbdplugin I0601 07:52:25.506479    8497 utils.go:195] ID: 1914443 GRPC call: /csi.v1.Node/NodeGetVolumeStats
csi-rbdplugin I0601 07:52:25.506518    8497 utils.go:206] ID: 1914443 GRPC request: {"volume_id":"0001-0024-c1d912c9-562f-40a3-85fd-4c7f3a97251f-0000000000000002-72745474-9622-11ea-a13c-fea68445fd19","volume_path":"/var/lib/kubelet/pods/028c1fa9-6883-41f5-8e6b-4fbf553aed28/volumes/kubernetes.io~csi/pvc-4e92600a-42d8-444f-b6ba-ca9bef3490fa/mount"}
csi-rbdplugin I0601 07:52:25.506616    8497 utils.go:212] ID: 1914443 GRPC response: {"usage":[{"available":53618704384,"total":53660876800,"unit":1,"used":42172416},{"available":26212216,"total":26214400,"unit":2,"used":2184}]}
csi-rbdplugin I0601 07:52:30.487879    8497 utils.go:195] ID: 1914444 GRPC call: /csi.v1.Node/NodeGetCapabilities
csi-rbdplugin I0601 07:52:30.487943    8497 utils.go:206] ID: 1914444 GRPC request: {}
csi-rbdplugin I0601 07:52:30.488062    8497 utils.go:212] ID: 1914444 GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}},{"Type":{"Rpc":{"type":2}}},{"Type":{"Rpc":{"type":3}}},{"Type":{"Rpc":{"type":5}}}]}
csi-rbdplugin I0601 07:52:30.488904    8497 utils.go:195] ID: 1914445 GRPC call: /csi.v1.Node/NodeGetVolumeStats
csi-rbdplugin I0601 07:52:30.488948    8497 utils.go:206] ID: 1914445 GRPC request: {"volume_id":"0001-0024-c1d912c9-562f-40a3-85fd-4c7f3a97251f-0000000000000002-4070addc-bf81-11ea-baff-769a641e01fa","volume_path":"/var/lib/kubelet/pods/c7031226-9d88-491b-a122-0e669c30309e/volumes/kubernetes.io~csi/pvc-3b860149-627b-40e2-84aa-bca31382eeff/mount"}
csi-rbdplugin I0601 07:52:30.489053    8497 utils.go:212] ID: 1914445 GRPC response: {"usage":[{"available":18701361152,"total":134191513600,"unit":1,"used":115490152448},{"available":36526886,"total":36527888,"unit":2,"used":1002}]}
csi-rbdplugin I0601 07:53:13.691568    8497 utils.go:195] ID: 1914446 GRPC call: /csi.v1.Node/NodeGetCapabilities
csi-rbdplugin I0601 07:53:13.691627    8497 utils.go:206] ID: 1914446 GRPC request: {}
csi-rbdplugin I0601 07:53:13.691739    8497 utils.go:212] ID: 1914446 GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}},{"Type":{"Rpc":{"type":2}}},{"Type":{"Rpc":{"type":3}}},{"Type":{"Rpc":{"type":5}}}]}
csi-rbdplugin I0601 07:53:13.694409    8497 utils.go:195] ID: 1914447 GRPC call: /csi.v1.Node/NodeGetVolumeStats
csi-rbdplugin I0601 07:53:13.694476    8497 utils.go:206] ID: 1914447 GRPC request: {"volume_id":"0001-0024-c1d912c9-562f-40a3-85fd-4c7f3a97251f-0000000000000002-ab402f70-9624-11ea-a13c-fea68445fd19","volume_path":"/var/lib/kubelet/pods/ea78a831-b1f5-4864-ac94-debb37de6106/volumes/kubernetes.io~csi/pvc-c946184c-8748-4e69-96ce-7f85b6175dd6/mount"}
csi-rbdplugin I0601 07:53:13.694598    8497 utils.go:212] ID: 1914447 GRPC response: {"usage":[{"available":34961813504,"total":53660876800,"unit":1,"used":18699063296},{"available":26212216,"total":26214400,"unit":2,"used":2184}]}
csi-rbdplugin I0601 07:53:21.612403    8497 utils.go:195] ID: 1914448 GRPC call: /csi.v1.Identity/Probe
csi-rbdplugin I0601 07:53:21.612479    8497 utils.go:206] ID: 1914448 GRPC request: {}
csi-rbdplugin I0601 07:53:21.612513    8497 utils.go:212] ID: 1914448 GRPC response: {}
csi-rbdplugin I0601 07:53:22.884139    8497 utils.go:195] ID: 1914449 GRPC call: /csi.v1.Node/NodeGetCapabilities
csi-rbdplugin I0601 07:53:22.884203    8497 utils.go:206] ID: 1914449 GRPC request: {}
csi-rbdplugin I0601 07:53:22.884322    8497 utils.go:212] ID: 1914449 GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}},{"Type":{"Rpc":{"type":2}}},{"Type":{"Rpc":{"type":3}}},{"Type":{"Rpc":{"type":5}}}]}
csi-rbdplugin I0601 07:53:22.885070    8497 utils.go:195] ID: 1914450 Req-ID: 0001-0024-c1d912c9-562f-40a3-85fd-4c7f3a97251f-0000000000000002-a010aab6-9632-11ea-a13c-fea68445fd19 GRPC call: /csi.v1.Node/NodeUnstageVolume
csi-rbdplugin I0601 07:53:22.885116    8497 utils.go:206] ID: 1914450 Req-ID: 0001-0024-c1d912c9-562f-40a3-85fd-4c7f3a97251f-0000000000000002-a010aab6-9632-11ea-a13c-fea68445fd19 GRPC request: {"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/rbd.csi.ceph.com/134a636c21d410463f2527fae0dbdabd457280491fe6000d016dffd80c38c4de/globalmount","volume_id":"0001-0024-c1d912c9-562f-40a3-85fd-4c7f3a97251f-0000000000000002-a010aab6-9632-11ea-a13c-fea68445fd19"}
csi-rbdplugin I0601 07:53:23.176015    8497 cephcmds.go:98] ID: 1914450 Req-ID: 0001-0024-c1d912c9-562f-40a3-85fd-4c7f3a97251f-0000000000000002-a010aab6-9632-11ea-a13c-fea68445fd19 an error (exit status 16) occurred while running rbd args: [unmap block/csi-vol-a010aab6-9632-11ea-a13c-fea68445fd19 --device-type krbd --options noudev]
csi-rbdplugin E0601 07:53:23.176066    8497 nodeserver.go:992] ID: 1914450 Req-ID: 0001-0024-c1d912c9-562f-40a3-85fd-4c7f3a97251f-0000000000000002-a010aab6-9632-11ea-a13c-fea68445fd19 error unmapping volume (0001-0024-c1d912c9-562f-40a3-85fd-4c7f3a97251f-0000000000000002-a010aab6-9632-11ea-a13c-fea68445fd19) from staging path (/var/lib/kubelet/plugins/kubernetes.io/csi/rbd.csi.ceph.com/134a636c21d410463f2527fae0dbdabd457280491fe6000d016dffd80c38c4de/globalmount/0001-0024-c1d912c9-562f-40a3-85fd-4c7f3a97251f-0000000000000002-a010aab6-9632-11ea-a13c-fea68445fd19): (rbd: unmap for spec (block/csi-vol-a010aab6-9632-11ea-a13c-fea68445fd19) failed (an error (exit status 16) occurred while running rbd args: [unmap block/csi-vol-a010aab6-9632-11ea-a13c-fea68445fd19 --device-type krbd --options noudev]): (rbd: sysfs write failed
csi-rbdplugin rbd: unmap failed: (16) Device or resource busy
csi-rbdplugin ))
csi-rbdplugin E0601 07:53:23.176096    8497 utils.go:210] ID: 1914450 Req-ID: 0001-0024-c1d912c9-562f-40a3-85fd-4c7f3a97251f-0000000000000002-a010aab6-9632-11ea-a13c-fea68445fd19 GRPC error: rpc error: code = Internal desc = rbd: unmap for spec (block/csi-vol-a010aab6-9632-11ea-a13c-fea68445fd19) failed (an error (exit status 16) occurred while running rbd args: [unmap block/csi-vol-a010aab6-9632-11ea-a13c-fea68445fd19 --device-type krbd --options noudev]): (rbd: sysfs write failed
csi-rbdplugin rbd: unmap failed: (16) Device or resource busy
csi-rbdplugin )

On node13, the volume is still mounted:

 mount |grep a010aab6-9632-11ea-a13c-fea68445fd19
/dev/rbd0 on /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-5ac0ed17-290c-4675-acac-9de4a01320ae/globalmount/0001-0024-c1d912c9-562f-40a3-85fd-4c7f3a97251f-0000000000000002-a010aab6-9632-11ea-a13c-fea68445fd19 type xfs (rw,relatime,nouuid,attr2,discard,inode64,sunit=8192,swidth=8192,noquota,_netdev)

 rbd showmapped
id pool  namespace image                                        snap device    
0  block           csi-vol-a010aab6-9632-11ea-a13c-fea68445fd19 -    /dev/rbd0 
1  block           csi-vol-ab402f70-9624-11ea-a13c-fea68445fd19 -    /dev/rbd1 
2  block           csi-vol-4070addc-bf81-11ea-baff-769a641e01fa -    /dev/rbd2 
4  block           csi-vol-72745474-9622-11ea-a13c-fea68445fd19 -    /dev/rbd4 

Unmounting the filesystem manually works fine.
rbd unmap fails with error 16 (Device or resource busy), yet lsof shows no usage. We had to run rbd unmap -o force to release the PV.

This behaviour is present from Nautilus to Quincy. We browsed previous issues regarding multipathd usage and the like, with no luck.
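For reference, the manual recovery can be sketched as a script. Assumptions: the pool and image names come from this issue, no rbd namespace is in use (so the namespace column of `rbd showmapped` is empty), and this is a workaround, not a fix:

```shell
#!/bin/sh
# Manual recovery sketch -- run on the affected node.
# Pool/image names taken from this issue; substitute your own.
POOL=block
IMAGE=csi-vol-a010aab6-9632-11ea-a13c-fea68445fd19

# Find the device the image is mapped to (field 5 of `rbd showmapped`
# output when the namespace column is empty).
DEV=$(rbd showmapped | awk -v img="$IMAGE" '$3 == img {print $5}')

# Unmount every mountpoint still referencing the device.
mount | awk -v dev="$DEV" '$1 == dev {print $3}' | while read -r mp; do
    umount "$mp"
done

# Verify nothing still holds the device open, then force-unmap as a
# last resort.
lsof "$DEV" || true
rbd unmap -o force "$POOL/$IMAGE"
```

Note that `-o force` aborts in-flight I/O, so it should only be used after confirming that nothing is really using the device.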

Problems:

  1. Possibly a logic problem: all mounts of this filesystem should be unmounted before unmapping the device.
  2. The rbd device is not unmappable, and we have no clue why.

nixpanic commented 1 year ago

The logs show a /csi.v1.Node/NodeUnstageVolume call that failed because the filesystem is keeping the RBD device open. There should have been a /csi.v1.Node/NodeUnpublishVolume call somewhere too; that call triggers the unmounting of the filesystem.
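A quick way to check whether the unpublish call ever arrived for this volume is to grep the captured node plugin log (a sketch; `plugin.log` is a placeholder for wherever the csi-rbdplugin container log was saved):

```shell
# Compare unstage vs unpublish calls for the affected volume handle.
VOL=0001-0024-c1d912c9-562f-40a3-85fd-4c7f3a97251f-0000000000000002-a010aab6-9632-11ea-a13c-fea68445fd19

# Unstage calls (present in the excerpt above):
grep "Req-ID: $VOL" plugin.log | grep 'NodeUnstageVolume'

# Unpublish calls (should appear before a successful unstage):
grep "Req-ID: $VOL" plugin.log | grep 'NodeUnpublishVolume'
```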

Could you share the complete csi-rbdplugin container logs from the node where the problem occurred? Is this something that happened once, or does it happen constantly?

sbskas commented 1 year ago

Here's the csi-rbdplugin log. It happens quite frequently.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

github-actions[bot] commented 1 year ago

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.

bg502 commented 5 months ago

Any updates on this? Faced the same issue after upgrading k8s from 1.23 to 1.24.