@wilmardo can you provide the rbd node plugin logs to find out what happened? By the way, in the latest canary image we have fixed some issues related to the mounting failure.
Earlier we were not unmounting the rbd device if mounting failed; this has been fixed now, so if the pod moves to a different node you can still use the volume.
I grabbed the logs from the times above and saw nothing peculiar. So I deleted the pod again, but still nothing different from the previous logs. What would you expect to see in the logs? @Madhu-1
I0823 06:14:33.895938 12416 mount_linux.go:170] Cannot run systemd-run, assuming non-systemd OS
I0823 06:14:33.895979 12416 mount_linux.go:171] systemd-run failed with: exit status 1
I0823 06:14:33.895998 12416 mount_linux.go:172] systemd-run output: Failed to create bus connection: No such file or directory
I0823 06:14:33.896069 12416 utils.go:115] GRPC response: {"usage":[{"available":494272512,"total":520785920,"unit":1,"used":26513408},{"available":255996,"total":256000,"unit":2,"used":4}]}
I0823 06:14:57.408770 12416 utils.go:109] GRPC call: /csi.v1.Node/NodeGetCapabilities
I0823 06:14:57.408824 12416 utils.go:110] GRPC request: {}
I0823 06:14:57.410824 12416 utils.go:115] GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}},{"Type":{"Rpc":{"type":2}}}]}
I0823 06:14:57.413907 12416 utils.go:109] GRPC call: /csi.v1.Node/NodeGetVolumeStats
I0823 06:14:57.413943 12416 utils.go:110] GRPC request: {"volume_id":"0001-0024-9d9cc8f6-0843-46c3-8ca3-0309dff9978d-0000000000000003-a86804f9-b8f8-11e9-867b-aa9c979ac781","volume_path":"/var/lib/kubelet/pods/55c58b43-87af-4b65-b0ea-e4c0ebb26597/volumes/kubernetes.io~csi/pvc-4ae1e299-d2b5-4568-a489-a564d6073fbd/mount"}
I0823 06:14:57.417016 12416 mount_linux.go:170] Cannot run systemd-run, assuming non-systemd OS
I0823 06:14:57.417078 12416 mount_linux.go:171] systemd-run failed with: exit status 1
I0823 06:14:57.417118 12416 mount_linux.go:172] systemd-run output: Failed to create bus connection: No such file or directory
I0823 06:14:57.417224 12416 utils.go:115] GRPC response: {"usage":[{"available":5007634432,"total":5358223360,"unit":1,"used":350588928},{"available":2620356,"total":2621440,"unit":2,"used":1084}]}
I0823 06:15:40.245502 12416 utils.go:109] GRPC call: /csi.v1.Node/NodeGetCapabilities
I0823 06:15:40.245555 12416 utils.go:110] GRPC request: {}
I0823 06:15:40.247193 12416 utils.go:115] GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}},{"Type":{"Rpc":{"type":2}}}]}
I0823 06:15:40.250113 12416 utils.go:109] GRPC call: /csi.v1.Node/NodeGetVolumeStats
I0823 06:15:40.250148 12416 utils.go:110] GRPC request: {"volume_id":"0001-0024-9d9cc8f6-0843-46c3-8ca3-0309dff9978d-0000000000000003-a82a306c-b8f8-11e9-867b-aa9c979ac781","volume_path":"/var/lib/kubelet/pods/a2b478c3-3584-45b5-bc79-9cc4c04637b3/volumes/kubernetes.io~csi/pvc-3ba1f36e-fa31-46cc-a27d-21722b0d551a/mount"}
I0807 09:48:17.953286 12200 main.go:110] Version: v1.1.0-0-g80a94421
I0807 09:48:17.953366 12200 main.go:120] Attempting to open a gRPC connection with: "/csi/csi.sock"
I0807 09:48:17.953393 12200 connection.go:151] Connecting to unix:///csi/csi.sock
I0807 09:48:26.246243 12200 main.go:127] Calling CSI driver to discover driver name
I0807 09:48:26.246362 12200 connection.go:180] GRPC call: /csi.v1.Identity/GetPluginInfo
I0807 09:48:26.246376 12200 connection.go:181] GRPC request: {}
I0807 09:48:26.271310 12200 connection.go:183] GRPC response: {"name":"rbd.csi.ceph.com","vendor_version":"canary"}
I0807 09:48:26.272416 12200 connection.go:184] GRPC error: <nil>
I0807 09:48:26.272439 12200 main.go:137] CSI driver name: "rbd.csi.ceph.com"
I0807 09:48:26.272560 12200 node_register.go:54] Starting Registration Server at: /registration/rbd.csi.ceph.com-reg.sock
I0807 09:48:26.272735 12200 node_register.go:61] Registration Server started at: /registration/rbd.csi.ceph.com-reg.sock
I0807 09:48:26.931294 12200 main.go:77] Received GetInfo call: &InfoRequest{}
I0807 09:48:27.096205 12200 main.go:87] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}
I was expecting some error message in NodeStageVolume but am not able to see it.
By the way, in the latest canary image we have fixed some issues related to the mounting failure.
I will try this later today, right now I am working on the Helm chart :)
Tried it with the canary, but to no avail; still the same error. I dug some more and found this error in the nodeplugin:
I0824 18:42:17.399798 12416 nsenter.go:132] Running nsenter command: nsenter [--mount=/rootfs/proc/1/ns/mnt -- /usr/bin/realpath -e /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe/globalmount]
I0824 18:42:17.402761 12416 nsenter.go:194] failed to resolve symbolic links on /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe/globalmount: exit status 1
That led me to the directory in question, where I saw:
root@node05:/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe# ls -l
ls: cannot access 'globalmount': Input/output error
total 4
d????????? ? ? ? ? ? globalmount
-rw-r--r-- 1 root root 152 Aug 18 18:08 vol_data.json
So the globalmount directory is in a broken state: a stale mount that returns I/O errors.
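For reference, a sketch of how such a stale mount can be inspected on the node (this assumes the rbd-nbd mounter from the StorageClass below; rbd-nbd has to be available where you run it, e.g. in the rbdplugin container):
# list the rbd-nbd mappings to see which image is (still) mapped to which /dev/nbdX
rbd-nbd list-mapped
# check whether the kernel still holds a mount for this PV
grep pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe /proc/mounts
# nbd disconnects/timeouts usually show up in the kernel log
dmesg | grep -i nbd | tail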
I drained the node and left it drained, redeployed the pod, and it got scheduled to another node. There the error seems to be the same, only at another location, and this error keeps spamming:
I0824 19:00:47.347245 15873 nsenter.go:132] Running nsenter command: nsenter [--mount=/rootfs/proc/1/ns/mnt -- /usr/bin/realpath -e /var/lib/kubelet/pods/316d07de-9d07-4916-a595-2b33d966ab93/volumes/kubernetes.io~csi/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe/mount]
I0824 19:00:47.349883 15873 nsenter.go:194] failed to resolve symbolic links on /var/lib/kubelet/pods/316d07de-9d07-4916-a595-2b33d966ab93/volumes/kubernetes.io~csi/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe/mount: exit status 1
E0824 19:00:47.349934 15873 utils.go:109] GRPC error: rpc error: code = NotFound desc = exit status 1
I0824 19:00:47.648840 15873 utils.go:105] GRPC call: /csi.v1.Node/NodeUnpublishVolume
I0824 19:00:47.649222 15873 utils.go:106] GRPC request: {"target_path":"/var/lib/kubelet/pods/316d07de-9d07-4916-a595-2b33d966ab93/volumes/kubernetes.io~csi/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe/mount","volume_id":"0001-0024-9d9cc8f6-0843-46c3-8ca3-0309dff9978d-0000000000000003-28a66c73-bab2-11e9-903a-122b397f0010"}
I0824 19:00:47.649924 15873 nsenter.go:132] Running nsenter command: nsenter [--mount=/rootfs/proc/1/ns/mnt -- /usr/bin/realpath -e /var/lib/kubelet/pods/316d07de-9d07-4916-a595-2b33d966ab93/volumes/kubernetes.io~csi/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe/mount]
I0824 19:00:47.652700 15873 nsenter.go:194] failed to resolve symbolic links on /var/lib/kubelet/pods/316d07de-9d07-4916-a595-2b33d966ab93/volumes/kubernetes.io~csi/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe/mount: exit status 1
E0824 19:00:47.652810 15873 utils.go:109] GRPC error: rpc error: code = NotFound desc = exit status 1
I0824 19:00:47.951231 15873 utils.go:105] GRPC call: /csi.v1.Node/NodeUnpublishVolume
I0824 19:00:47.951287 15873 utils.go:106] GRPC request: {"target_path":"/var/lib/kubelet/pods/316d07de-9d07-4916-a595-2b33d966ab93/volumes/kubernetes.io~csi/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe/mount","volume_id":"0001-0024-9d9cc8f6-0843-46c3-8ca3-0309dff9978d-0000000000000003-28a66c73-bab2-11e9-903a-122b397f0010"}
I0824 19:00:47.952082 15873 nsenter.go:132] Running nsenter command: nsenter [--mount=/rootfs/proc/1/ns/mnt -- /usr/bin/realpath -e /var/lib/kubelet/pods/316d07de-9d07-4916-a595-2b33d966ab93/volumes/kubernetes.io~csi/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe/mount]
I0824 19:00:47.954777 15873 nsenter.go:194] failed to resolve symbolic links on /var/lib/kubelet/pods/316d07de-9d07-4916-a595-2b33d966ab93/volumes/kubernetes.io~csi/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe/mount: exit status 1
E0824 19:00:47.954853 15873 utils.go:109] GRPC error: rpc error: code = NotFound desc = exit status 1
root@node06:/var/lib/kubelet/pods/316d07de-9d07-4916-a595-2b33d966ab93/volumes/kubernetes.io~csi/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe# ls -l
ls: cannot access 'mount': Input/output error
total 4
d????????? ? ? ? ? ? mount
-rw-r--r-- 1 root root 339 Aug 12 09:58 vol_data.json
Also found this in the csi-attacher logs:
W0824 18:17:17.551275 1 trivial_handler.go:53] Error saving VolumeAttachment csi-469dcf8ba716aad45a0ee2b982e35265c774c48cf6dc9af048e3ee8e6fcb7e80 as attached: volumeattachments.storage.k8s.io "csi-469dcf8ba716aad45a0ee2b982e35265c774c48cf6dc9af048e3ee8e6fcb7e80" is forbidden: User "system:serviceaccount:storing:ceph-csi-rbd-provisioner" cannot patch resource "volumeattachments" in API group "storage.k8s.io" at the cluster scope
I added the patch verb to the ClusterRole rules, which fixed it. Should I include this in #570?
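For reference, the rule I ended up with looks roughly like this (a sketch; the ClusterRole name and the exact set of other verbs depend on your deployment manifests, the important addition being patch):
# extra rule for the ClusterRole bound to the provisioner/attacher service account
- apiGroups: ["storage.k8s.io"]
  resources: ["volumeattachments"]
  verbs: ["get", "list", "watch", "update", "patch"]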
@Madhu-1 Also with the new canary no pods can be recreated:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 2m59s default-scheduler Successfully assigned downloading/nzbget-67cd57dbbc-rkmz8 to node06
Warning FailedMount 61s kubelet, node06 MountVolume.SetUp failed for volume "pvc-3ba1f36e-fa31-46cc-a27d-21722b0d551a" : rpc error: code = Internal desc = missing ID field 'userID' in secrets
Warning FailedMount 56s kubelet, node06 Unable to mount volumes for pod "nzbget-67cd57dbbc-rkmz8_downloading(b0d7d1a4-823f-4607-ac86-a6ee7b721d14)": timeout expired waiting for volumes to attach or mount for pod "downloading"/"nzbget-67cd57dbbc-rkmz8". list of unmounted volumes=[nzbget-config-persistent-storage]. list of unattached volumes=[nzbget-config-persistent-storage downloads-persistent-storage nzbget-token-sctz4]
The userID is present in the secret and the config works with the latest release. I just rolled back to the v1.x images and it started working again without changing the config.
Edit:
That it was working after the downgrade seems to have been a lucky shot; now it isn't working again (same error message) with the release images. I will try to troubleshoot this further, but what I deployed was a working setup before I updated to the canary images.
@wilmardo the canary image needs a node-stage secret to mount the volume. Are you using the storage class from the master branch?
This is my current storageclass:
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-rbd-sc
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: rbd.csi.ceph.com
parameters:
  clusterID: <id>
  pool: kubernetes
  mounter: rbd-nbd
  imageFeatures: layering
  fsType: xfs
  imageFormat: "2"
  csi.storage.k8s.io/provisioner-secret-name: csi-rbd-secret
  csi.storage.k8s.io/provisioner-secret-namespace: storing
  csi.storage.k8s.io/node-stage-secret-name: csi-rbd-secret
  csi.storage.k8s.io/node-stage-secret-namespace: storing
reclaimPolicy: Delete
mountOptions:
  - discard
allowVolumeExpansion: true
The secret exists and has valid data:
Name: csi-rbd-secret
Namespace: storing
Labels: <none>
Annotations:
Type: Opaque
Data
====
userID: 10 bytes
userKey: 40 bytes
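For completeness, the secret has roughly this shape (a sketch; the values below are placeholders, not the real ones):
apiVersion: v1
kind: Secret
metadata:
  name: csi-rbd-secret
  namespace: storing
type: Opaque
stringData:
  userID: kubernetes                      # ceph client name without the "client." prefix (placeholder)
  userKey: <key from ceph auth get-key>   # placeholder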
I tried it again to test the Helm chart, but both the v1.1.0 and the canary tag of the quay.io/cephcsi/cephcsi image aren't working and throw the same error:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 13m default-scheduler Successfully assigned downloading/nzbget-558564dd95-qsxk5 to node06
Normal SuccessfulAttachVolume 13m attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-3ba1f36e-fa31-46cc-a27d-21722b0d551a"
Warning FailedMount 88s (x14 over 13m) kubelet, node06 MountVolume.SetUp failed for volume "pvc-3ba1f36e-fa31-46cc-a27d-21722b0d551a" : rpc error: code = Internal desc = missing ID field 'userID' in secrets
Warning FailedMount 36s (x6 over 11m) kubelet, node06 Unable to mount volumes for pod "nzbget-558564dd95-qsxk5_downloading(c72c186b-038d-45f2-9efe-cb6d19438a3e)": timeout expired waiting for volumes to attach or mount for pod "downloading"/"nzbget-558564dd95-qsxk5". list of unmounted volumes=[nzbget-config-persistent-storage]. list of unattached volumes=[nzbget-config-persistent-storage downloads-persistent-storage nzbget-token-6hhfl]
@wilmardo can you paste the secret if possible?
You have already attached the secret, I didn't notice.
I think there are several issues within this issue. First, the issue in the OP, which seems related to this in the node kubelet logs:
Aug 28 13:30:21 node06 kubelet[907]: E0828 13:30:21.345666 907 csi_mounter.go:366] kubernetes.io/csi: mounter.TearDownAt failed: rpc error: code = NotFound desc = exit status 1
Aug 28 13:30:21 node06 kubelet[907]: E0828 13:30:21.345900 907 nestedpendingoperations.go:270] Operation for "\"kubernetes.io/csi/rbd.csi.ceph.com^0001-0024-9d9cc8f6-0843-46c3-8ca3-0309dff9978d-0000000000000003-28a66c73-bab2-11e9-903a-122b397f0010\" (\"316d07de-9d07-4916-a595-2b33d966ab93\")" failed. No retries permitted until 2019-08-28 13:32:23.345817343 +0000 UTC m=+1396017.946979634 (durationBeforeRetry 2m2s). Error: "UnmountVolume.TearDown failed for volume \"plex-config-persistent-storage\" (UniqueName: \"kubernetes.io/csi/rbd.csi.ceph.com^0001-0024-9d9cc8f6-0843-46c3-8ca3-0309dff9978d-0000000000000003-28a66c73-bab2-11e9-903a-122b397f0010\") pod \"316d07de-9d07-4916-a595-2b33d966ab93\" (UID: \"316d07de-9d07-4916-a595-2b33d966ab93\") : rpc error: code = NotFound desc = exit status 1"
My guess is that the teardown never completes and therefore the volume cannot be remounted.
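This hypothesis could be checked on the node with something like the following (a sketch; the path is taken from the logs above, and the interpretation of the I/O error is my assumption):
# a stale mount whose backing device is gone typically returns an input/output error on stat
stat /var/lib/kubelet/pods/316d07de-9d07-4916-a595-2b33d966ab93/volumes/kubernetes.io~csi/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe/mount
# check whether the kubelet pod mount is still listed by the kernel
grep 316d07de-9d07-4916-a595-2b33d966ab93 /proc/mounts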
Looked into this again today; the userID error is gone with the canary images (might have been an imagePullPolicy: IfNotPresent thing, fixed that).
I tried to set up this comment as comprehensibly as possible :)
The behaviour still occurs with one of the four PVs I am using at the moment; none of the others show it. I tried to provide all the logs again below. The issue seems similar to: https://github.com/kubernetes/kubernetes/issues/77969
Creation of the pod:
Kubelet:
Sep 05 13:02:27 node06 kubelet[907]: I0905 13:02:27.591719 907 reconciler.go:203] operationExecutor.VerifyControllerAttachedVolume started for volume "transcode-volume" (UniqueName: "kubernetes.io/empty-dir/6e5ffb93-df89-41cf-98ea-a37a88b37616-transcode-volume") pod "plex-6c58bdbf8-8nldz" (UID: "6e5ffb93-df89-41cf-98ea-a37a88b37616")
Sep 05 13:02:27 node06 kubelet[907]: I0905 13:02:27.692701 907 reconciler.go:203] operationExecutor.VerifyControllerAttachedVolume started for volume "plex-tvshows-nfs-storage" (UniqueName: "kubernetes.io/nfs/6e5ffb93-df89-41cf-98ea-a37a88b37616-plex-tvshows-nfs-storage") pod "plex-6c58bdbf8-8nldz" (UID: "6e5ffb93-df89-41cf-98ea-a37a88b37616")
Sep 05 13:02:27 node06 kubelet[907]: I0905 13:02:27.692768 907 operation_generator.go:629] MountVolume.WaitForAttach entering for volume "pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe" (UniqueName: "kubernetes.io/csi/rbd.csi.ceph.com^0001-0024-9d9cc8f6-0843-46c3-8ca3-0309dff9978d-0000000000000003-28a66c73-bab2-11e9-903a-122b397f0010") pod "plex-6c58bdbf8-8nldz" (UID: "6e5ffb93-df89-41cf-98ea-a37a88b37616") DevicePath "csi-2d34fea5e072b990cd5a1406f57a97dad6d3ce62d03939b33bbc7a184543ea20"
Sep 05 13:02:27 node06 kubelet[907]: I0905 13:02:27.692832 907 reconciler.go:203] operationExecutor.VerifyControllerAttachedVolume started for volume "plex-movies-nfs-storage" (UniqueName: "kubernetes.io/nfs/6e5ffb93-df89-41cf-98ea-a37a88b37616-plex-movies-nfs-storage") pod "plex-6c58bdbf8-8nldz" (UID: "6e5ffb93-df89-41cf-98ea-a37a88b37616")
Sep 05 13:02:27 node06 kubelet[907]: I0905 13:02:27.692889 907 reconciler.go:203] operationExecutor.VerifyControllerAttachedVolume started for volume "plex-token-drdxb" (UniqueName: "kubernetes.io/secret/6e5ffb93-df89-41cf-98ea-a37a88b37616-plex-token-drdxb") pod "plex-6c58bdbf8-8nldz" (UID: "6e5ffb93-df89-41cf-98ea-a37a88b37616")
Sep 05 13:02:27 node06 kubelet[907]: I0905 13:02:27.698568 907 operation_generator.go:638] MountVolume.WaitForAttach succeeded for volume "pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe" (UniqueName: "kubernetes.io/csi/rbd.csi.ceph.com^0001-0024-9d9cc8f6-0843-46c3-8ca3-0309dff9978d-0000000000000003-28a66c73-bab2-11e9-903a-122b397f0010") pod "plex-6c58bdbf8-8nldz" (UID: "6e5ffb93-df89-41cf-98ea-a37a88b37616") DevicePath "csi-2d34fea5e072b990cd5a1406f57a97dad6d3ce62d03939b33bbc7a184543ea20"
Sep 05 13:02:27 node06 kubelet[907]: E0905 13:02:27.698995 907 csi_mounter.go:422] kubernetes.io/csi: isDirMounted IsLikelyNotMountPoint test failed for dir [/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe/globalmount]
Sep 05 13:02:27 node06 kubelet[907]: E0905 13:02:27.699070 907 csi_attacher.go:296] kubernetes.io/csi: attacher.MountDevice failed while checking mount status for dir [/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe/globalmount]
Sep 05 13:02:27 node06 kubelet[907]: E0905 13:02:27.699218 907 nestedpendingoperations.go:270] Operation for "\"kubernetes.io/csi/rbd.csi.ceph.com^0001-0024-9d9cc8f6-0843-46c3-8ca3-0309dff9978d-0000000000000003-28a66c73-bab2-11e9-903a-122b397f0010\"" failed. No retries permitted until 2019-09-05 13:02:28.1991703 +0000 UTC m=+2085422.800331744 (durationBeforeRetry 500ms). Error: "MountVolume.MountDevice failed for volume \"pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe\" (UniqueName: \"kubernetes.io/csi/rbd.csi.ceph.com^0001-0024-9d9cc8f6-0843-46c3-8ca3-0309dff9978d-0000000000000003-28a66c73-bab2-11e9-903a-122b397f0010\") pod \"plex-6c58bdbf8-8nldz\" (UID: \"6e5ffb93-df89-41cf-98ea-a37a88b37616\") : stat /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe/globalmount: input/output error"
Sep 05 13:02:27 node06 kubelet[907]: I0905 13:02:27.793588 907 reconciler.go:177] operationExecutor.UnmountVolume started for volume "plex-config-persistent-storage" (UniqueName: "kubernetes.io/csi/rbd.csi.ceph.com^0001-0024-9d9cc8f6-0843-46c3-8ca3-0309dff9978d-0000000000000003-28a66c73-bab2-11e9-903a-122b397f0010") pod "316d07de-9d07-4916-a595-2b33d966ab93" (UID: "316d07de-9d07-4916-a595-2b33d966ab93")
Sep 05 13:02:27 node06 kubelet[907]: I0905 13:02:27.793902 907 clientconn.go:440] parsed scheme: ""
Sep 05 13:02:27 node06 kubelet[907]: I0905 13:02:27.793949 907 clientconn.go:440] scheme "" not registered, fallback to default scheme
Sep 05 13:02:27 node06 kubelet[907]: I0905 13:02:27.794053 907 asm_amd64.s:1337] ccResolverWrapper: sending new addresses to cc: [{/var/lib/kubelet/plugins/rbd.csi.ceph.com/csi.sock 0 <nil>}]
Sep 05 13:02:27 node06 kubelet[907]: I0905 13:02:27.794101 907 clientconn.go:796] ClientConn switching balancer to "pick_first"
Sep 05 13:02:27 node06 kubelet[907]: I0905 13:02:27.794190 907 balancer_conn_wrappers.go:131] pickfirstBalancer: HandleSubConnStateChange: 0xc001ad67d0, CONNECTING
Sep 05 13:02:27 node06 kubelet[907]: I0905 13:02:27.794582 907 balancer_conn_wrappers.go:131] pickfirstBalancer: HandleSubConnStateChange: 0xc001ad67d0, READY
Sep 05 13:02:27 node06 kubelet[907]: E0905 13:02:27.799025 907 csi_mounter.go:366] kubernetes.io/csi: mounter.TearDownAt failed: rpc error: code = NotFound desc = exit status 1
Sep 05 13:02:27 node06 kubelet[907]: E0905 13:02:27.799147 907 nestedpendingoperations.go:270] Operation for "\"kubernetes.io/csi/rbd.csi.ceph.com^0001-0024-9d9cc8f6-0843-46c3-8ca3-0309dff9978d-0000000000000003-28a66c73-bab2-11e9-903a-122b397f0010\" (\"316d07de-9d07-4916-a595-2b33d966ab93\")" failed. No retries permitted until 2019-09-05 13:02:28.299100437 +0000 UTC m=+2085422.900261976 (durationBeforeRetry 500ms). Error: "UnmountVolume.TearDown failed for volume \"plex-config-persistent-storage\" (UniqueName: \"kubernetes.io/csi/rbd.csi.ceph.com^0001-0024-9d9cc8f6-0843-46c3-8ca3-0309dff9978d-0000000000000003-28a66c73-bab2-11e9-903a-122b397f0010\") pod \"316d07de-9d07-4916-a595-2b33d966ab93\" (UID: \"316d07de-9d07-4916-a595-2b33d966ab93\") : rpc error: code = NotFound desc = exit status 1"
Sep 05 13:02:27 node06 kubelet[907]: I0905 13:02:27.995664 907 operation_generator.go:629] MountVolume.WaitForAttach entering for volume "pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe" (UniqueName: "kubernetes.io/csi/rbd.csi.ceph.com^0001-0024-9d9cc8f6-0843-46c3-8ca3-0309dff9978d-0000000000000003-28a66c73-bab2-11e9-903a-122b397f0010") pod "plex-6c58bdbf8-8nldz" (UID: "6e5ffb93-df89-41cf-98ea-a37a88b37616") DevicePath "csi-2d34fea5e072b990cd5a1406f57a97dad6d3ce62d03939b33bbc7a184543ea20"
Sep 05 13:02:28 node06 kubelet[907]: I0905 13:02:28.000045 907 operation_generator.go:638] MountVolume.WaitForAttach succeeded for volume "pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe" (UniqueName: "kubernetes.io/csi/rbd.csi.ceph.com^0001-0024-9d9cc8f6-0843-46c3-8ca3-0309dff9978d-0000000000000003-28a66c73-bab2-11e9-903a-122b397f0010") pod "plex-6c58bdbf8-8nldz" (UID: "6e5ffb93-df89-41cf-98ea-a37a88b37616") DevicePath "csi-2d34fea5e072b990cd5a1406f57a97dad6d3ce62d03939b33bbc7a184543ea20"
Sep 05 13:02:28 node06 kubelet[907]: E0905 13:02:28.000298 907 csi_mounter.go:422] kubernetes.io/csi: isDirMounted IsLikelyNotMountPoint test failed for dir [/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe/globalmount]
Sep 05 13:02:28 node06 kubelet[907]: E0905 13:02:28.000331 907 csi_attacher.go:296] kubernetes.io/csi: attacher.MountDevice failed while checking mount status for dir [/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe/globalmount]
Sep 05 13:02:28 node06 kubelet[907]: E0905 13:02:28.000462 907 nestedpendingoperations.go:270] Operation for "\"kubernetes.io/csi/rbd.csi.ceph.com^0001-0024-9d9cc8f6-0843-46c3-8ca3-0309dff9978d-0000000000000003-28a66c73-bab2-11e9-903a-122b397f0010\"" failed. No retries permitted until 2019-09-05 13:02:28.500419113 +0000 UTC m=+2085423.101580520 (durationBeforeRetry 500ms). Error: "MountVolume.MountDevice failed for volume \"pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe\" (UniqueName: \"kubernetes.io/csi/rbd.csi.ceph.com^0001-0024-9d9cc8f6-0843-46c3-8ca3-0309dff9978d-0000000000000003-28a66c73-bab2-11e9-903a-122b397f0010\") pod \"plex-6c58bdbf8-8nldz\" (UID: \"6e5ffb93-df89-41cf-98ea-a37a88b37616\") : stat /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe/globalmount: input/output error"
Nodeplugin:
I0905 13:14:24.289697 23189 nsenter.go:132] Running nsenter command: nsenter [--mount=/rootfs/proc/1/ns/mnt -- /usr/bin/realpath -e /var/lib/kubelet/pods/316d07de-9d07-4916-a595-2b33d966ab93/volumes/kubernetes.io~csi/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe/mount]
I0905 13:14:24.293507 23189 nsenter.go:194] failed to resolve symbolic links on /var/lib/kubelet/pods/316d07de-9d07-4916-a595-2b33d966ab93/volumes/kubernetes.io~csi/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe/mount: exit status 1
E0905 13:14:24.293624 23189 utils.go:123] ID: 5850 GRPC error: rpc error: code = NotFound desc = exit status 1
Deletion of the pod:
Sep 05 13:04:07 node06 kubelet[907]: E0905 13:04:07.458101 907 nestedpendingoperations.go:270] Operation for "\"kubernetes.io/csi/rbd.csi.ceph.com^0001-0024-9d9cc8f6-0843-46c3-8ca3-0309dff9978d-0000000000000003-28a66c73-bab2-11e9-903a-122b397f0010\" (\"316d07de-9d07-4916-a595-2b33d966ab93\")" failed. No retries permitted until 2019-09-05 13:04:07.958054033 +0000 UTC m=+2085522.559215617 (durationBeforeRetry 500ms). Error: "UnmountVolume.TearDown failed for volume \"plex-config-persistent-storage\" (UniqueName: \"kubernetes.io/csi/rbd.csi.ceph.com^0001-0024-9d9cc8f6-0843-46c3-8ca3-0309dff9978d-0000000000000003-28a66c73-bab2-11e9-903a-122b397f0010\") pod \"316d07de-9d07-4916-a595-2b33d966ab93\" (UID: \"316d07de-9d07-4916-a595-2b33d966ab93\") : rpc error: code = NotFound desc = exit status 1"
Sep 05 13:04:07 node06 kubelet[907]: I0905 13:04:07.654684 907 operation_generator.go:629] MountVolume.WaitForAttach entering for volume "pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe" (UniqueName: "kubernetes.io/csi/rbd.csi.ceph.com^0001-0024-9d9cc8f6-0843-46c3-8ca3-0309dff9978d-0000000000000003-28a66c73-bab2-11e9-903a-122b397f0010") pod "plex-6c58bdbf8-8nldz" (UID: "6e5ffb93-df89-41cf-98ea-a37a88b37616") DevicePath "csi-2d34fea5e072b990cd5a1406f57a97dad6d3ce62d03939b33bbc7a184543ea20"
Sep 05 13:04:07 node06 kubelet[907]: I0905 13:04:07.659413 907 operation_generator.go:638] MountVolume.WaitForAttach succeeded for volume "pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe" (UniqueName: "kubernetes.io/csi/rbd.csi.ceph.com^0001-0024-9d9cc8f6-0843-46c3-8ca3-0309dff9978d-0000000000000003-28a66c73-bab2-11e9-903a-122b397f0010") pod "plex-6c58bdbf8-8nldz" (UID: "6e5ffb93-df89-41cf-98ea-a37a88b37616") DevicePath "csi-2d34fea5e072b990cd5a1406f57a97dad6d3ce62d03939b33bbc7a184543ea20"
Sep 05 13:04:07 node06 kubelet[907]: E0905 13:04:07.659689 907 csi_mounter.go:422] kubernetes.io/csi: isDirMounted IsLikelyNotMountPoint test failed for dir [/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe/globalmount]
Sep 05 13:04:07 node06 kubelet[907]: E0905 13:04:07.659723 907 csi_attacher.go:296] kubernetes.io/csi: attacher.MountDevice failed while checking mount status for dir [/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe/globalmount]
Sep 05 13:04:07 node06 kubelet[907]: E0905 13:04:07.659858 907 nestedpendingoperations.go:270] Operation for "\"kubernetes.io/csi/rbd.csi.ceph.com^0001-0024-9d9cc8f6-0843-46c3-8ca3-0309dff9978d-0000000000000003-28a66c73-bab2-11e9-903a-122b397f0010\"" failed. No retries permitted until 2019-09-05 13:04:08.15981253 +0000 UTC m=+2085522.760973977 (durationBeforeRetry 500ms). Error: "MountVolume.MountDevice failed for volume \"pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe\" (UniqueName: \"kubernetes.io/csi/rbd.csi.ceph.com^0001-0024-9d9cc8f6-0843-46c3-8ca3-0309dff9978d-0000000000000003-28a66c73-bab2-11e9-903a-122b397f0010\") pod \"plex-6c58bdbf8-8nldz\" (UID: \"6e5ffb93-df89-41cf-98ea-a37a88b37616\") : stat /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe/globalmount: input/output error"
Sep 05 13:04:07 node06 kubelet[907]: I0905 13:04:07.755389 907 reconciler.go:177] operationExecutor.UnmountVolume started for volume "plex-config-persistent-storage" (UniqueName: "kubernetes.io/csi/rbd.csi.ceph.com^0001-0024-9d9cc8f6-0843-46c3-8ca3-0309dff9978d-0000000000000003-28a66c73-bab2-11e9-903a-122b397f0010") pod "316d07de-9d07-4916-a595-2b33d966ab93" (UID: "316d07de-9d07-4916-a595-2b33d966ab93")
Sep 05 13:04:07 node06 kubelet[907]: I0905 13:04:07.755688 907 clientconn.go:440] parsed scheme: ""
Sep 05 13:04:07 node06 kubelet[907]: I0905 13:04:07.755742 907 clientconn.go:440] scheme "" not registered, fallback to default scheme
Sep 05 13:04:07 node06 kubelet[907]: I0905 13:04:07.755896 907 asm_amd64.s:1337] ccResolverWrapper: sending new addresses to cc: [{/var/lib/kubelet/plugins/rbd.csi.ceph.com/csi.sock 0 <nil>}]
Sep 05 13:04:07 node06 kubelet[907]: I0905 13:04:07.755959 907 clientconn.go:796] ClientConn switching balancer to "pick_first"
Sep 05 13:04:07 node06 kubelet[907]: I0905 13:04:07.756287 907 balancer_conn_wrappers.go:131] pickfirstBalancer: HandleSubConnStateChange: 0xc001c36fd0, CONNECTING
Sep 05 13:04:07 node06 kubelet[907]: I0905 13:04:07.756396 907 clientconn.go:1016] blockingPicker: the picked transport is not ready, loop back to repick
Sep 05 13:04:07 node06 kubelet[907]: I0905 13:04:07.756746 907 balancer_conn_wrappers.go:131] pickfirstBalancer: HandleSubConnStateChange: 0xc001c36fd0, READY
Sep 05 13:04:07 node06 kubelet[907]: E0905 13:04:07.763154 907 csi_mounter.go:366] kubernetes.io/csi: mounter.TearDownAt failed: rpc error: code = NotFound desc = exit status 1
After deletion the kubelet logs are filled with:
Sep 05 13:05:46 node06 kubelet[907]: E0905 13:05:46.745519 907 kubelet_volumes.go:154] Orphaned pod "6e5ffb93-df89-41cf-98ea-a37a88b37616" found, but volume paths are still present on disk : There were a total of 1 errors similar to this. Turn up verbosity to see them.
And indeed the pod's PVC directory is still there, the only one of the volumes left:
root@node06:~# ls -l /var/lib/kubelet/pods/6e5ffb93-df89-41cf-98ea-a37a88b37616/volumes/kubernetes.io~*
/var/lib/kubelet/pods/6e5ffb93-df89-41cf-98ea-a37a88b37616/volumes/kubernetes.io~csi:
total 4
drwxr-x--- 2 root root 4096 Sep 5 13:02 pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe
/var/lib/kubelet/pods/6e5ffb93-df89-41cf-98ea-a37a88b37616/volumes/kubernetes.io~empty-dir:
total 0
/var/lib/kubelet/pods/6e5ffb93-df89-41cf-98ea-a37a88b37616/volumes/kubernetes.io~nfs:
total 0
/var/lib/kubelet/pods/6e5ffb93-df89-41cf-98ea-a37a88b37616/volumes/kubernetes.io~secret:
total 0
But it has no mount:
root@node06:~# ls -l /var/lib/kubelet/pods/6e5ffb93-df89-41cf-98ea-a37a88b37616/volumes/kubernetes.io~csi/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe/
total 4
-rw-r--r-- 1 root root 339 Sep 5 13:04 vol_data.json
The globalmount still exists and throws the input/output error:
root@node06:~# ls -l /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe
ls: cannot access '/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe/globalmount': Input/output error
total 4
d????????? ? ? ? ? ? globalmount
-rw-r--r-- 1 root root 152 Aug 12 09:58 vol_data.json
Found a workaround for anyone with the same problem:
1. lsblk to find the stale nbd device
2. sudo umount /dev/nbd2
3. sudo umount /dev/nbd2 (a second time, in case the device is still mounted at another path)
4. lsblk to see that the device has no mountpoint anymore
5. systemctl restart kubelet (otherwise the missing volume does not seem to be noticed)
6. kubectl delete pod <podname>
This will fix the issue for some time. I will hopefully migrate to ext4 soon and will update this issue to report whether that resolves it :)
Hey @wilmardo, any particular reason not to use krbd? We have not completely tested rbd-nbd yet.
I hope you have the https://github.com/ceph/ceph-csi/pull/648 changes in your templates.
@Madhu-1 Thanks for the pointer! This issue is now "resolved" by using the kernel mounter; it has been running stable for more than a month now.
Should I leave this open, since there for sure is an issue with rbd-nbd?
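Concretely, the change was dropping the rbd-nbd mounter from the StorageClass parameters posted above, so the plugin uses the kernel mounter instead (a sketch of the relevant parameters; that krbd is the default when no mounter is set is my understanding, not something stated elsewhere in this thread):
parameters:
  clusterID: <id>
  pool: kubernetes
  # mounter: rbd-nbd    # removed, so the kernel rbd (krbd) mounter is used
  imageFeatures: layering
  fsType: xfs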
@wilmardo I'm also encountering the same issue, but this time I am using filestore. Any idea how I can execute the same commands with filestore? I do not have real disk devices, I just used directories.
@wilmardo the issue with fuse or nbd here is a known one, and we already have some trackers for it. Considering that, I would like to close this issue.
@vvavepacket I would suggest opening a new issue if you are facing any problem.
Closing this for now.
Describe the bug
Mounting of a volume fails with input/output error:
Environment details
Logs
csi-attacher:
csi-provisioner:
csi-rbdplugin:
kubelet logs
Steps to reproduce
I do not know how to reproduce this, sadly; none of my other volumes experience this behavior. I hope the logs show something :) The volume is still in this state, however: when I delete the pod to get it rescheduled, it still does not work, so I can reproduce it in my cluster over and over again.
I think there is an issue here:
Then it does not release the volume and kubelet starts to complain about orphaned volumes. When the pod is rescheduled on the same node it fails to mount since the previous volume is still attached (?).
Actual results
The pod is stuck on ContainerCreating forever because the volume cannot be mounted.
Expected behavior
The pod to start up with the volume nicely mounted.