ceph / ceph-csi

CSI driver for Ceph
Apache License 2.0

Mounting of RBD volume fails with input/output error #568

Closed: wilmardo closed this issue 4 years ago

wilmardo commented 5 years ago

Describe the bug

Mounting of a volume fails with input/output error:

Events:
  Type     Reason       Age   From               Message
  ----     ------       ----  ----               -------
  Normal   Scheduled    100s  default-scheduler  Successfully assigned streaming/plex-5fbbf6694f-nrtl4 to node05
  Warning  FailedMount  4s    kubelet, node05    MountVolume.MountDevice failed for volume "pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe" : stat /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe/globalmount: input/output error

Environment details

Logs

csi-attacher:

I0823 06:14:10.629412       1 trivial_handler.go:49] Trivial sync[csi-1e0a87deba8645981bc6515847386bbf1de391387f9f7626c9f99411326c94f4] started
I0823 06:14:10.629064       1 trivial_handler.go:49] Trivial sync[csi-fce5695c9222907527b14bb6a5d7c4d8c866838bf7ea740c9f5f2c2d461b1c4f] started
I0823 06:14:10.629086       1 controller.go:205] Started PV processing "pvc-06155c3d-0b8c-44a2-94aa-5b009b87308b"
I0823 06:14:10.629203       1 trivial_handler.go:49] Trivial sync[csi-2ca56e9afe3497aa222c8f4beea21313bc5c51736c6219df6cd23a5e0a294782] started
I0823 06:16:13.852041       1 reflector.go:370] k8s.io/client-go/informers/factory.go:133: Watch close - *v1beta1.VolumeAttachment total 0 items received
I0823 06:19:33.865865       1 reflector.go:370] k8s.io/client-go/informers/factory.go:133: Watch close - *v1.PersistentVolume total 0 items received

csi-provisioner:

I0823 06:18:27.760597       1 leaderelection.go:258] successfully renewed lease storing/rbd-csi-ceph-com
I0823 06:18:32.775979       1 leaderelection.go:258] successfully renewed lease storing/rbd-csi-ceph-com
I0823 06:18:37.789708       1 leaderelection.go:258] successfully renewed lease storing/rbd-csi-ceph-com
I0823 06:18:42.803050       1 leaderelection.go:258] successfully renewed lease storing/rbd-csi-ceph-com
I0823 06:18:47.819521       1 leaderelection.go:258] successfully renewed lease storing/rbd-csi-ceph-com
I0823 06:18:52.838370       1 leaderelection.go:258] successfully renewed lease storing/rbd-csi-ceph-com
I0823 06:18:57.851769       1 leaderelection.go:258] successfully renewed lease storing/rbd-csi-ceph-com
I0823 06:19:02.870969       1 leaderelection.go:258] successfully renewed lease storing/rbd-csi-ceph-com
I0823 06:19:07.888328       1 leaderelection.go:258] successfully renewed lease storing/rbd-csi-ceph-com
I0823 06:19:12.904024       1 leaderelection.go:258] successfully renewed lease storing/rbd-csi-ceph-com
I0823 06:19:17.920981       1 leaderelection.go:258] successfully renewed lease storing/rbd-csi-ceph-com
I0823 06:19:21.333127       1 reflector.go:370] sigs.k8s.io/sig-storage-lib-external-provisioner/controller/controller.go:803: Watch close - *v1.PersistentVolume total 0 items received
I0823 06:19:22.941350       1 leaderelection.go:258] successfully renewed lease storing/rbd-csi-ceph-com
I0823 06:19:27.955348       1 leaderelection.go:258] successfully renewed lease storing/rbd-csi-ceph-com
I0823 06:19:32.976616       1 leaderelection.go:258] successfully renewed lease storing/rbd-csi-ceph-com
I0823 06:19:37.993064       1 leaderelection.go:258] successfully renewed lease storing/rbd-csi-ceph-com
I0823 06:19:43.010305       1 leaderelection.go:258] successfully renewed lease storing/rbd-csi-ceph-com
I0823 06:19:48.024349       1 leaderelection.go:258] successfully renewed lease storing/rbd-csi-ceph-com
I0823 06:19:53.039748       1 leaderelection.go:258] successfully renewed lease storing/rbd-csi-ceph-com
I0823 06:19:58.057264       1 leaderelection.go:258] successfully renewed lease storing/rbd-csi-ceph-com
I0823 06:20:03.070592       1 leaderelection.go:258] successfully renewed lease storing/rbd-csi-ceph-com
I0823 06:20:08.093751       1 leaderelection.go:258] successfully renewed lease storing/rbd-csi-ceph-com
I0823 06:20:13.108916       1 leaderelection.go:258] successfully renewed lease storing/rbd-csi-ceph-com
I0823 06:20:14.334928       1 reflector.go:370] sigs.k8s.io/sig-storage-lib-external-provisioner/controller/controller.go:800: Watch close - *v1.PersistentVolumeClaim total 0 items received
I0823 06:20:18.123697       1 leaderelection.go:258] successfully renewed lease storing/rbd-csi-ceph-com
I0823 06:20:23.137800       1 leaderelection.go:258] successfully renewed lease storing/rbd-csi-ceph-com
I0823 06:20:28.155980       1 leaderelection.go:258] successfully renewed lease storing/rbd-csi-ceph-com
I0823 06:20:33.169902       1 leaderelection.go:258] successfully renewed lease storing/rbd-csi-ceph-com
I0823 06:20:38.186149       1 leaderelection.go:258] successfully renewed lease storing/rbd-csi-ceph-com
I0823 06:20:42.425644       1 controller.go:1308] delete "pvc-d2cd7ce3-f9d7-4828-a85b-b65f00f64c10": started
I0823 06:20:42.440631       1 connection.go:180] GRPC call: /csi.v1.Controller/DeleteVolume
I0823 06:20:42.440684       1 connection.go:181] GRPC request: {"secrets":"***stripped***","volume_id":"0001-0024-9d9cc8f6-0843-46c3-8ca3-0309dff9978d-0000000000000003-edc5a3ec-c26b-11e9-9749-1ae57eb3c3fd"}
I0823 06:20:43.199552       1 leaderelection.go:258] successfully renewed lease storing/rbd-csi-ceph-com
I0823 06:20:43.405600       1 connection.go:183] GRPC response: {}
I0823 06:20:43.406349       1 connection.go:184] GRPC error: rpc error: code = Internal desc = rbd csi-vol-edc5a3ec-c26b-11e9-9749-1ae57eb3c3fd is still being used
E0823 06:20:43.406388       1 controller.go:1331] delete "pvc-d2cd7ce3-f9d7-4828-a85b-b65f00f64c10": volume deletion failed: rpc error: code = Internal desc = rbd csi-vol-edc5a3ec-c26b-11e9-9749-1ae57eb3c3fd is still being used
W0823 06:20:43.406473       1 controller.go:934] Retrying syncing volume "pvc-d2cd7ce3-f9d7-4828-a85b-b65f00f64c10", failure 145
E0823 06:20:43.406536       1 controller.go:952] error syncing volume "pvc-d2cd7ce3-f9d7-4828-a85b-b65f00f64c10": rpc error: code = Internal desc = rbd csi-vol-edc5a3ec-c26b-11e9-9749-1ae57eb3c3fd is still being used
I0823 06:20:43.406726       1 event.go:209] Event(v1.ObjectReference{Kind:"PersistentVolume", Namespace:"", Name:"pvc-d2cd7ce3-f9d7-4828-a85b-b65f00f64c10", UID:"b80891b4-d582-4f23-af20-447ec6342947", APIVersion:"v1", ResourceVersion:"2971666", FieldPath:""}): type: 'Warning' reason: 'VolumeFailedDelete' rpc error: code = Internal desc = rbd csi-vol-edc5a3ec-c26b-11e9-9749-1ae57eb3c3fd is still being used
I0823 06:20:48.212936       1 leaderelection.go:258] successfully renewed lease storing/rbd-csi-ceph-com
I0823 06:20:53.226800       1 leaderelection.go:258] successfully renewed lease storing/rbd-csi-ceph-com
I0823 06:20:58.240116       1 leaderelection.go:258] successfully renewed lease storing/rbd-csi-ceph-com
I0823 06:21:03.253608       1 leaderelection.go:258] successfully renewed lease storing/rbd-csi-ceph-com
I0823 06:21:08.265918       1 leaderelection.go:258] successfully renewed lease storing/rbd-csi-ceph-com
I0823 06:21:13.280058       1 leaderelection.go:258] successfully renewed lease storing/rbd-csi-ceph-com
I0823 06:21:18.293271       1 leaderelection.go:258] successfully renewed lease storing/rbd-csi-ceph-com
I0823 06:21:23.307719       1 leaderelection.go:258] successfully renewed lease storing/rbd-csi-ceph-com
I0823 06:21:28.321203       1 leaderelection.go:258] successfully renewed lease storing/rbd-csi-ceph-com
I0823 06:21:33.431265       1 leaderelection.go:258] successfully renewed lease storing/rbd-csi-ceph-com

csi-rbdplugin:

I0823 06:20:42.442046       1 utils.go:105] GRPC call: /csi.v1.Controller/DeleteVolume
I0823 06:20:42.442117       1 utils.go:106] GRPC request: {"secrets":"***stripped***","volume_id":"0001-0024-9d9cc8f6-0843-46c3-8ca3-0309dff9978d-0000000000000003-edc5a3ec-c26b-11e9-9749-1ae57eb3c3fd"}
I0823 06:20:43.290203       1 controllerserver.go:337] deleting image csi-vol-edc5a3ec-c26b-11e9-9749-1ae57eb3c3fd
I0823 06:20:43.290269       1 rbd_util.go:140] rbd: status csi-vol-edc5a3ec-c26b-11e9-9749-1ae57eb3c3fd using mon 192.168.2.100:6789, pool kubernetes
I0823 06:20:43.404993       1 rbd_util.go:159] rbd: watchers on csi-vol-edc5a3ec-c26b-11e9-9749-1ae57eb3c3fd: 2019-08-23 06:20:43.351 7f179691bb00 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.kubernetes.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,: (2) No such file or directory
2019-08-23 06:20:43.355 7f179691bb00 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.kubernetes.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,: (2) No such file or directory
2019-08-23 06:20:43.355 7f179691bb00 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.kubernetes.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,: (2) No such file or directory
2019-08-23 06:20:43.363 7f179691bb00 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.kubernetes.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,: (2) No such file or directory
2019-08-23 06:20:43.363 7f179691bb00 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.kubernetes.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,: (2) No such file or directory
2019-08-23 06:20:43.363 7f179691bb00 -1 auth: unable to find a keyring on /etc/ceph/ceph.client.kubernetes.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,: (2) No such file or directory
Watchers:
        watcher=192.168.2.155:0/2851647683 client.3215876 cookie=139784542502688
I0823 06:20:43.405066       1 rbd_util.go:176] rbd is still being used csi-vol-edc5a3ec-c26b-11e9-9749-1ae57eb3c3fd
E0823 06:20:43.405121       1 controllerserver.go:339] failed to delete rbd image: kubernetes/csi-vol-edc5a3ec-c26b-11e9-9749-1ae57eb3c3fd with error: rbd csi-vol-edc5a3ec-c26b-11e9-9749-1ae57eb3c3fd is still being used
E0823 06:20:43.405149       1 utils.go:109] GRPC error: rpc error: code = Internal desc = rbd csi-vol-edc5a3ec-c26b-11e9-9749-1ae57eb3c3fd is still being used

kubelet logs

Aug 23 06:42:02 node05 kubelet[924]: E0823 06:42:02.531869     924 kubelet_volumes.go:154] Orphaned pod "953ad598-9230-42a6-a571-943ec9f24dcb" found, but volume paths are still present on disk : There were a total of 6 errors similar to this. Turn up verbosity to see them.
Aug 23 06:42:04 node05 kubelet[924]: E0823 06:42:04.525745     924 kubelet_volumes.go:154] Orphaned pod "953ad598-9230-42a6-a571-943ec9f24dcb" found, but volume paths are still present on disk : There were a total of 6 errors similar to this. Turn up verbosity to see them.
Aug 23 06:42:06 node05 kubelet[924]: E0823 06:42:06.534544     924 kubelet_volumes.go:154] Orphaned pod "953ad598-9230-42a6-a571-943ec9f24dcb" found, but volume paths are still present on disk : There were a total of 6 errors similar to this. Turn up verbosity to see them.
Aug 23 06:42:08 node05 kubelet[924]: E0823 06:42:08.524713     924 kubelet_volumes.go:154] Orphaned pod "953ad598-9230-42a6-a571-943ec9f24dcb" found, but volume paths are still present on disk : There were a total of 6 errors similar to this. Turn up verbosity to see them.
Aug 23 06:42:10 node05 kubelet[924]: E0823 06:42:10.526565     924 kubelet_volumes.go:154] Orphaned pod "953ad598-9230-42a6-a571-943ec9f24dcb" found, but volume paths are still present on disk : There were a total of 6 errors similar to this. Turn up verbosity to see them.
Aug 23 06:42:12 node05 kubelet[924]: E0823 06:42:12.530299     924 kubelet_volumes.go:154] Orphaned pod "953ad598-9230-42a6-a571-943ec9f24dcb" found, but volume paths are still present on disk : There were a total of 6 errors similar to this. Turn up verbosity to see them.
Aug 23 06:42:14 node05 kubelet[924]: E0823 06:42:14.528843     924 kubelet_volumes.go:154] Orphaned pod "953ad598-9230-42a6-a571-943ec9f24dcb" found, but volume paths are still present on disk : There were a total of 6 errors similar to this. Turn up verbosity to see them.
Aug 23 06:42:16 node05 kubelet[924]: E0823 06:42:16.532023     924 kubelet_volumes.go:154] Orphaned pod "953ad598-9230-42a6-a571-943ec9f24dcb" found, but volume paths are still present on disk : There were a total of 6 errors similar to this. Turn up verbosity to see them.
Aug 23 06:42:18 node05 kubelet[924]: E0823 06:42:18.530003     924 kubelet_volumes.go:154] Orphaned pod "953ad598-9230-42a6-a571-943ec9f24dcb" found, but volume paths are still present on disk : There were a total of 6 errors similar to this. Turn up verbosity to see them.
Aug 23 06:42:20 node05 kubelet[924]: E0823 06:42:20.530837     924 kubelet_volumes.go:154] Orphaned pod "953ad598-9230-42a6-a571-943ec9f24dcb" found, but volume paths are still present on disk : There were a total of 6 errors similar to this. Turn up verbosity to see them.
Aug 23 06:42:22 node05 kubelet[924]: I0823 06:42:22.260825     924 reconciler.go:177] operationExecutor.UnmountVolume started for volume "plex-token-fh9kp" (UniqueName: "kubernetes.io/secret/11dad472-e834-4114-be5f-91923ba01370-plex-token-fh9kp") pod "11dad472-e834-4114-be5f-91923ba01370" (UID: "11dad472-e834-4114-be5f-91923ba01370")
Aug 23 06:42:22 node05 kubelet[924]: I0823 06:42:22.260954     924 reconciler.go:177] operationExecutor.UnmountVolume started for volume "transcode-volume" (UniqueName: "kubernetes.io/empty-dir/11dad472-e834-4114-be5f-91923ba01370-transcode-volume") pod "11dad472-e834-4114-be5f-91923ba01370" (UID: "11dad472-e834-4114-be5f-91923ba01370")
Aug 23 06:42:22 node05 kubelet[924]: I0823 06:42:22.261032     924 reconciler.go:177] operationExecutor.UnmountVolume started for volume "movies-nfs-storage" (UniqueName: "kubernetes.io/nfs/11dad472-e834-4114-be5f-91923ba01370-plex-movies-nfs-storage") pod "11dad472-e834-4114-be5f-91923ba01370" (UID: "11dad472-e834-4114-be5f-91923ba01370")
Aug 23 06:42:22 node05 kubelet[924]: I0823 06:42:22.261113     924 reconciler.go:177] operationExecutor.UnmountVolume started for volume "tvshows-nfs-storage" (UniqueName: "kubernetes.io/nfs/11dad472-e834-4114-be5f-91923ba01370-plex-tvshows-nfs-storage") pod "11dad472-e834-4114-be5f-91923ba01370" (UID: "11dad472-e834-4114-be5f-91923ba01370")
Aug 23 06:42:22 node05 kubelet[924]: W0823 06:42:22.261240     924 empty_dir.go:421] Warning: Failed to clear quota on /var/lib/kubelet/pods/11dad472-e834-4114-be5f-91923ba01370/volumes/kubernetes.io~empty-dir/transcode-volume: ClearQuota called, but quotas disabled
Aug 23 06:42:22 node05 kubelet[924]: I0823 06:42:22.261408     924 operation_generator.go:860] UnmountVolume.TearDown succeeded for volume "kubernetes.io/empty-dir/11dad472-e834-4114-be5f-91923ba01370-transcode-volume" (OuterVolumeSpecName: "transcode-volume") pod "11dad472-e834-4114-be5f-91923ba01370" (UID: "11dad472-e834-4114-be5f-91923ba01370"). InnerVolumeSpecName "transcode-volume". PluginName "kubernetes.io/empty-dir", VolumeGidValue ""
Aug 23 06:42:22 node05 kubelet[924]: I0823 06:42:22.289135     924 operation_generator.go:860] UnmountVolume.TearDown succeeded for volume "kubernetes.io/secret/11dad472-e834-4114-be5f-91923ba01370-plex-token-fh9kp" (OuterVolumeSpecName: "plex-token-fh9kp") pod "11dad472-e834-4114-be5f-91923ba01370" (UID: "11dad472-e834-4114-be5f-91923ba01370"). InnerVolumeSpecName "plex-token-fh9kp". PluginName "kubernetes.io/secret", VolumeGidValue ""
Aug 23 06:42:22 node05 kubelet[924]: I0823 06:42:22.289918     924 operation_generator.go:860] UnmountVolume.TearDown succeeded for volume "kubernetes.io/nfs/11dad472-e834-4114-be5f-91923ba01370-plex-tvshows-nfs-storage" (OuterVolumeSpecName: "tvshows-nfs-storage") pod "11dad472-e834-4114-be5f-91923ba01370" (UID: "11dad472-e834-4114-be5f-91923ba01370"). InnerVolumeSpecName "plex-tvshows-nfs-storage". PluginName "kubernetes.io/nfs", VolumeGidValue ""
Aug 23 06:42:22 node05 kubelet[924]: I0823 06:42:22.329993     924 operation_generator.go:860] UnmountVolume.TearDown succeeded for volume "kubernetes.io/nfs/11dad472-e834-4114-be5f-91923ba01370-plex-movies-nfs-storage" (OuterVolumeSpecName: "movies-nfs-storage") pod "11dad472-e834-4114-be5f-91923ba01370" (UID: "11dad472-e834-4114-be5f-91923ba01370"). InnerVolumeSpecName "plex-movies-nfs-storage". PluginName "kubernetes.io/nfs", VolumeGidValue ""
Aug 23 06:42:22 node05 kubelet[924]: I0823 06:42:22.375043     924 reconciler.go:203] operationExecutor.VerifyControllerAttachedVolume started for volume "transcode-volume" (UniqueName: "kubernetes.io/empty-dir/9b567d3f-3658-49dc-914f-bc49330885d5-transcode-volume") pod "plex-5fbbf6694f-4zxvb" (UID: "9b567d3f-3658-49dc-914f-bc49330885d5")
Aug 23 06:42:22 node05 kubelet[924]: I0823 06:42:22.375167     924 reconciler.go:297] Volume detached for volume "plex-tvshows-nfs-storage" (UniqueName: "kubernetes.io/nfs/11dad472-e834-4114-be5f-91923ba01370-plex-tvshows-nfs-storage") on node "node05" DevicePath ""
Aug 23 06:42:22 node05 kubelet[924]: I0823 06:42:22.375325     924 reconciler.go:297] Volume detached for volume "plex-token-fh9kp" (UniqueName: "kubernetes.io/secret/11dad472-e834-4114-be5f-91923ba01370-plex-token-fh9kp") on node "node05" DevicePath ""
Aug 23 06:42:22 node05 kubelet[924]: I0823 06:42:22.375383     924 reconciler.go:297] Volume detached for volume "transcode-volume" (UniqueName: "kubernetes.io/empty-dir/11dad472-e834-4114-be5f-91923ba01370-transcode-volume") on node "node05" DevicePath ""
Aug 23 06:42:22 node05 kubelet[924]: I0823 06:42:22.375421     924 reconciler.go:297] Volume detached for volume "plex-movies-nfs-storage" (UniqueName: "kubernetes.io/nfs/11dad472-e834-4114-be5f-91923ba01370-plex-movies-nfs-storage") on node "node05" DevicePath ""
Aug 23 06:42:22 node05 kubelet[924]: I0823 06:42:22.476205     924 reconciler.go:203] operationExecutor.VerifyControllerAttachedVolume started for volume "plex-movies-nfs-storage" (UniqueName: "kubernetes.io/nfs/9b567d3f-3658-49dc-914f-bc49330885d5-plex-movies-nfs-storage") pod "plex-5fbbf6694f-4zxvb" (UID: "9b567d3f-3658-49dc-914f-bc49330885d5")
Aug 23 06:42:22 node05 kubelet[924]: I0823 06:42:22.476332     924 reconciler.go:203] operationExecutor.VerifyControllerAttachedVolume started for volume "plex-tvshows-nfs-storage" (UniqueName: "kubernetes.io/nfs/9b567d3f-3658-49dc-914f-bc49330885d5-plex-tvshows-nfs-storage") pod "plex-5fbbf6694f-4zxvb" (UID: "9b567d3f-3658-49dc-914f-bc49330885d5")
Aug 23 06:42:22 node05 kubelet[924]: I0823 06:42:22.476425     924 reconciler.go:203] operationExecutor.VerifyControllerAttachedVolume started for volume "plex-token-fh9kp" (UniqueName: "kubernetes.io/secret/9b567d3f-3658-49dc-914f-bc49330885d5-plex-token-fh9kp") pod "plex-5fbbf6694f-4zxvb" (UID: "9b567d3f-3658-49dc-914f-bc49330885d5")
Aug 23 06:42:22 node05 kubelet[924]: E0823 06:42:22.530987     924 kubelet_volumes.go:154] Orphaned pod "953ad598-9230-42a6-a571-943ec9f24dcb" found, but volume paths are still present on disk : There were a total of 6 errors similar to this. Turn up verbosity to see them.
Aug 23 06:42:24 node05 kubelet[924]: E0823 06:42:24.535019     924 kubelet_volumes.go:154] Orphaned pod "953ad598-9230-42a6-a571-943ec9f24dcb" found, but volume paths are still present on disk : There were a total of 6 errors similar to this. Turn up verbosity to see them.
Aug 23 06:42:26 node05 kubelet[924]: E0823 06:42:26.532067     924 kubelet_volumes.go:154] Orphaned pod "953ad598-9230-42a6-a571-943ec9f24dcb" found, but volume paths are still present on disk : There were a total of 6 errors similar to this. Turn up verbosity to see them.
Aug 23 06:42:27 node05 kubelet[924]: I0823 06:42:27.005602     924 operation_generator.go:629] MountVolume.WaitForAttach entering for volume "pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe" (UniqueName: "kubernetes.io/csi/rbd.csi.ceph.com^0001-0024-9d9cc8f6-0843-46c3-8ca3-0309dff9978d-0000000000000003-28a66c73-bab2-11e9-903a-122b397f0010") pod "plex-5fbbf6694f-nrtl4" (UID: "11dad472-e834-4114-be5f-91923ba01370") DevicePath "csi-fce5695c9222907527b14bb6a5d7c4d8c866838bf7ea740c9f5f2c2d461b1c4f"
Aug 23 06:42:27 node05 kubelet[924]: I0823 06:42:27.011087     924 operation_generator.go:638] MountVolume.WaitForAttach succeeded for volume "pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe" (UniqueName: "kubernetes.io/csi/rbd.csi.ceph.com^0001-0024-9d9cc8f6-0843-46c3-8ca3-0309dff9978d-0000000000000003-28a66c73-bab2-11e9-903a-122b397f0010") pod "plex-5fbbf6694f-nrtl4" (UID: "11dad472-e834-4114-be5f-91923ba01370") DevicePath "csi-fce5695c9222907527b14bb6a5d7c4d8c866838bf7ea740c9f5f2c2d461b1c4f"
Aug 23 06:42:27 node05 kubelet[924]: E0823 06:42:27.011416     924 csi_mounter.go:422] kubernetes.io/csi: isDirMounted IsLikelyNotMountPoint test failed for dir [/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe/globalmount]
Aug 23 06:42:27 node05 kubelet[924]: E0823 06:42:27.011480     924 csi_attacher.go:296] kubernetes.io/csi: attacher.MountDevice failed while checking mount status for dir [/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe/globalmount]
Aug 23 06:42:27 node05 kubelet[924]: E0823 06:42:27.011617     924 nestedpendingoperations.go:270] Operation for "\"kubernetes.io/csi/rbd.csi.ceph.com^0001-0024-9d9cc8f6-0843-46c3-8ca3-0309dff9978d-0000000000000003-28a66c73-bab2-11e9-903a-122b397f0010\"" failed. No retries permitted until 2019-08-23 06:44:29.011570614 +0000 UTC m=+1371867.633383435 (durationBeforeRetry 2m2s). Error: "MountVolume.MountDevice failed for volume \"pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe\" (UniqueName: \"kubernetes.io/csi/rbd.csi.ceph.com^0001-0024-9d9cc8f6-0843-46c3-8ca3-0309dff9978d-0000000000000003-28a66c73-bab2-11e9-903a-122b397f0010\") pod \"plex-5fbbf6694f-nrtl4\" (UID: \"11dad472-e834-4114-be5f-91923ba01370\") : stat /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe/globalmount: input/output error"
Aug 23 06:42:28 node05 kubelet[924]: E0823 06:42:28.534004     924 kubelet_volumes.go:154] Orphaned pod "953ad598-9230-42a6-a571-943ec9f24dcb" found, but volume paths are still present on disk : There were a total of 6 errors similar to this. Turn up verbosity to see them.
Aug 23 06:42:29 node05 kubelet[924]: I0823 06:42:29.115987     924 operation_generator.go:193] parsed scheme: ""
Aug 23 06:42:29 node05 kubelet[924]: I0823 06:42:29.116065     924 operation_generator.go:193] scheme "" not registered, fallback to default scheme
Aug 23 06:42:29 node05 kubelet[924]: I0823 06:42:29.116163     924 asm_amd64.s:1337] ccResolverWrapper: sending new addresses to cc: [{/var/lib/kubelet/plugins/rbd.csi.ceph.com/csi.sock 0  <nil>}]
Aug 23 06:42:29 node05 kubelet[924]: I0823 06:42:29.116204     924 clientconn.go:796] ClientConn switching balancer to "pick_first"
Aug 23 06:42:29 node05 kubelet[924]: I0823 06:42:29.116342     924 balancer_conn_wrappers.go:131] pickfirstBalancer: HandleSubConnStateChange: 0xc0011b2210, CONNECTING
Aug 23 06:42:29 node05 kubelet[924]: I0823 06:42:29.116736     924 balancer_conn_wrappers.go:131] pickfirstBalancer: HandleSubConnStateChange: 0xc0011b2210, READY
Aug 23 06:42:29 node05 kubelet[924]: E0823 06:42:29.119227     924 goroutinemap.go:150] Operation for "/var/lib/kubelet/plugins/rbd.csi.ceph.com/csi.sock" failed. No retries permitted until 2019-08-23 06:44:31.119170563 +0000 UTC m=+1371869.740985340 (durationBeforeRetry 2m2s). Error: "RegisterPlugin error -- failed to get plugin info using RPC GetInfo at socket /var/lib/kubelet/plugins/rbd.csi.ceph.com/csi.sock, err: rpc error: code = Unimplemented desc = unknown service pluginregistration.Registration"
Aug 23 06:42:30 node05 kubelet[924]: E0823 06:42:30.530197     924 kubelet_volumes.go:154] Orphaned pod "953ad598-9230-42a6-a571-943ec9f24dcb" found, but volume paths are still present on disk : There were a total of 6 errors similar to this. Turn up verbosity to see them.
Aug 23 06:42:32 node05 kubelet[924]: E0823 06:42:32.533030     924 kubelet_volumes.go:154] Orphaned pod "953ad598-9230-42a6-a571-943ec9f24dcb" found, but volume paths are still present on disk : There were a total of 6 errors similar to this. Turn up verbosity to see them.
Aug 23 06:42:34 node05 kubelet[924]: E0823 06:42:34.525588     924 kubelet_volumes.go:154] Orphaned pod "11dad472-e834-4114-be5f-91923ba01370" found, but volume paths are still present on disk : There were a total of 7 errors similar to this. Turn up verbosity to see them.
Aug 23 06:42:35 node05 kubelet[924]: I0823 06:42:35.118675     924 operation_generator.go:193] parsed scheme: ""
Aug 23 06:42:35 node05 kubelet[924]: I0823 06:42:35.118738     924 operation_generator.go:193] scheme "" not registered, fallback to default scheme
Aug 23 06:42:35 node05 kubelet[924]: I0823 06:42:35.118816     924 asm_amd64.s:1337] ccResolverWrapper: sending new addresses to cc: [{/var/lib/kubelet/plugins/cephfs.csi.ceph.com/csi.sock 0  <nil>}]
Aug 23 06:42:35 node05 kubelet[924]: I0823 06:42:35.118847     924 clientconn.go:796] ClientConn switching balancer to "pick_first"
Aug 23 06:42:35 node05 kubelet[924]: I0823 06:42:35.118949     924 balancer_conn_wrappers.go:131] pickfirstBalancer: HandleSubConnStateChange: 0xc00093b150, CONNECTING
Aug 23 06:42:35 node05 kubelet[924]: W0823 06:42:35.119143     924 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins/cephfs.csi.ceph.com/csi.sock 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins/cephfs.csi.ceph.com/csi.sock: connect: connection refused". Reconnecting...
Aug 23 06:42:35 node05 kubelet[924]: I0823 06:42:35.119215     924 balancer_conn_wrappers.go:131] pickfirstBalancer: HandleSubConnStateChange: 0xc00093b150, TRANSIENT_FAILURE
Aug 23 06:42:36 node05 kubelet[924]: I0823 06:42:36.119490     924 balancer_conn_wrappers.go:131] pickfirstBalancer: HandleSubConnStateChange: 0xc00093b150, CONNECTING
Aug 23 06:42:36 node05 kubelet[924]: W0823 06:42:36.119537     924 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins/cephfs.csi.ceph.com/csi.sock 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins/cephfs.csi.ceph.com/csi.sock: connect: connection refused". Reconnecting...
Aug 23 06:42:36 node05 kubelet[924]: I0823 06:42:36.119600     924 balancer_conn_wrappers.go:131] pickfirstBalancer: HandleSubConnStateChange: 0xc00093b150, TRANSIENT_FAILURE
Aug 23 06:42:36 node05 kubelet[924]: E0823 06:42:36.531686     924 kubelet_volumes.go:154] Orphaned pod "11dad472-e834-4114-be5f-91923ba01370" found, but volume paths are still present on disk : There were a total of 7 errors similar to this. Turn up verbosity to see them.
Aug 23 06:42:37 node05 kubelet[924]: I0823 06:42:37.703373     924 balancer_conn_wrappers.go:131] pickfirstBalancer: HandleSubConnStateChange: 0xc00093b150, CONNECTING
Aug 23 06:42:37 node05 kubelet[924]: W0823 06:42:37.703443     924 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins/cephfs.csi.ceph.com/csi.sock 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins/cephfs.csi.ceph.com/csi.sock: connect: connection refused". Reconnecting...
Aug 23 06:42:37 node05 kubelet[924]: I0823 06:42:37.703540     924 balancer_conn_wrappers.go:131] pickfirstBalancer: HandleSubConnStateChange: 0xc00093b150, TRANSIENT_FAILURE
Aug 23 06:42:38 node05 kubelet[924]: E0823 06:42:38.531042     924 kubelet_volumes.go:154] Orphaned pod "11dad472-e834-4114-be5f-91923ba01370" found, but volume paths are still present on disk : There were a total of 7 errors similar to this. Turn up verbosity to see them.
Aug 23 06:42:39 node05 kubelet[924]: I0823 06:42:39.788900     924 balancer_conn_wrappers.go:131] pickfirstBalancer: HandleSubConnStateChange: 0xc00093b150, CONNECTING
Aug 23 06:42:39 node05 kubelet[924]: W0823 06:42:39.788983     924 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins/cephfs.csi.ceph.com/csi.sock 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins/cephfs.csi.ceph.com/csi.sock: connect: connection refused". Reconnecting...
Aug 23 06:42:39 node05 kubelet[924]: I0823 06:42:39.789083     924 balancer_conn_wrappers.go:131] pickfirstBalancer: HandleSubConnStateChange: 0xc00093b150, TRANSIENT_FAILURE
Aug 23 06:42:40 node05 kubelet[924]: E0823 06:42:40.525448     924 kubelet_volumes.go:154] Orphaned pod "11dad472-e834-4114-be5f-91923ba01370" found, but volume paths are still present on disk : There were a total of 7 errors similar to this. Turn up verbosity to see them.
Aug 23 06:42:42 node05 kubelet[924]: E0823 06:42:42.529145     924 kubelet_volumes.go:154] Orphaned pod "11dad472-e834-4114-be5f-91923ba01370" found, but volume paths are still present on disk : There were a total of 7 errors similar to this. Turn up verbosity to see them.
Aug 23 06:42:43 node05 kubelet[924]: W0823 06:42:43.825384     924 clientconn.go:1251] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins/cephfs.csi.ceph.com/csi.sock 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial unix /var/lib/kubelet/plugins/cephfs.csi.ceph.com/csi.sock: connect: connection refused". Reconnecting...
Aug 23 06:42:43 node05 kubelet[924]: I0823 06:42:43.825419     924 balancer_conn_wrappers.go:131] pickfirstBalancer: HandleSubConnStateChange: 0xc00093b150, CONNECTING
Aug 23 06:42:43 node05 kubelet[924]: I0823 06:42:43.825598     924 balancer_conn_wrappers.go:131] pickfirstBalancer: HandleSubConnStateChange: 0xc00093b150, TRANSIENT_FAILURE
Aug 23 06:42:44 node05 kubelet[924]: E0823 06:42:44.531607     924 kubelet_volumes.go:154] Orphaned pod "11dad472-e834-4114-be5f-91923ba01370" found, but volume paths are still present on disk : There were a total of 7 errors similar to this. Turn up verbosity to see them.
Aug 23 06:42:45 node05 kubelet[924]: E0823 06:42:45.119000     924 goroutinemap.go:150] Operation for "/var/lib/kubelet/plugins/cephfs.csi.ceph.com/csi.sock" failed. No retries permitted until 2019-08-23 06:44:47.118949292 +0000 UTC m=+1371885.740762755 (durationBeforeRetry 2m2s). Error: "RegisterPlugin error -- dial failed at socket /var/lib/kubelet/plugins/cephfs.csi.ceph.com/csi.sock, err: failed to dial socket /var/lib/kubelet/plugins/cephfs.csi.ceph.com/csi.sock, err: context deadline exceeded"
Aug 23 06:42:45 node05 kubelet[924]: W0823 06:42:45.119080     924 asm_amd64.s:1337] Failed to dial /var/lib/kubelet/plugins/cephfs.csi.ceph.com/csi.sock: context canceled; please retry.
Aug 23 06:42:46 node05 kubelet[924]: E0823 06:42:46.526835     924 kubelet_volumes.go:154] Orphaned pod "11dad472-e834-4114-be5f-91923ba01370" found, but volume paths are still present on disk : There were a total of 7 errors similar to this. Turn up verbosity to see them.
Aug 23 06:42:47 node05 kubelet[924]: I0823 06:42:47.880359     924 clientconn.go:440] parsed scheme: ""
Aug 23 06:42:47 node05 kubelet[924]: I0823 06:42:47.880543     924 clientconn.go:440] scheme "" not registered, fallback to default scheme
Aug 23 06:42:47 node05 kubelet[924]: I0823 06:42:47.880721     924 asm_amd64.s:1337] ccResolverWrapper: sending new addresses to cc: [{/var/lib/kubelet/plugins/rbd.csi.ceph.com/csi.sock 0  <nil>}]
Aug 23 06:42:47 node05 kubelet[924]: I0823 06:42:47.880778     924 clientconn.go:796] ClientConn switching balancer to "pick_first"
Aug 23 06:42:47 node05 kubelet[924]: I0823 06:42:47.880956     924 balancer_conn_wrappers.go:131] pickfirstBalancer: HandleSubConnStateChange: 0xc00107ed90, CONNECTING
Aug 23 06:42:47 node05 kubelet[924]: I0823 06:42:47.881022     924 clientconn.go:1016] blockingPicker: the picked transport is not ready, loop back to repick
Aug 23 06:42:47 node05 kubelet[924]: I0823 06:42:47.881452     924 balancer_conn_wrappers.go:131] pickfirstBalancer: HandleSubConnStateChange: 0xc00107ed90, READY
Aug 23 06:42:47 node05 kubelet[924]: I0823 06:42:47.889419     924 clientconn.go:440] parsed scheme: ""
Aug 23 06:42:47 node05 kubelet[924]: I0823 06:42:47.889521     924 clientconn.go:440] scheme "" not registered, fallback to default scheme
Aug 23 06:42:47 node05 kubelet[924]: I0823 06:42:47.889620     924 asm_amd64.s:1337] ccResolverWrapper: sending new addresses to cc: [{/var/lib/kubelet/plugins/rbd.csi.ceph.com/csi.sock 0  <nil>}]
Aug 23 06:42:47 node05 kubelet[924]: I0823 06:42:47.889660     924 clientconn.go:796] ClientConn switching balancer to "pick_first"
Aug 23 06:42:47 node05 kubelet[924]: I0823 06:42:47.889794     924 balancer_conn_wrappers.go:131] pickfirstBalancer: HandleSubConnStateChange: 0xc00107f0a0, CONNECTING
Aug 23 06:42:47 node05 kubelet[924]: I0823 06:42:47.889831     924 clientconn.go:1016] blockingPicker: the picked transport is not ready, loop back to repick
Aug 23 06:42:47 node05 kubelet[924]: I0823 06:42:47.890232     924 balancer_conn_wrappers.go:131] pickfirstBalancer: HandleSubConnStateChange: 0xc00107f0a0, READY
Aug 23 06:42:48 node05 kubelet[924]: E0823 06:42:48.529132     924 kubelet_volumes.go:154] Orphaned pod "11dad472-e834-4114-be5f-91923ba01370" found, but volume paths are still present on disk : There were a total of 7 errors similar to this. Turn up verbosity to see them.
Aug 23 06:42:50 node05 kubelet[924]: E0823 06:42:50.528107     924 kubelet_volumes.go:154] Orphaned pod "11dad472-e834-4114-be5f-91923ba01370" found, but volume paths are still present on disk : There were a total of 7 errors similar to this. Turn up verbosity to see them.
Aug 23 06:42:52 node05 kubelet[924]: E0823 06:42:52.531756     924 kubelet_volumes.go:154] Orphaned pod "11dad472-e834-4114-be5f-91923ba01370" found, but volume paths are still present on disk : There were a total of 7 errors similar to this. Turn up verbosity to see them.
Aug 23 06:42:54 node05 kubelet[924]: E0823 06:42:54.529109     924 kubelet_volumes.go:154] Orphaned pod "11dad472-e834-4114-be5f-91923ba01370" found, but volume paths are still present on disk : There were a total of 7 errors similar to this. Turn up verbosity to see them.
Aug 23 06:42:56 node05 kubelet[924]: E0823 06:42:56.528255     924 kubelet_volumes.go:154] Orphaned pod "11dad472-e834-4114-be5f-91923ba01370" found, but volume paths are still present on disk : There were a total of 7 errors similar to this. Turn up verbosity to see them.
Aug 23 06:42:58 node05 kubelet[924]: E0823 06:42:58.534116     924 kubelet_volumes.go:154] Orphaned pod "11dad472-e834-4114-be5f-91923ba01370" found, but volume paths are still present on disk : There were a total of 7 errors similar to this. Turn up verbosity to see them.
Aug 23 06:43:00 node05 kubelet[924]: E0823 06:43:00.525000     924 kubelet_volumes.go:154] Orphaned pod "11dad472-e834-4114-be5f-91923ba01370" found, but volume paths are still present on disk : There were a total of 7 errors similar to this. Turn up verbosity to see them.
Aug 23 06:43:02 node05 kubelet[924]: E0823 06:43:02.514816     924 kubelet.go:1665] Unable to mount volumes for pod "plex-5fbbf6694f-nrtl4_streaming(11dad472-e834-4114-be5f-91923ba01370)": timeout expired waiting for volumes to attach or mount for pod "streaming"/"plex-5fbbf6694f-nrtl4". list of unmounted volumes=[plex-config-persistent-storage transcode-volume tvshows-nfs-storage movies-nfs-storage plex-token-fh9kp]. list of unattached volumes=[plex-config-persistent-storage transcode-volume tvshows-nfs-storage movies-nfs-storage plex-token-fh9kp]; skipping pod
Aug 23 06:43:02 node05 kubelet[924]: E0823 06:43:02.515003     924 pod_workers.go:190] Error syncing pod 11dad472-e834-4114-be5f-91923ba01370 ("plex-5fbbf6694f-nrtl4_streaming(11dad472-e834-4114-be5f-91923ba01370)"), skipping: timeout expired waiting for volumes to attach or mount for pod "streaming"/"plex-5fbbf6694f-nrtl4". list of unmounted volumes=[plex-config-persistent-storage transcode-volume tvshows-nfs-storage movies-nfs-storage plex-token-fh9kp]. list of unattached volumes=[plex-config-persistent-storage transcode-volume tvshows-nfs-storage movies-nfs-storage plex-token-fh9kp]
Aug 23 06:43:02 node05 kubelet[924]: E0823 06:43:02.531155     924 kubelet_volumes.go:154] Orphaned pod "11dad472-e834-4114-be5f-91923ba01370" found, but volume paths are still present on disk : There were a total of 7 errors similar to this. Turn up verbosity to see them.
Aug 23 06:43:04 node05 kubelet[924]: I0823 06:43:04.524292     924 clientconn.go:440] parsed scheme: ""
Aug 23 06:43:04 node05 kubelet[924]: I0823 06:43:04.524951     924 clientconn.go:440] scheme "" not registered, fallback to default scheme
Aug 23 06:43:04 node05 kubelet[924]: I0823 06:43:04.525023     924 asm_amd64.s:1337] ccResolverWrapper: sending new addresses to cc: [{/var/lib/kubelet/plugins/rbd.csi.ceph.com/csi.sock 0  <nil>}]
Aug 23 06:43:04 node05 kubelet[924]: I0823 06:43:04.525053     924 clientconn.go:796] ClientConn switching balancer to "pick_first"
Aug 23 06:43:04 node05 kubelet[924]: I0823 06:43:04.525110     924 balancer_conn_wrappers.go:131] pickfirstBalancer: HandleSubConnStateChange: 0xc00100e250, CONNECTING
Aug 23 06:43:04 node05 kubelet[924]: I0823 06:43:04.525360     924 balancer_conn_wrappers.go:131] pickfirstBalancer: HandleSubConnStateChange: 0xc00100e250, READY
Aug 23 06:43:04 node05 kubelet[924]: E0823 06:43:04.529875     924 kubelet_volumes.go:154] Orphaned pod "11dad472-e834-4114-be5f-91923ba01370" found, but volume paths are still present on disk : There were a total of 7 errors similar to this. Turn up verbosity to see them.
Aug 23 06:43:04 node05 kubelet[924]: I0823 06:43:04.530321     924 clientconn.go:440] parsed scheme: ""
Aug 23 06:43:04 node05 kubelet[924]: I0823 06:43:04.530364     924 clientconn.go:440] scheme "" not registered, fallback to default scheme
Aug 23 06:43:04 node05 kubelet[924]: I0823 06:43:04.530419     924 asm_amd64.s:1337] ccResolverWrapper: sending new addresses to cc: [{/var/lib/kubelet/plugins/rbd.csi.ceph.com/csi.sock 0  <nil>}]
Aug 23 06:43:04 node05 kubelet[924]: I0823 06:43:04.530445     924 clientconn.go:796] ClientConn switching balancer to "pick_first"
Aug 23 06:43:04 node05 kubelet[924]: I0823 06:43:04.530529     924 balancer_conn_wrappers.go:131] pickfirstBalancer: HandleSubConnStateChange: 0xc00100e5f0, CONNECTING
Aug 23 06:43:04 node05 kubelet[924]: I0823 06:43:04.530562     924 clientconn.go:1016] blockingPicker: the picked transport is not ready, loop back to repick
Aug 23 06:43:04 node05 kubelet[924]: I0823 06:43:04.530748     924 balancer_conn_wrappers.go:131] pickfirstBalancer: HandleSubConnStateChange: 0xc00100e5f0, READY
Aug 23 06:43:06 node05 kubelet[924]: E0823 06:43:06.531037     924 kubelet_volumes.go:154] Orphaned pod "11dad472-e834-4114-be5f-91923ba01370" found, but volume paths are still present on disk : There were a total of 7 errors similar to this. Turn up verbosity to see them.
Aug 23 06:43:08 node05 kubelet[924]: E0823 06:43:08.525101     924 kubelet_volumes.go:154] Orphaned pod "11dad472-e834-4114-be5f-91923ba01370" found, but volume paths are still present on disk : There were a total of 7 errors similar to this. Turn up verbosity to see them.
Aug 23 06:43:10 node05 kubelet[924]: E0823 06:43:10.528091     924 kubelet_volumes.go:154] Orphaned pod "11dad472-e834-4114-be5f-91923ba01370" found, but volume paths are still present on disk : There were a total of 7 errors similar to this. Turn up verbosity to see them.
Aug 23 06:43:12 node05 kubelet[924]: E0823 06:43:12.529265     924 kubelet_volumes.go:154] Orphaned pod "11dad472-e834-4114-be5f-91923ba01370" found, but volume paths are still present on disk : There were a total of 7 errors similar to this. Turn up verbosity to see them.

Steps to reproduce

Unfortunately, I do not know how to reproduce this; none of my other volumes experience this behavior. I hope the logs show something :) The volume is still in this state, however: when I delete the pod to get it rescheduled, it still fails to mount, so I can reproduce it in my cluster over and over again.

I think there is an issue here:

Aug 23 06:42:27 node05 kubelet[924]: E0823 06:42:27.011416     924 csi_mounter.go:422] kubernetes.io/csi: isDirMounted IsLikelyNotMountPoint test failed for dir [/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe/globalmount]
Aug 23 06:42:27 node05 kubelet[924]: E0823 06:42:27.011480     924 csi_attacher.go:296] kubernetes.io/csi: attacher.MountDevice failed while checking mount status for dir [/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe/globalmount]

Then it does not release the volume, and kubelet starts to complain about orphaned volumes. When the pod is rescheduled on the same node, it fails to mount since the previous volume is still attached (?).

Actual results

The pod is stuck in ContainerCreating forever because the volume cannot be mounted.

Expected behavior

The pod starts up with the volume mounted.

Madhu-1 commented 5 years ago

@wilmardo can you provide the rbd node plugin logs so we can find out what happened? btw in the latest canary image, we have fixed some issues related to the mounting failure.

Earlier we were not unmounting the rbd device if the mount failed; this has been fixed now, so if the pod moves to a different node you can still use the volume.

wilmardo commented 5 years ago

I grabbed the logs from the times above and saw nothing peculiar. So I deleted the pod again, but still nothing different from the previous logs showed up. What would you expect to see in the logs? @Madhu-1

Current logs

csi-rbdplugin

I0823 06:14:33.895938   12416 mount_linux.go:170] Cannot run systemd-run, assuming non-systemd OS
I0823 06:14:33.895979   12416 mount_linux.go:171] systemd-run failed with: exit status 1
I0823 06:14:33.895998   12416 mount_linux.go:172] systemd-run output: Failed to create bus connection: No such file or directory
I0823 06:14:33.896069   12416 utils.go:115] GRPC response: {"usage":[{"available":494272512,"total":520785920,"unit":1,"used":26513408},{"available":255996,"total":256000,"unit":2,"used":4}]}
I0823 06:14:57.408770   12416 utils.go:109] GRPC call: /csi.v1.Node/NodeGetCapabilities
I0823 06:14:57.408824   12416 utils.go:110] GRPC request: {}
I0823 06:14:57.410824   12416 utils.go:115] GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}},{"Type":{"Rpc":{"type":2}}}]}
I0823 06:14:57.413907   12416 utils.go:109] GRPC call: /csi.v1.Node/NodeGetVolumeStats
I0823 06:14:57.413943   12416 utils.go:110] GRPC request: {"volume_id":"0001-0024-9d9cc8f6-0843-46c3-8ca3-0309dff9978d-0000000000000003-a86804f9-b8f8-11e9-867b-aa9c979ac781","volume_path":"/var/lib/kubelet/pods/55c58b43-87af-4b65-b0ea-e4c0ebb26597/volumes/kubernetes.io~csi/pvc-4ae1e299-d2b5-4568-a489-a564d6073fbd/mount"}
I0823 06:14:57.417016   12416 mount_linux.go:170] Cannot run systemd-run, assuming non-systemd OS
I0823 06:14:57.417078   12416 mount_linux.go:171] systemd-run failed with: exit status 1
I0823 06:14:57.417118   12416 mount_linux.go:172] systemd-run output: Failed to create bus connection: No such file or directory
I0823 06:14:57.417224   12416 utils.go:115] GRPC response: {"usage":[{"available":5007634432,"total":5358223360,"unit":1,"used":350588928},{"available":2620356,"total":2621440,"unit":2,"used":1084}]}
I0823 06:15:40.245502   12416 utils.go:109] GRPC call: /csi.v1.Node/NodeGetCapabilities
I0823 06:15:40.245555   12416 utils.go:110] GRPC request: {}
I0823 06:15:40.247193   12416 utils.go:115] GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}},{"Type":{"Rpc":{"type":2}}}]}
I0823 06:15:40.250113   12416 utils.go:109] GRPC call: /csi.v1.Node/NodeGetVolumeStats
I0823 06:15:40.250148   12416 utils.go:110] GRPC request: {"volume_id":"0001-0024-9d9cc8f6-0843-46c3-8ca3-0309dff9978d-0000000000000003-a82a306c-b8f8-11e9-867b-aa9c979ac781","volume_path":"/var/lib/kubelet/pods/a2b478c3-3584-45b5-bc79-9cc4c04637b3/volumes/kubernetes.io~csi/pvc-3ba1f36e-fa31-46cc-a27d-21722b0d551a/mount"}

driver-registrar

I0807 09:48:17.953286   12200 main.go:110] Version: v1.1.0-0-g80a94421
I0807 09:48:17.953366   12200 main.go:120] Attempting to open a gRPC connection with: "/csi/csi.sock"
I0807 09:48:17.953393   12200 connection.go:151] Connecting to unix:///csi/csi.sock
I0807 09:48:26.246243   12200 main.go:127] Calling CSI driver to discover driver name
I0807 09:48:26.246362   12200 connection.go:180] GRPC call: /csi.v1.Identity/GetPluginInfo
I0807 09:48:26.246376   12200 connection.go:181] GRPC request: {}
I0807 09:48:26.271310   12200 connection.go:183] GRPC response: {"name":"rbd.csi.ceph.com","vendor_version":"canary"}
I0807 09:48:26.272416   12200 connection.go:184] GRPC error: <nil>
I0807 09:48:26.272439   12200 main.go:137] CSI driver name: "rbd.csi.ceph.com"
I0807 09:48:26.272560   12200 node_register.go:54] Starting Registration Server at: /registration/rbd.csi.ceph.com-reg.sock
I0807 09:48:26.272735   12200 node_register.go:61] Registration Server started at: /registration/rbd.csi.ceph.com-reg.sock
I0807 09:48:26.931294   12200 main.go:77] Received GetInfo call: &InfoRequest{}
I0807 09:48:27.096205   12200 main.go:87] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}

Madhu-1 commented 5 years ago

I was expecting some error message in NodeStageVolume, but I am not able to see one.

wilmardo commented 5 years ago

btw in the latest canary image, we have fixed some issues related to the mounting failure.

I will try this later today; right now I am working on the Helm chart :)

wilmardo commented 5 years ago

Tried it with the canary image, but to no avail: still the same error. I dug some more, though, and found this error in the nodeplugin:

I0824 18:42:17.399798   12416 nsenter.go:132] Running nsenter command: nsenter [--mount=/rootfs/proc/1/ns/mnt -- /usr/bin/realpath -e /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe/globalmount]
I0824 18:42:17.402761   12416 nsenter.go:194] failed to resolve symbolic links on /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe/globalmount: exit status 1

That led me to the directory in question, where I saw:

root@node05:/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe# ls -l
ls: cannot access 'globalmount': Input/output error
total 4
d????????? ? ?    ?      ?            ? globalmount
-rw-r--r-- 1 root root 152 Aug 18 18:08 vol_data.json

So the globalmount directory is broken: even its metadata cannot be read (stat fails with an input/output error).
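
To dig into that broken directory I would check, on the node, something like the following (paths copied from the output above; the lazy unmount at the end is only a guess at a possible manual cleanup, not a verified fix):

# Is the kernel still treating the path as a (possibly stale) mount point?
findmnt /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe/globalmount

# Any rbd/xfs errors in the kernel log that would explain the input/output error?
dmesg | grep -iE 'rbd|xfs' | tail -n 20

# If the mount is stale because the backing device went away, a lazy unmount
# may clear it (assumption: only do this once the consuming pod is gone)
umount -l /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe/globalmount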

I drained the node and left it drained, then redeployed the pod, and it got scheduled to another node. There the error seems the same, only at another path, and this error keeps spamming:

I0824 19:00:47.347245   15873 nsenter.go:132] Running nsenter command: nsenter [--mount=/rootfs/proc/1/ns/mnt -- /usr/bin/realpath -e /var/lib/kubelet/pods/316d07de-9d07-4916-a595-2b33d966ab93/volumes/kubernetes.io~csi/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe/mount]
I0824 19:00:47.349883   15873 nsenter.go:194] failed to resolve symbolic links on /var/lib/kubelet/pods/316d07de-9d07-4916-a595-2b33d966ab93/volumes/kubernetes.io~csi/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe/mount: exit status 1
E0824 19:00:47.349934   15873 utils.go:109] GRPC error: rpc error: code = NotFound desc = exit status 1
I0824 19:00:47.648840   15873 utils.go:105] GRPC call: /csi.v1.Node/NodeUnpublishVolume
I0824 19:00:47.649222   15873 utils.go:106] GRPC request: {"target_path":"/var/lib/kubelet/pods/316d07de-9d07-4916-a595-2b33d966ab93/volumes/kubernetes.io~csi/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe/mount","volume_id":"0001-0024-9d9cc8f6-0843-46c3-8ca3-0309dff9978d-0000000000000003-28a66c73-bab2-11e9-903a-122b397f0010"}
I0824 19:00:47.649924   15873 nsenter.go:132] Running nsenter command: nsenter [--mount=/rootfs/proc/1/ns/mnt -- /usr/bin/realpath -e /var/lib/kubelet/pods/316d07de-9d07-4916-a595-2b33d966ab93/volumes/kubernetes.io~csi/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe/mount]
I0824 19:00:47.652700   15873 nsenter.go:194] failed to resolve symbolic links on /var/lib/kubelet/pods/316d07de-9d07-4916-a595-2b33d966ab93/volumes/kubernetes.io~csi/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe/mount: exit status 1
E0824 19:00:47.652810   15873 utils.go:109] GRPC error: rpc error: code = NotFound desc = exit status 1
I0824 19:00:47.951231   15873 utils.go:105] GRPC call: /csi.v1.Node/NodeUnpublishVolume
I0824 19:00:47.951287   15873 utils.go:106] GRPC request: {"target_path":"/var/lib/kubelet/pods/316d07de-9d07-4916-a595-2b33d966ab93/volumes/kubernetes.io~csi/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe/mount","volume_id":"0001-0024-9d9cc8f6-0843-46c3-8ca3-0309dff9978d-0000000000000003-28a66c73-bab2-11e9-903a-122b397f0010"}
I0824 19:00:47.952082   15873 nsenter.go:132] Running nsenter command: nsenter [--mount=/rootfs/proc/1/ns/mnt -- /usr/bin/realpath -e /var/lib/kubelet/pods/316d07de-9d07-4916-a595-2b33d966ab93/volumes/kubernetes.io~csi/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe/mount]
I0824 19:00:47.954777   15873 nsenter.go:194] failed to resolve symbolic links on /var/lib/kubelet/pods/316d07de-9d07-4916-a595-2b33d966ab93/volumes/kubernetes.io~csi/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe/mount: exit status 1
E0824 19:00:47.954853   15873 utils.go:109] GRPC error: rpc error: code = NotFound desc = exit status 1
root@node06:/var/lib/kubelet/pods/316d07de-9d07-4916-a595-2b33d966ab93/volumes/kubernetes.io~csi/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe# ls -l
ls: cannot access 'mount': Input/output error
total 4
d????????? ? ?    ?      ?            ? mount
-rw-r--r-- 1 root root 339 Aug 12 09:58 vol_data.json

wilmardo commented 5 years ago

Also found this in the csi-attacher logs:

W0824 18:17:17.551275       1 trivial_handler.go:53] Error saving VolumeAttachment csi-469dcf8ba716aad45a0ee2b982e35265c774c48cf6dc9af048e3ee8e6fcb7e80 as attached: volumeattachments.storage.k8s.io "csi-469dcf8ba716aad45a0ee2b982e35265c774c48cf6dc9af048e3ee8e6fcb7e80" is forbidden: User "system:serviceaccount:storing:ceph-csi-rbd-provisioner" cannot patch resource "volumeattachments" in API group "storage.k8s.io" at the cluster scope

Adding the patch verb to the ClusterRole rules fixed it. Should I include that fix in #570?
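
For reference, the rule in question ends up looking roughly like this (a sketch; the role name and surrounding rules come from my Helm chart and may differ in other deployments):

kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: ceph-csi-rbd-provisioner   # name as used in my deployment, adjust to yours
rules:
  - apiGroups: ["storage.k8s.io"]
    resources: ["volumeattachments"]
    verbs: ["get", "list", "watch", "update", "patch"]   # "patch" was the missing verb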

wilmardo commented 5 years ago

@Madhu-1 Also, with the new canary, no pods can be recreated:

Events:
  Type     Reason       Age    From               Message
  ----     ------       ----   ----               -------
  Normal   Scheduled    2m59s  default-scheduler  Successfully assigned downloading/nzbget-67cd57dbbc-rkmz8 to node06
  Warning  FailedMount  61s    kubelet, node06    MountVolume.SetUp failed for volume "pvc-3ba1f36e-fa31-46cc-a27d-21722b0d551a" : rpc error: code = Internal desc = missing ID field 'userID' in secrets
  Warning  FailedMount  56s    kubelet, node06    Unable to mount volumes for pod "nzbget-67cd57dbbc-rkmz8_downloading(b0d7d1a4-823f-4607-ac86-a6ee7b721d14)": timeout expired waiting for volumes to attach or mount for pod "downloading"/"nzbget-67cd57dbbc-rkmz8". list of unmounted volumes=[nzbget-config-persistent-storage]. list of unattached volumes=[nzbget-config-persistent-storage downloads-persistent-storage nzbget-token-sctz4]

The userID is available, and this config works with the latest release. I just did a rollback to the v1.x images and it started working again without changing the config.

Edit:

That it was working after the downgrade seems to have been a lucky shot; now it isn't working again (same error message) with the release images. I will try to troubleshoot this further, but what I deployed was a working setup before I updated to the canary images.

Madhu-1 commented 5 years ago

@wilmardo the canary image needs a node-stage secret to mount the volume; are you referring to the storage class from the master branch?

wilmardo commented 5 years ago

This is my current storageclass:

---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-rbd-sc
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: rbd.csi.ceph.com
parameters:
  clusterID: <id>
  pool: kubernetes
  mounter: rbd-nbd
  imageFeatures: layering
  fsType: xfs
  imageFormat: "2"
  csi.storage.k8s.io/provisioner-secret-name: csi-rbd-secret
  csi.storage.k8s.io/provisioner-secret-namespace: storing
  csi.storage.k8s.io/node-stage-secret-name: csi-rbd-secret
  csi.storage.k8s.io/node-stage-secret-namespace: storing
reclaimPolicy: Delete
mountOptions:
  - discard
allowVolumeExpansion: true

The secret exists and has valid data:

Name:         csi-rbd-secret
Namespace:    storing
Labels:       <none>
Annotations:
Type:         Opaque

Data
====
userID:   10 bytes
userKey:  40 bytes
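
For completeness, written out as a manifest the secret is roughly the following (values are placeholders; the note on the userID format is my understanding rather than something taken from the docs):

apiVersion: v1
kind: Secret
metadata:
  name: csi-rbd-secret
  namespace: storing
type: Opaque
stringData:
  userID: kubernetes                                     # Ceph client name without the "client." prefix
  userKey: <output of "ceph auth get-key client.kubernetes">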

I tried it again to test the Helm chart, but both the v1.1.0 and the canary tags of the quay.io/cephcsi/cephcsi image fail and throw the same error:

Events:
  Type     Reason                  Age                 From                     Message
  ----     ------                  ----                ----                     -------
  Normal   Scheduled               13m                 default-scheduler        Successfully assigned downloading/nzbget-558564dd95-qsxk5 to node06
  Normal   SuccessfulAttachVolume  13m                 attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-3ba1f36e-fa31-46cc-a27d-21722b0d551a"
  Warning  FailedMount             88s (x14 over 13m)  kubelet, node06          MountVolume.SetUp failed for volume "pvc-3ba1f36e-fa31-46cc-a27d-21722b0d551a" : rpc error: code = Internal desc = missing ID field 'userID' in secrets
  Warning  FailedMount             36s (x6 over 11m)   kubelet, node06          Unable to mount volumes for pod "nzbget-558564dd95-qsxk5_downloading(c72c186b-038d-45f2-9efe-cb6d19438a3e)": timeout expired waiting for volumes to attach or mount for pod "downloading"/"nzbget-558564dd95-qsxk5". list of unmounted volumes=[nzbget-config-persistent-storage]. list of unattached volumes=[nzbget-config-persistent-storage downloads-persistent-storage nzbget-token-6hhfl]

Madhu-1 commented 5 years ago

@wilmardo can you paste the secret, if possible?

Madhu-1 commented 5 years ago

You have already attached the secret, I didn't notice.

wilmardo commented 5 years ago

I think there are several problems wrapped up in this issue. First, the problem from the OP, which seems related to this in the node's kubelet logs:

Aug 28 13:30:21 node06 kubelet[907]: E0828 13:30:21.345666     907 csi_mounter.go:366] kubernetes.io/csi: mounter.TearDownAt failed: rpc error: code = NotFound desc = exit status 1
Aug 28 13:30:21 node06 kubelet[907]: E0828 13:30:21.345900     907 nestedpendingoperations.go:270] Operation for "\"kubernetes.io/csi/rbd.csi.ceph.com^0001-0024-9d9cc8f6-0843-46c3-8ca3-0309dff9978d-0000000000000003-28a66c73-bab2-11e9-903a-122b397f0010\" (\"316d07de-9d07-4916-a595-2b33d966ab93\")" failed. No retries permitted until 2019-08-28 13:32:23.345817343 +0000 UTC m=+1396017.946979634 (durationBeforeRetry 2m2s). Error: "UnmountVolume.TearDown failed for volume \"plex-config-persistent-storage\" (UniqueName: \"kubernetes.io/csi/rbd.csi.ceph.com^0001-0024-9d9cc8f6-0843-46c3-8ca3-0309dff9978d-0000000000000003-28a66c73-bab2-11e9-903a-122b397f0010\") pod \"316d07de-9d07-4916-a595-2b33d966ab93\" (UID: \"316d07de-9d07-4916-a595-2b33d966ab93\") : rpc error: code = NotFound desc = exit status 1"

My guess is that the teardown never completes and therefore the volume cannot be remounted.
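
To check that theory from the storage side as well, something like this should show whether the image is still mapped and watched (the image name is inferred from the volume handle in the logs, and rbd-nbd list-mapped matches the mounter: rbd-nbd setting in my storage class, so treat this as a sketch):

# From a host with a Ceph keyring: does the image still have a watcher?
rbd status kubernetes/csi-vol-28a66c73-bab2-11e9-903a-122b397f0010

# On the node: is the device still mapped by rbd-nbd?
rbd-nbd list-mapped

# And is the stale staging/publish path still listed as a mount?
findmnt | grep pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe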

wilmardo commented 5 years ago

Looked into this again today; the userID error is gone with the canary images (it might have been an imagePullPolicy: IfNotPresent thing, which I have fixed).
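
For reference, the kind of change I mean is just forcing a fresh pull of the moving canary tag on the plugin container, something like the fragment below (container name taken from the nodeplugin daemonset in my deployment; a sketch, not necessarily the exact cause):

containers:
  - name: csi-rbdplugin
    image: quay.io/cephcsi/cephcsi:canary
    imagePullPolicy: Always   # IfNotPresent can keep running a stale image for a moving tag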

I tried to make this comment as comprehensible as possible :)

The behaviour still occurs with one of the four PVs I am using at the moment; none of the others show it. I have tried to provide all the logs again below. The issue seems similar to https://github.com/kubernetes/kubernetes/issues/77969.

Logs:

Creation of the pod:

Kubelet:

Sep 05 13:02:27 node06 kubelet[907]: I0905 13:02:27.591719     907 reconciler.go:203] operationExecutor.VerifyControllerAttachedVolume started for volume "transcode-volume" (UniqueName: "kubernetes.io/empty-dir/6e5ffb93-df89-41cf-98ea-a37a88b37616-transcode-volume") pod "plex-6c58bdbf8-8nldz" (UID: "6e5ffb93-df89-41cf-98ea-a37a88b37616")
Sep 05 13:02:27 node06 kubelet[907]: I0905 13:02:27.692701     907 reconciler.go:203] operationExecutor.VerifyControllerAttachedVolume started for volume "plex-tvshows-nfs-storage" (UniqueName: "kubernetes.io/nfs/6e5ffb93-df89-41cf-98ea-a37a88b37616-plex-tvshows-nfs-storage") pod "plex-6c58bdbf8-8nldz" (UID: "6e5ffb93-df89-41cf-98ea-a37a88b37616")
Sep 05 13:02:27 node06 kubelet[907]: I0905 13:02:27.692768     907 operation_generator.go:629] MountVolume.WaitForAttach entering for volume "pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe" (UniqueName: "kubernetes.io/csi/rbd.csi.ceph.com^0001-0024-9d9cc8f6-0843-46c3-8ca3-0309dff9978d-0000000000000003-28a66c73-bab2-11e9-903a-122b397f0010") pod "plex-6c58bdbf8-8nldz" (UID: "6e5ffb93-df89-41cf-98ea-a37a88b37616") DevicePath "csi-2d34fea5e072b990cd5a1406f57a97dad6d3ce62d03939b33bbc7a184543ea20"
Sep 05 13:02:27 node06 kubelet[907]: I0905 13:02:27.692832     907 reconciler.go:203] operationExecutor.VerifyControllerAttachedVolume started for volume "plex-movies-nfs-storage" (UniqueName: "kubernetes.io/nfs/6e5ffb93-df89-41cf-98ea-a37a88b37616-plex-movies-nfs-storage") pod "plex-6c58bdbf8-8nldz" (UID: "6e5ffb93-df89-41cf-98ea-a37a88b37616")
Sep 05 13:02:27 node06 kubelet[907]: I0905 13:02:27.692889     907 reconciler.go:203] operationExecutor.VerifyControllerAttachedVolume started for volume "plex-token-drdxb" (UniqueName: "kubernetes.io/secret/6e5ffb93-df89-41cf-98ea-a37a88b37616-plex-token-drdxb") pod "plex-6c58bdbf8-8nldz" (UID: "6e5ffb93-df89-41cf-98ea-a37a88b37616")
Sep 05 13:02:27 node06 kubelet[907]: I0905 13:02:27.698568     907 operation_generator.go:638] MountVolume.WaitForAttach succeeded for volume "pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe" (UniqueName: "kubernetes.io/csi/rbd.csi.ceph.com^0001-0024-9d9cc8f6-0843-46c3-8ca3-0309dff9978d-0000000000000003-28a66c73-bab2-11e9-903a-122b397f0010") pod "plex-6c58bdbf8-8nldz" (UID: "6e5ffb93-df89-41cf-98ea-a37a88b37616") DevicePath "csi-2d34fea5e072b990cd5a1406f57a97dad6d3ce62d03939b33bbc7a184543ea20"
Sep 05 13:02:27 node06 kubelet[907]: E0905 13:02:27.698995     907 csi_mounter.go:422] kubernetes.io/csi: isDirMounted IsLikelyNotMountPoint test failed for dir [/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe/globalmount]
Sep 05 13:02:27 node06 kubelet[907]: E0905 13:02:27.699070     907 csi_attacher.go:296] kubernetes.io/csi: attacher.MountDevice failed while checking mount status for dir [/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe/globalmount]
Sep 05 13:02:27 node06 kubelet[907]: E0905 13:02:27.699218     907 nestedpendingoperations.go:270] Operation for "\"kubernetes.io/csi/rbd.csi.ceph.com^0001-0024-9d9cc8f6-0843-46c3-8ca3-0309dff9978d-0000000000000003-28a66c73-bab2-11e9-903a-122b397f0010\"" failed. No retries permitted until 2019-09-05 13:02:28.1991703 +0000 UTC m=+2085422.800331744 (durationBeforeRetry 500ms). Error: "MountVolume.MountDevice failed for volume \"pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe\" (UniqueName: \"kubernetes.io/csi/rbd.csi.ceph.com^0001-0024-9d9cc8f6-0843-46c3-8ca3-0309dff9978d-0000000000000003-28a66c73-bab2-11e9-903a-122b397f0010\") pod \"plex-6c58bdbf8-8nldz\" (UID: \"6e5ffb93-df89-41cf-98ea-a37a88b37616\") : stat /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe/globalmount: input/output error"
Sep 05 13:02:27 node06 kubelet[907]: I0905 13:02:27.793588     907 reconciler.go:177] operationExecutor.UnmountVolume started for volume "plex-config-persistent-storage" (UniqueName: "kubernetes.io/csi/rbd.csi.ceph.com^0001-0024-9d9cc8f6-0843-46c3-8ca3-0309dff9978d-0000000000000003-28a66c73-bab2-11e9-903a-122b397f0010") pod "316d07de-9d07-4916-a595-2b33d966ab93" (UID: "316d07de-9d07-4916-a595-2b33d966ab93")
Sep 05 13:02:27 node06 kubelet[907]: I0905 13:02:27.793902     907 clientconn.go:440] parsed scheme: ""
Sep 05 13:02:27 node06 kubelet[907]: I0905 13:02:27.793949     907 clientconn.go:440] scheme "" not registered, fallback to default scheme
Sep 05 13:02:27 node06 kubelet[907]: I0905 13:02:27.794053     907 asm_amd64.s:1337] ccResolverWrapper: sending new addresses to cc: [{/var/lib/kubelet/plugins/rbd.csi.ceph.com/csi.sock 0  <nil>}]
Sep 05 13:02:27 node06 kubelet[907]: I0905 13:02:27.794101     907 clientconn.go:796] ClientConn switching balancer to "pick_first"
Sep 05 13:02:27 node06 kubelet[907]: I0905 13:02:27.794190     907 balancer_conn_wrappers.go:131] pickfirstBalancer: HandleSubConnStateChange: 0xc001ad67d0, CONNECTING
Sep 05 13:02:27 node06 kubelet[907]: I0905 13:02:27.794582     907 balancer_conn_wrappers.go:131] pickfirstBalancer: HandleSubConnStateChange: 0xc001ad67d0, READY
Sep 05 13:02:27 node06 kubelet[907]: E0905 13:02:27.799025     907 csi_mounter.go:366] kubernetes.io/csi: mounter.TearDownAt failed: rpc error: code = NotFound desc = exit status 1
Sep 05 13:02:27 node06 kubelet[907]: E0905 13:02:27.799147     907 nestedpendingoperations.go:270] Operation for "\"kubernetes.io/csi/rbd.csi.ceph.com^0001-0024-9d9cc8f6-0843-46c3-8ca3-0309dff9978d-0000000000000003-28a66c73-bab2-11e9-903a-122b397f0010\" (\"316d07de-9d07-4916-a595-2b33d966ab93\")" failed. No retries permitted until 2019-09-05 13:02:28.299100437 +0000 UTC m=+2085422.900261976 (durationBeforeRetry 500ms). Error: "UnmountVolume.TearDown failed for volume \"plex-config-persistent-storage\" (UniqueName: \"kubernetes.io/csi/rbd.csi.ceph.com^0001-0024-9d9cc8f6-0843-46c3-8ca3-0309dff9978d-0000000000000003-28a66c73-bab2-11e9-903a-122b397f0010\") pod \"316d07de-9d07-4916-a595-2b33d966ab93\" (UID: \"316d07de-9d07-4916-a595-2b33d966ab93\") : rpc error: code = NotFound desc = exit status 1"
Sep 05 13:02:27 node06 kubelet[907]: I0905 13:02:27.995664     907 operation_generator.go:629] MountVolume.WaitForAttach entering for volume "pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe" (UniqueName: "kubernetes.io/csi/rbd.csi.ceph.com^0001-0024-9d9cc8f6-0843-46c3-8ca3-0309dff9978d-0000000000000003-28a66c73-bab2-11e9-903a-122b397f0010") pod "plex-6c58bdbf8-8nldz" (UID: "6e5ffb93-df89-41cf-98ea-a37a88b37616") DevicePath "csi-2d34fea5e072b990cd5a1406f57a97dad6d3ce62d03939b33bbc7a184543ea20"
Sep 05 13:02:28 node06 kubelet[907]: I0905 13:02:28.000045     907 operation_generator.go:638] MountVolume.WaitForAttach succeeded for volume "pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe" (UniqueName: "kubernetes.io/csi/rbd.csi.ceph.com^0001-0024-9d9cc8f6-0843-46c3-8ca3-0309dff9978d-0000000000000003-28a66c73-bab2-11e9-903a-122b397f0010") pod "plex-6c58bdbf8-8nldz" (UID: "6e5ffb93-df89-41cf-98ea-a37a88b37616") DevicePath "csi-2d34fea5e072b990cd5a1406f57a97dad6d3ce62d03939b33bbc7a184543ea20"
Sep 05 13:02:28 node06 kubelet[907]: E0905 13:02:28.000298     907 csi_mounter.go:422] kubernetes.io/csi: isDirMounted IsLikelyNotMountPoint test failed for dir [/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe/globalmount]
Sep 05 13:02:28 node06 kubelet[907]: E0905 13:02:28.000331     907 csi_attacher.go:296] kubernetes.io/csi: attacher.MountDevice failed while checking mount status for dir [/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe/globalmount]
Sep 05 13:02:28 node06 kubelet[907]: E0905 13:02:28.000462     907 nestedpendingoperations.go:270] Operation for "\"kubernetes.io/csi/rbd.csi.ceph.com^0001-0024-9d9cc8f6-0843-46c3-8ca3-0309dff9978d-0000000000000003-28a66c73-bab2-11e9-903a-122b397f0010\"" failed. No retries permitted until 2019-09-05 13:02:28.500419113 +0000 UTC m=+2085423.101580520 (durationBeforeRetry 500ms). Error: "MountVolume.MountDevice failed for volume \"pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe\" (UniqueName: \"kubernetes.io/csi/rbd.csi.ceph.com^0001-0024-9d9cc8f6-0843-46c3-8ca3-0309dff9978d-0000000000000003-28a66c73-bab2-11e9-903a-122b397f0010\") pod \"plex-6c58bdbf8-8nldz\" (UID: \"6e5ffb93-df89-41cf-98ea-a37a88b37616\") : stat /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe/globalmount: input/output error"

Nodeplugin:

I0905 13:14:24.289697   23189 nsenter.go:132] Running nsenter command: nsenter [--mount=/rootfs/proc/1/ns/mnt -- /usr/bin/realpath -e /var/lib/kubelet/pods/316d07de-9d07-4916-a595-2b33d966ab93/volumes/kubernetes.io~csi/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe/mount]
I0905 13:14:24.293507   23189 nsenter.go:194] failed to resolve symbolic links on /var/lib/kubelet/pods/316d07de-9d07-4916-a595-2b33d966ab93/volumes/kubernetes.io~csi/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe/mount: exit status 1
E0905 13:14:24.293624   23189 utils.go:123] ID: 5850 GRPC error: rpc error: code = NotFound desc = exit status 1
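
So the NotFound in the kubelet comes from that realpath check in the nodeplugin failing. It can be reproduced directly on the node (host mount namespace, so the nsenter/rootfs prefix is not needed):

realpath -e /var/lib/kubelet/pods/316d07de-9d07-4916-a595-2b33d966ab93/volumes/kubernetes.io~csi/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe/mount
echo $?   # exit status 1 here is what surfaces as "rpc error: code = NotFound desc = exit status 1"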

Deletion of the pod:

Sep 05 13:04:07 node06 kubelet[907]: E0905 13:04:07.458101     907 nestedpendingoperations.go:270] Operation for "\"kubernetes.io/csi/rbd.csi.ceph.com^0001-0024-9d9cc8f6-0843-46c3-8ca3-0309dff9978d-0000000000000003-28a66c73-bab2-11e9-903a-122b397f0010\" (\"316d07de-9d07-4916-a595-2b33d966ab93\")" failed. No retries permitted until 2019-09-05 13:04:07.958054033 +0000 UTC m=+2085522.559215617 (durationBeforeRetry 500ms). Error: "UnmountVolume.TearDown failed for volume \"plex-config-persistent-storage\" (UniqueName: \"kubernetes.io/csi/rbd.csi.ceph.com^0001-0024-9d9cc8f6-0843-46c3-8ca3-0309dff9978d-0000000000000003-28a66c73-bab2-11e9-903a-122b397f0010\") pod \"316d07de-9d07-4916-a595-2b33d966ab93\" (UID: \"316d07de-9d07-4916-a595-2b33d966ab93\") : rpc error: code = NotFound desc = exit status 1"
Sep 05 13:04:07 node06 kubelet[907]: I0905 13:04:07.654684     907 operation_generator.go:629] MountVolume.WaitForAttach entering for volume "pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe" (UniqueName: "kubernetes.io/csi/rbd.csi.ceph.com^0001-0024-9d9cc8f6-0843-46c3-8ca3-0309dff9978d-0000000000000003-28a66c73-bab2-11e9-903a-122b397f0010") pod "plex-6c58bdbf8-8nldz" (UID: "6e5ffb93-df89-41cf-98ea-a37a88b37616") DevicePath "csi-2d34fea5e072b990cd5a1406f57a97dad6d3ce62d03939b33bbc7a184543ea20"
Sep 05 13:04:07 node06 kubelet[907]: I0905 13:04:07.659413     907 operation_generator.go:638] MountVolume.WaitForAttach succeeded for volume "pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe" (UniqueName: "kubernetes.io/csi/rbd.csi.ceph.com^0001-0024-9d9cc8f6-0843-46c3-8ca3-0309dff9978d-0000000000000003-28a66c73-bab2-11e9-903a-122b397f0010") pod "plex-6c58bdbf8-8nldz" (UID: "6e5ffb93-df89-41cf-98ea-a37a88b37616") DevicePath "csi-2d34fea5e072b990cd5a1406f57a97dad6d3ce62d03939b33bbc7a184543ea20"
Sep 05 13:04:07 node06 kubelet[907]: E0905 13:04:07.659689     907 csi_mounter.go:422] kubernetes.io/csi: isDirMounted IsLikelyNotMountPoint test failed for dir [/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe/globalmount]
Sep 05 13:04:07 node06 kubelet[907]: E0905 13:04:07.659723     907 csi_attacher.go:296] kubernetes.io/csi: attacher.MountDevice failed while checking mount status for dir [/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe/globalmount]
Sep 05 13:04:07 node06 kubelet[907]: E0905 13:04:07.659858     907 nestedpendingoperations.go:270] Operation for "\"kubernetes.io/csi/rbd.csi.ceph.com^0001-0024-9d9cc8f6-0843-46c3-8ca3-0309dff9978d-0000000000000003-28a66c73-bab2-11e9-903a-122b397f0010\"" failed. No retries permitted until 2019-09-05 13:04:08.15981253 +0000 UTC m=+2085522.760973977 (durationBeforeRetry 500ms). Error: "MountVolume.MountDevice failed for volume \"pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe\" (UniqueName: \"kubernetes.io/csi/rbd.csi.ceph.com^0001-0024-9d9cc8f6-0843-46c3-8ca3-0309dff9978d-0000000000000003-28a66c73-bab2-11e9-903a-122b397f0010\") pod \"plex-6c58bdbf8-8nldz\" (UID: \"6e5ffb93-df89-41cf-98ea-a37a88b37616\") : stat /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe/globalmount: input/output error"
Sep 05 13:04:07 node06 kubelet[907]: I0905 13:04:07.755389     907 reconciler.go:177] operationExecutor.UnmountVolume started for volume "plex-config-persistent-storage" (UniqueName: "kubernetes.io/csi/rbd.csi.ceph.com^0001-0024-9d9cc8f6-0843-46c3-8ca3-0309dff9978d-0000000000000003-28a66c73-bab2-11e9-903a-122b397f0010") pod "316d07de-9d07-4916-a595-2b33d966ab93" (UID: "316d07de-9d07-4916-a595-2b33d966ab93")
Sep 05 13:04:07 node06 kubelet[907]: I0905 13:04:07.755688     907 clientconn.go:440] parsed scheme: ""
Sep 05 13:04:07 node06 kubelet[907]: I0905 13:04:07.755742     907 clientconn.go:440] scheme "" not registered, fallback to default scheme
Sep 05 13:04:07 node06 kubelet[907]: I0905 13:04:07.755896     907 asm_amd64.s:1337] ccResolverWrapper: sending new addresses to cc: [{/var/lib/kubelet/plugins/rbd.csi.ceph.com/csi.sock 0  <nil>}]
Sep 05 13:04:07 node06 kubelet[907]: I0905 13:04:07.755959     907 clientconn.go:796] ClientConn switching balancer to "pick_first"
Sep 05 13:04:07 node06 kubelet[907]: I0905 13:04:07.756287     907 balancer_conn_wrappers.go:131] pickfirstBalancer: HandleSubConnStateChange: 0xc001c36fd0, CONNECTING
Sep 05 13:04:07 node06 kubelet[907]: I0905 13:04:07.756396     907 clientconn.go:1016] blockingPicker: the picked transport is not ready, loop back to repick
Sep 05 13:04:07 node06 kubelet[907]: I0905 13:04:07.756746     907 balancer_conn_wrappers.go:131] pickfirstBalancer: HandleSubConnStateChange: 0xc001c36fd0, READY
Sep 05 13:04:07 node06 kubelet[907]: E0905 13:04:07.763154     907 csi_mounter.go:366] kubernetes.io/csi: mounter.TearDownAt failed: rpc error: code = NotFound desc = exit status 1

After deletion the kubelet logs are filled with:

Sep 05 13:05:46 node06 kubelet[907]: E0905 13:05:46.745519     907 kubelet_volumes.go:154] Orphaned pod "6e5ffb93-df89-41cf-98ea-a37a88b37616" found, but volume paths are still present on disk : There were a total of 1 errors similar to this. Turn up verbosity to see them.

And indeed the pod's PVC directory is still there, the only one of the volumes left behind:

root@node06:~# ls -l /var/lib/kubelet/pods/6e5ffb93-df89-41cf-98ea-a37a88b37616/volumes/kubernetes.io~*
/var/lib/kubelet/pods/6e5ffb93-df89-41cf-98ea-a37a88b37616/volumes/kubernetes.io~csi:
total 4
drwxr-x--- 2 root root 4096 Sep  5 13:02 pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe

/var/lib/kubelet/pods/6e5ffb93-df89-41cf-98ea-a37a88b37616/volumes/kubernetes.io~empty-dir:
total 0

/var/lib/kubelet/pods/6e5ffb93-df89-41cf-98ea-a37a88b37616/volumes/kubernetes.io~nfs:
total 0

/var/lib/kubelet/pods/6e5ffb93-df89-41cf-98ea-a37a88b37616/volumes/kubernetes.io~secret:
total 0

But it has no mount:

root@node06:~# ls -l /var/lib/kubelet/pods/6e5ffb93-df89-41cf-98ea-a37a88b37616/volumes/kubernetes.io~csi/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe/
total 4
-rw-r--r-- 1 root root 339 Sep  5 13:04 vol_data.json
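
That vol_data.json is the bit of state kubelet keeps to reconstruct the CSI volume; it records (among other things) the driver name and volume handle the directory belongs to, which is handy when matching orphaned directories to PVs:

cat /var/lib/kubelet/pods/6e5ffb93-df89-41cf-98ea-a37a88b37616/volumes/kubernetes.io~csi/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe/vol_data.json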

The globalmount still exists and throws the input/output error:

root@node06:~# ls -l /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe
ls: cannot access '/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe/globalmount': Input/output error
total 4
d????????? ? ?    ?      ?            ? globalmount
-rw-r--r-- 1 root root 152 Aug 12 09:58 vol_data.json
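
That d????????? entry with the Input/output error is typically a mount whose backing device has gone away underneath it. A sketch for inspecting it and, if needed, clearing it by hand so kubelet can stage the volume again (double-check the paths and device names before running anything):

grep globalmount /proc/mounts      # shows which device backs the stale staging mount (e.g. an nbd device)
dmesg | grep -i nbd | tail -n 20   # kernel messages may show the nbd connection dying under the mount
umount -l /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-9174936d-f9d0-493b-87f2-49f1fb5835fe/globalmount   # lazy-unmount the dead mount
rbd-nbd list-mapped                # if a mapping is left behind, unmap it with: rbd-nbd unmap /dev/nbdX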
wilmardo commented 4 years ago

Found a workaround for anyone with the same problem:

This will fix the issue for some time; I will hopefully migrate to ext4 soon and update this issue to see whether that resolves it :)

Madhu-1 commented 4 years ago

Hey @wilmardo, any particular reason not to use krbd? We have not completely tested rbd-nbd yet.

I hope you have the https://github.com/ceph/ceph-csi/pull/648 changes in your templates.

wilmardo commented 4 years ago

@Madhu-1 Thanks for the pointer! This issue is now "resolved" by using the kernel mounter; it has been running stable for more than a month now.

Should I leave this open, since there is definitely an issue with rbd-nbd?
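
For anyone else switching mounters: in ceph-csi this is a StorageClass parameter; if mounter: rbd-nbd is set, removing it makes the driver fall back to krbd (the default). A quick way to check what a StorageClass is pinned to (the StorageClass name here is just an example):

kubectl get storageclass csi-rbd-sc -o jsonpath='{.parameters.mounter}{"\n"}'   # empty output means the default krbd mounter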

vvavepacket commented 4 years ago

@wilmardo I'm also encountering the same issue, but this time I am using FileStore. Any idea how I can execute the same command with FileStore? I do not have real disk devices, I just used directories.

humblec commented 4 years ago

> @Madhu-1 Thanks for the pointer! This issue is now "resolved" by using the kernel mounter; it has been running stable for more than a month now.
>
> Should I leave this open, since there is definitely an issue with rbd-nbd?

@wilmardo the issue with fuse or nbd here is a known one, and we already have trackers for it. Considering that, I would like to close this issue.

@vvavepacket I would suggest opening a new issue if you are facing one.

Closing this for now.