ceph / ceph-csi

CSI driver for Ceph
Apache License 2.0

csi-cephfsplugin: after the plugin service is restarted, the client pod can no longer access the mounted volume #1819

Closed: lixiaopengy closed this issue 3 years ago

lixiaopengy commented 3 years ago

Environment: Kubernetes (kubeadm) v1.17, Docker 19.03.4, Ceph 14.2.11, ceph-csi v3.1.2

Problem: after the csi-cephfsplugin service is restarted, the client pod shows the following errors. The client pod has to be restarted to return to normal.

root@nginx-cephfs1-77664d8bfb-mpbwp:/# df -hT
df: /usr/share/nginx/html: Transport endpoint is not connected
Filesystem              Type     Size  Used Avail Use% Mounted on
overlay                 overlay  112G   31G   81G  28% /
tmpfs                   tmpfs     64M     0   64M   0% /dev
tmpfs                   tmpfs    7.9G     0  7.9G   0% /sys/fs/cgroup
/dev/mapper/centos-home xfs      112G   31G   81G  28% /root
shm                     tmpfs     64M     0   64M   0% /dev/shm
tmpfs                   tmpfs    7.9G   12K  7.9G   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs                   tmpfs    7.9G     0  7.9G   0% /proc/acpi
tmpfs                   tmpfs    7.9G     0  7.9G   0% /proc/scsi
tmpfs                   tmpfs    7.9G     0  7.9G   0% /sys/firmware

csi-cephfsplugin:

I1231 06:15:10.002371       1 utils.go:160] ID: 32 GRPC request: {}
I1231 06:15:10.006170       1 utils.go:165] ID: 32 GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}},{"Type":{"Rpc":{"type":2}}}]}
I1231 06:15:10.008460       1 utils.go:159] ID: 33 GRPC call: /csi.v1.Node/NodeGetVolumeStats
I1231 06:15:10.010427       1 utils.go:160] ID: 33 GRPC request: {"volume_id":"0001-0024-9771f5aa-191c-4e16-9524-19dc31d2cd8d-0000000000000001-fa920d17-4b2d-11eb-be5f-bacc5498d6bd","volume_path":"/home/kubelet/pods/e05631d7-e406-4799-b7a5-dc8819439dd1/volumes/kubernetes.io~csi/pvc-1f3ed29b-2709-4ce1-9dfc-c8b201c6711b/mount"}
E1231 06:15:10.023137       1 utils.go:163] ID: 33 GRPC error: rpc error: code = Internal desc = stat /home/kubelet/pods/e05631d7-e406-4799-b7a5-dc8819439dd1/volumes/kubernetes.io~csi/pvc-1f3ed29b-2709-4ce1-9dfc-c8b201c6711b/mount: transport endpoint is not connected
I1231 06:15:15.748857       1 utils.go:159] ID: 34 GRPC call: /csi.v1.Node/NodeGetCapabilities
I1231 06:15:15.750801       1 utils.go:160] ID: 34 GRPC request: {}
I1231 06:15:15.755824       1 utils.go:165] ID: 34 GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}},{"Type":{"Rpc":{"type":2}}}]}
I1231 06:15:15.757638       1 utils.go:159] ID: 35 GRPC call: /csi.v1.Node/NodeGetVolumeStats
I1231 06:15:15.758772       1 utils.go:160] ID: 35 GRPC request: {"volume_id":"0001-0024-9771f5aa-191c-4e16-9524-19dc31d2cd8d-0000000000000001-52fd8606-40fd-11eb-8ffe-eaf01d57385a","volume_path":"/home/kubelet/pods/ae344b80-3b07-4589-b1a1-ca75fa9debf2/volumes/kubernetes.io~csi/pvc-ec69de59-7823-4840-8eee-544f8261fef0/mount"}
E1231 06:15:15.767358       1 utils.go:163] ID: 35 GRPC error: rpc error: code = Internal desc = stat /home/kubelet/pods/ae344b80-3b07-4589-b1a1-ca75fa9debf2/volumes/kubernetes.io~csi/pvc-ec69de59-7823-4840-8eee-544f8261fef0/mount: transport endpoint is not connected
I1231 06:16:03.804767       1 utils.go:159] ID: 36 GRPC call: /csi.v1.Identity/Probe
I1231 06:16:03.805863       1 utils.go:160] ID: 36 GRPC request: {}
I1231 06:16:03.807203       1 utils.go:165] ID: 36 GRPC response: {}
I1231 06:16:43.208991       1 utils.go:159] ID: 37 GRPC call: /csi.v1.Node/NodeGetCapabilities
I1231 06:16:43.210732       1 utils.go:160] ID: 37 GRPC request: {}
I1231 06:16:43.214016       1 utils.go:165] ID: 37 GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}},{"Type":{"Rpc":{"type":2}}}]}
I1231 06:16:43.225537       1 utils.go:159] ID: 38 GRPC call: /csi.v1.Node/NodeGetVolumeStats
I1231 06:16:43.227063       1 utils.go:160] ID: 38 GRPC request: {"volume_id":"0001-0024-9771f5aa-191c-4e16-9524-19dc31d2cd8d-0000000000000001-52fd8606-40fd-11eb-8ffe-eaf01d57385a","volume_path":"/home/kubelet/pods/ae344b80-3b07-4589-b1a1-ca75fa9debf2/volumes/kubernetes.io~csi/pvc-ec69de59-7823-4840-8eee-544f8261fef0/mount"}
E1231 06:16:43.239974       1 utils.go:163] ID: 38 GRPC error: rpc error: code = Internal desc = stat /home/kubelet/pods/ae344b80-3b07-4589-b1a1-ca75fa9debf2/volumes/kubernetes.io~csi/pvc-ec69de59-7823-4840-8eee-544f8261fef0/mount: transport endpoint is not connected
I1231 06:17:03.805346       1 utils.go:159] ID: 39 GRPC call: /csi.v1.Identity/Probe
I1231 06:17:03.807484       1 utils.go:160] ID: 39 GRPC request: {}
I1231 06:17:03.808654       1 utils.go:165] ID: 39 GRPC response: {}
I1231 06:17:03.988552       1 utils.go:159] ID: 40 GRPC call: /csi.v1.Node/NodeGetCapabilities
I1231 06:17:03.989213       1 utils.go:160] ID: 40 GRPC request: {}
I1231 06:17:03.990647       1 utils.go:165] ID: 40 GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}},{"Type":{"Rpc":{"type":2}}}]}
I1231 06:17:03.992281       1 utils.go:159] ID: 41 GRPC call: /csi.v1.Node/NodeGetVolumeStats
I1231 06:17:03.993917       1 utils.go:160] ID: 41 GRPC request: {"volume_id":"0001-0024-9771f5aa-191c-4e16-9524-19dc31d2cd8d-0000000000000001-fa920d17-4b2d-11eb-be5f-bacc5498d6bd","volume_path":"/home/kubelet/pods/e05631d7-e406-4799-b7a5-dc8819439dd1/volumes/kubernetes.io~csi/pvc-1f3ed29b-2709-4ce1-9dfc-c8b201c6711b/mount"}
E1231 06:17:04.009064       1 utils.go:163] ID: 41 GRPC error: rpc error: code = Internal desc = stat /home/kubelet/pods/e05631d7-e406-4799-b7a5-dc8819439dd1/volumes/kubernetes.io~csi/pvc-1f3ed29b-2709-4ce1-9dfc-c8b201c6711b/mount: transport endpoint is not connected
I1231 06:17:53.322064       1 utils.go:159] ID: 42 GRPC call: /csi.v1.Node/NodeGetCapabilities
I1231 06:17:53.327475       1 utils.go:160] ID: 42 GRPC request: {}
I1231 06:17:53.331634       1 utils.go:165] ID: 42 GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}},{"Type":{"Rpc":{"type":2}}}]}
I1231 06:17:53.334161       1 utils.go:159] ID: 43 GRPC call: /csi.v1.Node/NodeGetVolumeStats
I1231 06:17:53.335814       1 utils.go:160] ID: 43 GRPC request: {"volume_id":"0001-0024-9771f5aa-191c-4e16-9524-19dc31d2cd8d-0000000000000001-52fd8606-40fd-11eb-8ffe-eaf01d57385a","volume_path":"/home/kubelet/pods/ae344b80-3b07-4589-b1a1-ca75fa9debf2/volumes/kubernetes.io~csi/pvc-ec69de59-7823-4840-8eee-544f8261fef0/mount"}
E1231 06:17:53.345216       1 utils.go:163] ID: 43 GRPC error: rpc error: code = Internal desc = stat /home/kubelet/pods/ae344b80-3b07-4589-b1a1-ca75fa9debf2/volumes/kubernetes.io~csi/pvc-ec69de59-7823-4840-8eee-544f8261fef0/mount: transport endpoint is not connected
I1231 06:18:03.804955       1 utils.go:159] ID: 44 GRPC call: /csi.v1.Identity/Probe
I1231 06:18:03.806929       1 utils.go:160] ID: 44 GRPC request: {}
I1231 06:18:03.808853       1 utils.go:165] ID: 44 GRPC response: {}
I1231 06:18:18.573703       1 utils.go:159] ID: 45 GRPC call: /csi.v1.Node/NodeGetCapabilities
I1231 06:18:18.575350       1 utils.go:160] ID: 45 GRPC request: {}
I1231 06:18:18.583726       1 utils.go:165] ID: 45 GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}},{"Type":{"Rpc":{"type":2}}}]}
I1231 06:18:18.585569       1 utils.go:159] ID: 46 GRPC call: /csi.v1.Node/NodeGetVolumeStats
I1231 06:18:18.586131       1 utils.go:160] ID: 46 GRPC request: {"volume_id":"0001-0024-9771f5aa-191c-4e16-9524-19dc31d2cd8d-0000000000000001-fa920d17-4b2d-11eb-be5f-bacc5498d6bd","volume_path":"/home/kubelet/pods/e05631d7-e406-4799-b7a5-dc8819439dd1/volumes/kubernetes.io~csi/pvc-1f3ed29b-2709-4ce1-9dfc-c8b201c6711b/mount"}
E1231 06:18:18.599702       1 utils.go:163] ID: 46 GRPC error: rpc error: code = Internal desc = stat /home/kubelet/pods/e05631d7-e406-4799-b7a5-dc8819439dd1/volumes/kubernetes.io~csi/pvc-1f3ed29b-2709-4ce1-9dfc-c8b201c6711b/mount: transport endpoint is not connected
I1231 06:19:03.805562       1 utils.go:159] ID: 47 GRPC call: /csi.v1.Identity/Probe
I1231 06:19:03.807591       1 utils.go:160] ID: 47 GRPC request: {}
I1231 06:19:03.809026       1 utils.go:165] ID: 47 GRPC response: {}

If you need any other service logs, please let me know. Thank you for your help.

Madhu-1 commented 3 years ago

@lixiaopengxiya the fuse client is not supported in cephcsi for now; closing this for that reason. Please reopen if you are using the kernel mounter rather than fuse.

cl51287 commented 3 years ago

@Madhu-1 I saw in the documentation that fuse is used as the client by default, and that kernel mode does not support quota. Is there any way to solve this problem?

Madhu-1 commented 3 years ago

Fuse is the default, but it is not supported for production and is still under development in Cephcsi because it has some issues. Using the kernel client should fix the issue. Fuse has a restart issue where we get the "Transport endpoint is not connected" error. We are working on a new design to support the fuse client.
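
For reference, the CephFS StorageClass selects the mount implementation through its mounter parameter, so switching to the kernel client is a provisioning-time setting. A minimal sketch of a StorageClass that forces the kernel mounter might look like the following; the cluster ID, filesystem name, and secret names are placeholders, not values from this issue:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-cephfs-kernel
provisioner: cephfs.csi.ceph.com
parameters:
  clusterID: <cluster-id>              # placeholder; must match the csi-config-map entry
  fsName: <cephfs-filesystem-name>     # placeholder
  mounter: kernel                      # use the kernel client instead of ceph-fuse
  csi.storage.k8s.io/provisioner-secret-name: csi-cephfs-secret
  csi.storage.k8s.io/provisioner-secret-namespace: default
  csi.storage.k8s.io/node-stage-secret-name: csi-cephfs-secret
  csi.storage.k8s.io/node-stage-secret-namespace: default
reclaimPolicy: Delete

Note that volumes generally keep the mounter they were provisioned with, so already-bound PVs would likely need to be re-created (or at least re-staged) to pick up the change.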

cl51287 commented 3 years ago

Thank you for your reply. The inability to use quota in kernel mode may pose risks for our system. When is the new design expected to be available?

Madhu-1 commented 3 years ago

Currently, we don't have the exact date.

Huweicai commented 3 years ago

Fuse is the default, but it is not supported for production and is still under development in Cephcsi because it has some issues. Using the kernel client should fix the issue. Fuse has a restart issue where we get the "Transport endpoint is not connected" error. We are working on a new design to support the fuse client.

Can you say more about the new design for how the fuse client will avoid the restart problem? I have been stuck on the same question recently, too. Thanks!

Madhu-1 commented 3 years ago

There is no design doc for now. Once we have something, we will create a design doc PR.

Huweicai commented 3 years ago

There is no design doc for now. Once we have something, we will create a design doc PR.

Thanks, hope everything goes smoothly.

cl51287 commented 2 years ago

@Madhu-1 I found that the 3.4.0 release includes rbd-nbd volume healer support. Has this problem been solved?

Madhu-1 commented 2 years ago

@cl51287 yes, it is alpha support, and it is only for RBD, not for CephFS.

humblec commented 2 years ago

@Madhu-1 I found that the 3.4.0 release includes rbd-nbd volume healer support. Has this problem been solved?

We have introduced a mechanism here to handle plugin service restarts for user-space-mounted rbd volumes, so it is worth experimenting with the nbd mounter if you are on RBD PVs and want to use a user-space mounter instead of krbd.
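
For anyone wanting to try this, the nbd mounter is likewise selected per StorageClass. A minimal sketch of an RBD class using rbd-nbd; the cluster ID, pool, and secret names below are placeholders:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-rbd-nbd
provisioner: rbd.csi.ceph.com
parameters:
  clusterID: <cluster-id>        # placeholder; must match the csi-config-map entry
  pool: <rbd-pool>               # placeholder
  imageFeatures: layering
  mounter: rbd-nbd               # user-space mounter; krbd is used when this is omitted
  csi.storage.k8s.io/provisioner-secret-name: csi-rbd-secret
  csi.storage.k8s.io/provisioner-secret-namespace: default
  csi.storage.k8s.io/node-stage-secret-name: csi-rbd-secret
  csi.storage.k8s.io/node-stage-secret-namespace: default
reclaimPolicy: Delete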

cl51287 commented 2 years ago

@humblec We only want to use rbd, and we want a user-space mount, but our Kubernetes version is still relatively old, so we probably cannot use this for the time being. How is the problem solved?

cl51287 commented 2 years ago

@Madhu-1 I found that 3.6.0 already supports remounting for fuse. Has this problem been solved? In addition, does rbd-nbd also support this feature?

Madhu-1 commented 2 years ago

@Madhu-1 I found that 3.6.0 already supports remounting for fuse. Has this problem been solved? In addition, does rbd-nbd also support this feature?

Yes, we added support for nbd already. cc @pkalever

cl51287 commented 2 years ago

@Madhu-1 When we used 3.6.2, we found that the configuration in the documentation did not solve the problem: when the csi daemonset was restarted, the corresponding fuse process still disappeared and the mount failed. I don't know whether this is a bug. Our Kubernetes version is 1.14. We configured netNamespaceFilepath to point at the net namespace file of the host's PID 1 (we also tried other pods on the host), and by debugging with dlv we confirmed that the command executed via nsenter is correct, but when the csi daemonset is restarted the mount still fails.

Madhu-1 commented 2 years ago

@Madhu-1 When we used 3.6.2, we found that the configuration in the documentation did not solve the problem: when the csi daemonset was restarted, the corresponding fuse process still disappeared and the mount failed. I don't know whether this is a bug.

Automatic recovery of fuse mounts is still not supported. If you restart the application pod, or create one more pod on the same node that uses the same PVC, the mount should recover.
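
As an illustration of the second option, a throwaway pod pinned to the affected node and mounting the same PVC triggers a fresh stage/publish of the volume, which is what allows the broken ceph-fuse mount to be detected and remounted. This is only a sketch; the node name and PVC name below are hypothetical:

apiVersion: v1
kind: Pod
metadata:
  name: remount-trigger
spec:
  nodeName: <node-with-stale-mount>   # hypothetical; must be the node where the mount broke
  containers:
    - name: sleeper
      image: busybox:1.36
      command: ["sleep", "3600"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: <affected-pvc>     # hypothetical; the same PVC used by the broken pod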

Our Kubernetes version is 1.14. We configured netNamespaceFilepath to point at the net namespace file of the host's PID 1 (we also tried other pods on the host), and by debugging with dlv we confirmed that the command executed via nsenter is correct, but when the csi daemonset is restarted the mount still fails.

nsenter is not related to fuse; it is only used for pod networking.

cl51287 commented 2 years ago

@Madhu-1 Will the next version solve this problem? What does the documentation suggest for working around it?

Madhu-1 commented 2 years ago

@Madhu-1 Will the next version solve this problem? What does the documentation suggest for working around it?

Auto recovery is not planned yet. You can use the existing mechanism, https://github.com/ceph/ceph-csi/blob/devel/docs/ceph-fuse-corruption.md, which is supported in the 3.6.0 release.

cl51287 commented 2 years ago

@Madhu-1 Thank you very much