What is your environment (Kubernetes version, Fluid version, etc.)
Fluid 1.0
Describe the bug
When there are multiple corrupted mount points on a Fluid volume's targetPath (which should be common when the FUSE Recovery feature is enabled), it is the Fluid CSI plugin's responsibility to unmount all of them in func NodeUnpublishVolume. However, the current codebase returns early without unmounting all of them.
A brief go-through of the code:
1. The CSI plugin checks whether targetPath is likely a mount point. If targetPath is a corrupted mount point, the func returns true (notMount) and corruptedErr (err). (code)
2. For corruptedErr, the plugin only logs the case and keeps going. (code)
3. For notMount == true, the CSI plugin assumes its work is done. But there might actually be multiple corrupted mount points stacked on targetPath. (code)
4. Finally, the CSI plugin calls func CleanupMountPoint, which unmounts only once, leaving the other corrupted mount points behind.
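The reason one umount is not enough: a repeatedly recovered FUSE mount appears once per mount in the mount table, so a corrupted targetPath can be listed several times and a single umount only peels off the top entry. A minimal self-contained sketch (the mount-table excerpt, the fuse source name, and the helper countMounts are hypothetical stand-ins for parsing the real /proc/mounts):

```go
package main

import (
	"fmt"
	"strings"
)

// countMounts counts how many mount-table entries reference targetPath.
// Each stacked FUSE mount contributes one line, so a corrupted targetPath
// that was remounted by FUSE recovery can appear multiple times.
func countMounts(mountTable, targetPath string) int {
	n := 0
	for _, line := range strings.Split(mountTable, "\n") {
		fields := strings.Fields(line)
		// In /proc/mounts format, the second field is the mount point.
		if len(fields) >= 2 && fields[1] == targetPath {
			n++
		}
	}
	return n
}

func main() {
	// Hypothetical /proc/mounts excerpt: three stacked FUSE mounts on one targetPath.
	table := `demo-fuse /var/lib/kubelet/pods/uid/volumes/kubernetes.io~csi/default-demo/mount fuse rw 0 0
demo-fuse /var/lib/kubelet/pods/uid/volumes/kubernetes.io~csi/default-demo/mount fuse rw 0 0
demo-fuse /var/lib/kubelet/pods/uid/volumes/kubernetes.io~csi/default-demo/mount fuse rw 0 0`
	target := "/var/lib/kubelet/pods/uid/volumes/kubernetes.io~csi/default-demo/mount"
	// A single umount here would still leave two corrupted mounts behind.
	fmt.Println(countMounts(table, target)) // prints 3
}
```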
Luckily, all the corrupted mount points will eventually be unmounted because kubelet retries the teardown. For example, when tearing down a targetPath with 3 corrupted mount points, kubelet reports:
Jul 02 17:58:56 iZ2zec8bxibjv02csp8chtZ kubelet[2656]: E0702 17:58:56.346277 2656 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/fuse.csi.fluid.io^default-demo podName:5309cb53-14cd-420c-bc4a-75d9dc31e759 nodeName:}" failed. No retries permitted until 2024-07-02 17:58:56.846259243 +0800 CST m=+2149126.490387342 (durationBeforeRetry 500ms). Error: UnmountVolume.TearDown failed for volume "data-vol" (UniqueName: "kubernetes.io/csi/fuse.csi.fluid.io^default-demo") pod "5309cb53-14cd-420c-bc4a-75d9dc31e759" (UID: "5309cb53-14cd-420c-bc4a-75d9dc31e759") : kubernetes.io/csi: Unmounter.TearDownAt failed: rpc error: code = Internal desc = NodeUnpublishVolume: failed when cleanupMountPoint on path /var/lib/kubelet/pods/5309cb53-14cd-420c-bc4a-75d9dc31e759/volumes/kubernetes.io~csi/default-demo/mount: stat /var/lib/kubelet/pods/5309cb53-14cd-420c-bc4a-75d9dc31e759/volumes/kubernetes.io~csi/default-demo/mount: transport endpoint is not connected
Jul 02 17:58:56 iZ2zec8bxibjv02csp8chtZ kubelet[2656]: E0702 17:58:56.848552 2656 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/fuse.csi.fluid.io^default-demo podName:5309cb53-14cd-420c-bc4a-75d9dc31e759 nodeName:}" failed. No retries permitted until 2024-07-02 17:58:57.848525874 +0800 CST m=+2149127.492653986 (durationBeforeRetry 1s). Error: UnmountVolume.TearDown failed for volume "data-vol" (UniqueName: "kubernetes.io/csi/fuse.csi.fluid.io^default-demo") pod "5309cb53-14cd-420c-bc4a-75d9dc31e759" (UID: "5309cb53-14cd-420c-bc4a-75d9dc31e759") : kubernetes.io/csi: Unmounter.TearDownAt failed: rpc error: code = Internal desc = NodeUnpublishVolume: failed when cleanupMountPoint on path /var/lib/kubelet/pods/5309cb53-14cd-420c-bc4a-75d9dc31e759/volumes/kubernetes.io~csi/default-demo/mount: stat /var/lib/kubelet/pods/5309cb53-14cd-420c-bc4a-75d9dc31e759/volumes/kubernetes.io~csi/default-demo/mount: transport endpoint is not connected
Jul 02 17:58:57 iZ2zec8bxibjv02csp8chtZ kubelet[2656]: E0702 17:58:57.855291 2656 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/fuse.csi.fluid.io^default-demo podName:5309cb53-14cd-420c-bc4a-75d9dc31e759 nodeName:}" failed. No retries permitted until 2024-07-02 17:58:59.855273142 +0800 CST m=+2149129.499401242 (durationBeforeRetry 2s). Error: UnmountVolume.TearDown failed for volume "data-vol" (UniqueName: "kubernetes.io/csi/fuse.csi.fluid.io^default-demo") pod "5309cb53-14cd-420c-bc4a-75d9dc31e759" (UID: "5309cb53-14cd-420c-bc4a-75d9dc31e759") : kubernetes.io/csi: Unmounter.TearDownAt failed: rpc error: code = Internal desc = NodeUnpublishVolume: failed when cleanupMountPoint on path /var/lib/kubelet/pods/5309cb53-14cd-420c-bc4a-75d9dc31e759/volumes/kubernetes.io~csi/default-demo/mount: stat /var/lib/kubelet/pods/5309cb53-14cd-420c-bc4a-75d9dc31e759/volumes/kubernetes.io~csi/default-demo/mount: transport endpoint is not connected
Jul 02 17:58:59 iZ2zec8bxibjv02csp8chtZ kubelet[2656]: I0702 17:58:59.869373 2656 operation_generator.go:888] UnmountVolume.TearDown succeeded for volume "kubernetes.io/csi/fuse.csi.fluid.io^default-demo" (OuterVolumeSpecName: "data-vol") pod "5309cb53-14cd-420c-bc4a-75d9dc31e759" (UID: "5309cb53-14cd-420c-bc4a-75d9dc31e759"). InnerVolumeSpecName "default-demo". PluginName "kubernetes.io/csi", VolumeGidValue ""
What you expect to happen:
This logic can be optimized by checking for all corrupted mount points when calling func NodeUnpublishVolume and unmounting every one of them in a single call, instead of relying on kubelet's retry backoff.
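One way to express the expected behavior: keep unmounting targetPath until no entry for it remains in the mount table, rather than returning after the first umount. A minimal sketch with a simulated per-path mount stack (cleanupAllMounts and the map-based stack are hypothetical stand-ins for the real umount(2) call and /proc/mounts lookup, not the actual Fluid code):

```go
package main

import "fmt"

// cleanupAllMounts repeatedly unmounts targetPath until no mount entry
// for it remains, draining stacked corrupted mounts in one call instead
// of leaving the remainder to kubelet retries.
// mounts maps a path to its number of stacked mount entries; decrementing
// it simulates one successful umount(2).
func cleanupAllMounts(mounts map[string]int, targetPath string) int {
	calls := 0
	for mounts[targetPath] > 0 {
		mounts[targetPath]-- // one umount peels off one stacked mount
		calls++
	}
	return calls
}

func main() {
	// Hypothetical state: three corrupted mounts stacked on one targetPath.
	target := "/var/lib/kubelet/pods/uid/volumes/kubernetes.io~csi/default-demo/mount"
	mounts := map[string]int{target: 3}
	// All three are drained within a single NodeUnpublishVolume-style call.
	fmt.Println(cleanupAllMounts(mounts, target)) // prints 3
}
```

With this loop, the teardown succeeds on the first kubelet attempt instead of after three retries as shown in the logs above.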
How to reproduce it
Additional Information