fluid-cloudnative / fluid

Fluid, elastic data abstraction and acceleration for BigData/AI applications in cloud. (Project under CNCF)
https://fluid-cloudnative.github.io/
Apache License 2.0

[BUG] NodeUnpublishVolume break for-loop too early to clean all the corrupted mount points #4189

Closed: TrafalgarZZZ closed this issue 3 months ago

TrafalgarZZZ commented 3 months ago

What is your environment (Kubernetes version, Fluid version, etc.)?
Fluid 1.0

Describe the bug
When there are multiple corrupted mount points on a Fluid volume's targetPath (a common situation when the FUSE Recovery feature is enabled), it is the Fluid CSI plugin's responsibility to unmount all of them in func NodeUnpublishVolume. However, the current code returns early without unmounting all of them.

A brief walk-through of the code (a sketch of an exhaustive cleanup follows the list):

  1. The CSI plugin checks whether targetPath is likely a mount point. If targetPath is a corrupted mount point, the func returns true (notMount) and a corrupted-mount error (err). (code)
  2. For the corrupted-mount error, we only log the case and keep going. (code)
  3. For notMount == true, the CSI plugin assumes its work is done, but there might actually be multiple corrupted mount points stacked on targetPath. (code)
  4. Finally, the CSI plugin calls func CleanupMountPoint, which unmounts only once, leaving the other corrupted mount points behind.
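For reference, here is a minimal sketch (not the actual Fluid implementation) of what an exhaustive cleanup could look like, assuming the plugin uses k8s.io/mount-utils as standard CSI tooling does; cleanupAllMountPoints and maxUnmountRetries are hypothetical names. The idea is to keep unmounting targetPath until it is no longer a mount point, treating a corrupted mount as still mounted:

```go
// Illustrative sketch only, not Fluid's code path: unmount a targetPath that
// may carry several stacked, possibly corrupted mount points.
package main

import (
	"fmt"
	"log"

	mount "k8s.io/mount-utils"
)

const maxUnmountRetries = 10 // illustrative safety bound, not from Fluid

func cleanupAllMountPoints(mounter mount.Interface, targetPath string) error {
	for i := 0; i < maxUnmountRetries; i++ {
		notMnt, err := mounter.IsLikelyNotMountPoint(targetPath)
		if err != nil {
			if !mount.IsCorruptedMnt(err) {
				return err
			}
			// A corrupted mount ("transport endpoint is not connected")
			// is still a mount point and must be unmounted.
			notMnt = false
		}
		if notMnt {
			// Nothing left mounted: let CleanupMountPoint remove the dir.
			return mount.CleanupMountPoint(targetPath, mounter, false)
		}
		if err := mounter.Unmount(targetPath); err != nil {
			return fmt.Errorf("failed to unmount %s: %v", targetPath, err)
		}
	}
	return fmt.Errorf("mount points still left on %s after %d unmounts", targetPath, maxUnmountRetries)
}

func main() {
	// Hypothetical targetPath for demonstration only.
	targetPath := "/var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/default-demo/mount"
	if err := cleanupAllMountPoints(mount.New(""), targetPath); err != nil {
		log.Fatal(err)
	}
}
```

The retry bound only guards against an unexpected endless unmount loop; in the normal case the loop exits as soon as IsLikelyNotMountPoint reports that nothing is mounted on targetPath anymore.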

Luckily, all the corrupted mount points will eventually be unmounted because kubelet retries the tear-down. For example, when tearing down a targetPath with 3 corrupted mount points, kubelet reports:

Jul 02 17:58:56 iZ2zec8bxibjv02csp8chtZ kubelet[2656]: E0702 17:58:56.346277    2656 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/fuse.csi.fluid.io^default-demo podName:5309cb53-14cd-420c-bc4a-75d9dc31e759 nodeName:}" failed. No retries permitted until 2024-07-02 17:58:56.846259243 +0800 CST m=+2149126.490387342 (durationBeforeRetry 500ms). Error: UnmountVolume.TearDown failed for volume "data-vol" (UniqueName: "kubernetes.io/csi/fuse.csi.fluid.io^default-demo") pod "5309cb53-14cd-420c-bc4a-75d9dc31e759" (UID: "5309cb53-14cd-420c-bc4a-75d9dc31e759") : kubernetes.io/csi: Unmounter.TearDownAt failed: rpc error: code = Internal desc = NodeUnpublishVolume: failed when cleanupMountPoint on path /var/lib/kubelet/pods/5309cb53-14cd-420c-bc4a-75d9dc31e759/volumes/kubernetes.io~csi/default-demo/mount: stat /var/lib/kubelet/pods/5309cb53-14cd-420c-bc4a-75d9dc31e759/volumes/kubernetes.io~csi/default-demo/mount: transport endpoint is not connected
Jul 02 17:58:56 iZ2zec8bxibjv02csp8chtZ kubelet[2656]: E0702 17:58:56.848552    2656 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/fuse.csi.fluid.io^default-demo podName:5309cb53-14cd-420c-bc4a-75d9dc31e759 nodeName:}" failed. No retries permitted until 2024-07-02 17:58:57.848525874 +0800 CST m=+2149127.492653986 (durationBeforeRetry 1s). Error: UnmountVolume.TearDown failed for volume "data-vol" (UniqueName: "kubernetes.io/csi/fuse.csi.fluid.io^default-demo") pod "5309cb53-14cd-420c-bc4a-75d9dc31e759" (UID: "5309cb53-14cd-420c-bc4a-75d9dc31e759") : kubernetes.io/csi: Unmounter.TearDownAt failed: rpc error: code = Internal desc = NodeUnpublishVolume: failed when cleanupMountPoint on path /var/lib/kubelet/pods/5309cb53-14cd-420c-bc4a-75d9dc31e759/volumes/kubernetes.io~csi/default-demo/mount: stat /var/lib/kubelet/pods/5309cb53-14cd-420c-bc4a-75d9dc31e759/volumes/kubernetes.io~csi/default-demo/mount: transport endpoint is not connected
Jul 02 17:58:57 iZ2zec8bxibjv02csp8chtZ kubelet[2656]: E0702 17:58:57.855291    2656 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/fuse.csi.fluid.io^default-demo podName:5309cb53-14cd-420c-bc4a-75d9dc31e759 nodeName:}" failed. No retries permitted until 2024-07-02 17:58:59.855273142 +0800 CST m=+2149129.499401242 (durationBeforeRetry 2s). Error: UnmountVolume.TearDown failed for volume "data-vol" (UniqueName: "kubernetes.io/csi/fuse.csi.fluid.io^default-demo") pod "5309cb53-14cd-420c-bc4a-75d9dc31e759" (UID: "5309cb53-14cd-420c-bc4a-75d9dc31e759") : kubernetes.io/csi: Unmounter.TearDownAt failed: rpc error: code = Internal desc = NodeUnpublishVolume: failed when cleanupMountPoint on path /var/lib/kubelet/pods/5309cb53-14cd-420c-bc4a-75d9dc31e759/volumes/kubernetes.io~csi/default-demo/mount: stat /var/lib/kubelet/pods/5309cb53-14cd-420c-bc4a-75d9dc31e759/volumes/kubernetes.io~csi/default-demo/mount: transport endpoint is not connected
Jul 02 17:58:59 iZ2zec8bxibjv02csp8chtZ kubelet[2656]: I0702 17:58:59.869373    2656 operation_generator.go:888] UnmountVolume.TearDown succeeded for volume "kubernetes.io/csi/fuse.csi.fluid.io^default-demo" (OuterVolumeSpecName: "data-vol") pod "5309cb53-14cd-420c-bc4a-75d9dc31e759" (UID: "5309cb53-14cd-420c-bc4a-75d9dc31e759"). InnerVolumeSpecName "default-demo". PluginName "kubernetes.io/csi", VolumeGidValue ""

What you expect to happen
This logic can be optimized by checking for, and cleaning up, all corrupted mount points within a single call to func NodeUnpublishVolume, instead of relying on kubelet's retries.
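To illustrate what "checking corrupted mount points" means in practice (assuming the plugin classifies errors with k8s.io/mount-utils, as standard CSI tooling does): a stat() on a broken FUSE mount fails with ENOTCONN ("transport endpoint is not connected"), which mount.IsCorruptedMnt recognizes as a corrupted mount. isCorruptedMountPoint below is a hypothetical helper, not part of Fluid:

```go
// Illustrative sketch: detect whether a targetPath is a corrupted mount point
// by stat-ing it and classifying the error with k8s.io/mount-utils.
package main

import (
	"fmt"
	"os"

	mount "k8s.io/mount-utils"
)

func isCorruptedMountPoint(targetPath string) bool {
	_, err := os.Stat(targetPath)
	// ENOTCONN ("transport endpoint is not connected"), ESTALE, EIO, etc.
	// are treated as corrupted mounts by IsCorruptedMnt.
	return err != nil && mount.IsCorruptedMnt(err)
}

func main() {
	// Hypothetical targetPath for demonstration only.
	path := "/var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/default-demo/mount"
	fmt.Println("corrupted mount:", isCorruptedMountPoint(path))
}
```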

How to reproduce it

  1. Enable the FUSE Recovery feature.
  2. Delete the FUSE pod several times.
  3. Delete the app pod and check the Fluid CSI plugin's logs.

Additional Information

TrafalgarZZZ commented 3 months ago

/assign @TrafalgarZZZ