IBM / power-openstack-k8s-volume-driver


FlexVolume driver not correctly cleaning up after deleting LUN/ Persistent volume #2

Closed dannert closed 5 years ago

dannert commented 5 years ago

When a FlexVolume driver LUN is unmapped or deleted from a worker node, the FlexVolume driver does not correctly remove the corresponding mounts and multipath devices.

The expectation is that after a LUN is moved to another worker node or is deleted, the worker node the LUN was originally attached to is left "clean", with its entries in /dev/mapper (visible in multipath -l output) and any container mounts fully removed.

In my test, moving a LUN to another worker node left the device and the mount in place on the original worker, and /var/log/messages continuously shows these errors:

Jan 30 16:37:05 aop93cl124 hyperkube: E0130 16:37:05.071283 4536 kubelet_volumes.go:140] Orphaned pod "a0a236ad-23df-11e9-b738-fa99c511ef20" found, but volume paths are still present on disk : There were a total of 1 errors similar to this. Turn up verbosity to see them.
Jan 30 16:37:05 aop93cl124 multipathd: mpathh: sdc - tur checker reports path is down
Jan 30 16:37:06 aop93cl124 multipathd: mpathh: sdx - tur checker reports path is down
Jan 30 16:37:07 aop93cl124 hyperkube: E0130 16:37:07.062533 4536 kubelet_volumes.go:140] Orphaned pod "a0a236ad-23df-11e9-b738-fa99c511ef20" found, but volume paths are still present on disk : There were a total of 1 errors similar to this. Turn up verbosity to see them.
Jan 30 16:37:07 aop93cl124 multipathd: mpathh: sdo - tur checker reports path is down
Jan 30 16:37:09 aop93cl124 hyperkube: E0130 16:37:09.068412 4536 kubelet_volumes.go:140] Orphaned pod "a0a236ad-23df-11e9-b738-fa99c511ef20" found, but volume paths are still present on disk : There were a total of 1 errors similar to this. Turn up verbosity to see them.
Jan 30 16:37:09 aop93cl124 multipathd: mpathh: sdf - tur checker reports path is down
Jan 30 16:37:09 aop93cl124 multipathd: mpathh: sdi - tur checker reports path is down
Jan 30 16:37:09 aop93cl124 multipathd: mpathh: sdl - tur checker reports path is down
Jan 30 16:37:09 aop93cl124 multipathd: mpathh: sdr - tur checker reports path is down
Jan 30 16:37:09 aop93cl124 multipathd: mpathh: sdu - tur checker reports path is down
Jan 30 16:37:10 aop93cl124 multipathd: mpathh: sdc - tur checker reports path is down
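For reference, here is a small Go sketch (illustrative only, not part of the driver) of one way to spot such leftover maps; it assumes multipath-tools is installed and is run directly on the worker host, not inside a pod. The "tur checker reports path is down" messages above correspond to paths that show up as failed in this output.

```go
// Illustrative sketch: list multipath paths that look stale after a detach.
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

func main() {
	// "multipath -ll" prints every map together with the state of each of its paths.
	out, err := exec.Command("multipath", "-ll").CombinedOutput()
	if err != nil {
		fmt.Printf("multipath -ll failed: %v\n%s", err, out)
		return
	}
	for _, line := range strings.Split(string(out), "\n") {
		// A LUN that was detached on the storage side but never cleaned up on the
		// worker typically shows all of its paths as "failed".
		if strings.Contains(line, "failed") {
			fmt.Println("stale path:", strings.TrimSpace(line))
		}
	}
}
```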

jdbenson59 commented 5 years ago

I have also seen this same behavior in a recent PoC with a customer using ICP 3.1.1 and the PowerVC FlexVolume driver.

gautpras commented 5 years ago

I looked into this in more detail. I had earlier forgotten why we were not also running the udevadm commands when a volume is being detached.

This is a limitation that comes from the FlexVolume driver design. The volume detachment process in a FlexVolume driver happens as follows (a minimal sketch follows the list):

  1. unmount() and unmountDevice() are called on the worker node
  2. detach() is called on the controller. Since detach() runs on the controller after unmountDevice() has already run on the worker, we cannot run the udevadm commands to clear the device at that point because the volume is still attached. The most the code can do is unmount the volume directory from the pods during detachment; then, when a volume is later attached to the worker VM, we run the udevadm commands, which both clear the previously detached mappings and set up the new mappings.
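To make the worker/controller split concrete, here is a minimal sketch (not the driver's actual code) of how a FlexVolume binary dispatches these operations: unmount and unmountdevice are invoked by the kubelet on the worker, while detach is invoked on the controller node after the worker-side calls.

```go
// Minimal FlexVolume-style dispatch sketch; names and messages are illustrative.
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

type response struct {
	Status  string `json:"status"`
	Message string `json:"message"`
}

func reply(r response) {
	b, _ := json.Marshal(r)
	fmt.Println(string(b))
}

// arg safely returns the i-th command-line argument, or "" if it is missing.
func arg(i int) string {
	if len(os.Args) > i {
		return os.Args[i]
	}
	return ""
}

func main() {
	switch arg(1) {
	case "unmount": // worker: unmount the pod's volume directory
		reply(response{Status: "Success", Message: "Unmounted volume directory " + arg(2)})
	case "unmountdevice": // worker: unmount the global mount point
		// The LUN is still attached here, which is why udevadm cleanup was
		// originally deferred to the next attach.
		reply(response{Status: "Success", Message: "Operation Success"})
	case "detach": // controller: detach the volume from the worker VM
		// Runs on the controller, so it cannot touch the worker's device map.
		reply(response{Status: "Success", Message: "Detached " + arg(2)})
	default:
		reply(response{Status: "Not supported", Message: "unknown operation " + arg(1)})
	}
}
```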

Below are relevant logs. On worker:

2019-02-25T21:03:09.73-06:00 main : DEBUG : The args to main are unmount [unmount /var/lib/kubelet/pods/ed3cfe61-3972-11e9-b711-fadd2e279820/volumes/ibm~power-k8s-volume-flex/nginx-cinder-vol-1]
2019-02-25T21:03:09.73-06:00 unmount : INFO : unmount called with /var/lib/kubelet/pods/ed3cfe61-3972-11e9-b711-fadd2e279820/volumes/ibm~power-k8s-volume-flex/nginx-cinder-vol-1
2019-02-25T21:03:09.779-06:00 main : INFO : Returning response {"status":"Success","message":"Unmounted volume directory /var/lib/kubelet/pods/ed3cfe61-3972-11e9-b711-fadd2e279820/volumes/ibm~power-k8s-volume-flex/nginx-cinder-vol-1"}
2019-02-25T21:03:09.833-06:00 main : DEBUG : The args to main are unmountdevice [unmountdevice /var/lib/kubelet/plugins/kubernetes.io/flexvolume/ibm/power-k8s-volume-flex/mounts/nginx-cinder-vol-1]
2019-02-25T21:03:09.833-06:00 unmountDevice : INFO : unmountDevice called with /var/lib/kubelet/plugins/kubernetes.io/flexvolume/ibm/power-k8s-volume-flex/mounts/nginx-cinder-vol-1
2019-02-25T21:03:09.93-06:00 main : INFO : Returning response {"status":"Success","message":"Operation Success"}

After unmount() and unmountDevice() are successful, detach() is called on the controller:

2019-02-26T03:03:11.991Z main : DEBUG : The args to main are detach [detach nginx-cinder-vol-1 ..

jwcroppe commented 5 years ago

@gautpras Just thinking broadly for a moment - would it be possible to consider a periodic task of sorts that cleans these up vs. waiting for the volume event (which may never come in)?

rj00553657 commented 5 years ago

We have written a new API in main.go that runs the udevadm commands periodically. The setup-power-openstack-k8s-volume-flex.sh script will invoke this API every 24 hours to clean up after a LUN / persistent volume is deleted. It takes the same lock used by the waitForAttach API before running the udevadm commands.
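A rough sketch of that idea is below. The names (attachLock, runUdevCleanup) and the specific udevadm invocation (trigger/settle) are assumptions for illustration; the thread only says "udevadm commands", and in the actual change the scheduling is driven by the setup script rather than in-process.

```go
// Sketch of a lock-guarded udev cleanup pass, assuming the lock is shared with waitForAttach.
package main

import (
	"log"
	"os/exec"
	"sync"
)

// attachLock stands in for the lock shared with waitForAttach, so the cleanup
// never races with an in-flight attach on the same worker.
var attachLock sync.Mutex

func runUdevCleanup() {
	attachLock.Lock()
	defer attachLock.Unlock()

	// Assumed commands: udevadm trigger re-runs device events and udevadm settle
	// waits for the udev queue to drain, so stale mappings get re-evaluated.
	for _, args := range [][]string{
		{"udevadm", "trigger"},
		{"udevadm", "settle"},
	} {
		if out, err := exec.Command(args[0], args[1:]...).CombinedOutput(); err != nil {
			log.Printf("%v failed: %v\n%s", args, err, out)
		}
	}
}

func main() {
	// In the described setup, the shell script would invoke this operation on a
	// 24-hour schedule; here we just run a single cleanup pass.
	runUdevCleanup()
}
```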

gautpras commented 5 years ago

The above does not work. The FVD pod does not have a view of its host's OS device map, so the cleanup code is not able to clean up the multipath mappings on the host OS.

gautpras commented 5 years ago

We tried two approaches to cleaning up the multipath maps.

  1. Creating a job from the FVD pod to clean up the host's multipath map. This does not work because the pod cannot access the host's multipath map, which is expected; otherwise it would be a security issue.
  2. The only other FVD API that gets called regularly on the worker, whenever any volume is attached to it, is the getvolumename API, and we considered adding the cleanup logic there. The risk is that cleaning up the multipath maps is an I/O operation that has to acquire a lock, so it is time-consuming and, in the worst case, could end up in a deadlock. That would make getvolumename calls slower to answer to Kubernetes, which is not ideal; in the worst case it could affect Kubernetes functionality when Kubernetes needs to remove the attached volume. The other obvious argument against adding the logic here is that this is simply not the right place for the operation.

Based on the above, it does not seem possible to clean up the multipath maps with a periodic task. The best FVD can do is clean up the maps on the next volume-attach attempt, as sketched below.
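As a sketch of that fallback: do the cleanup on the worker as part of the next attach, where the driver already holds its lock. cleanupStaleMaps, attachLock, and the exact commands are assumptions for illustration, not the driver's real code.

```go
// Sketch: clean up stale maps at the start of the next attach on the worker.
package main

import (
	"fmt"
	"log"
	"os/exec"
	"sync"
)

var attachLock sync.Mutex // assumed lock shared by attach-related operations

// cleanupStaleMaps flushes multipath maps that are no longer in use.
// "multipath -F" flushes all unused multipath device maps.
func cleanupStaleMaps() error {
	if out, err := exec.Command("multipath", "-F").CombinedOutput(); err != nil {
		return fmt.Errorf("multipath -F: %v: %s", err, out)
	}
	return nil
}

func waitForAttach(devicePath string) error {
	attachLock.Lock()
	defer attachLock.Unlock()

	// First clear whatever the previous detach left behind; a failure here is
	// logged but does not block the new attachment.
	if err := cleanupStaleMaps(); err != nil {
		log.Printf("stale map cleanup failed (continuing): %v", err)
	}

	// Then rescan so the newly attached LUN shows up with fresh mappings.
	for _, args := range [][]string{{"udevadm", "trigger"}, {"udevadm", "settle"}} {
		if out, err := exec.Command(args[0], args[1:]...).CombinedOutput(); err != nil {
			return fmt.Errorf("%v: %v: %s", args, err, out)
		}
	}
	_ = devicePath // the real driver would now wait for devicePath to appear
	return nil
}

func main() {
	// Hypothetical device name matching the map seen in the logs above.
	if err := waitForAttach("/dev/mapper/mpathh"); err != nil {
		log.Fatal(err)
	}
}
```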

We are still working on this path, as it requires more functional testing, so we are tagging this defect for the next FVD release, 1.0.2.

gautpras commented 5 years ago

The above commit fixes the issue by cleaning up the block devices during the unmountdevice call itself. Earlier, the devices were not cleaned there because the volume was still attached. The fix clears the devices anyway, because the next call Kubernetes makes after unmountdevice is detach, which will eventually detach the volume, so this is an eager cleanup.
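A minimal sketch of that eager cleanup follows, assuming illustrative paths, names, and commands rather than the actual commit: unmountDevice unmounts the global mount point and then immediately flushes the multipath map, relying on the fact that detach() is the next call from Kubernetes.

```go
// Sketch of eager device cleanup inside unmountDevice; not the driver's real code.
package main

import (
	"fmt"
	"log"
	"os/exec"
	"syscall"
)

func unmountDevice(mountDir, mpathName string) error {
	// 1. Unmount the kubelet's global mount point for this volume.
	if err := syscall.Unmount(mountDir, 0); err != nil {
		log.Printf("unmount %s: %v (may already be unmounted)", mountDir, err)
	}

	// 2. Eagerly flush the multipath map even though the LUN is still attached;
	//    the next FlexVolume call from Kubernetes is detach(), which removes it.
	if out, err := exec.Command("multipath", "-f", mpathName).CombinedOutput(); err != nil {
		return fmt.Errorf("multipath -f %s: %v: %s", mpathName, err, out)
	}

	// 3. A full cleanup would also delete the orphaned SCSI paths (for example by
	//    writing "1" to /sys/block/<sdX>/device/delete for each path); omitted here.
	return nil
}

func main() {
	// Hypothetical example values matching the paths seen in the logs above.
	err := unmountDevice(
		"/var/lib/kubelet/plugins/kubernetes.io/flexvolume/ibm/power-k8s-volume-flex/mounts/nginx-cinder-vol-1",
		"mpathh",
	)
	if err != nil {
		log.Fatal(err)
	}
}
```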