@timebertt Thank you for your contribution.
Thank you @timebertt for your contribution. Before I can start building your PR, a member of the organization must set the required label(s) {'reviewed/ok-to-test'}. Once started, you can check the build status in the PR checks section below.
We probably need to add tests for this PR. However, we wanted to open it to get some feedback on whether this is the correct approach/place to solve the issue.
@timebertt do you think we should generalise this to also skip the check in case the CSIDriver is configured with `attachRequired == false`? (Not in this PR even)
> @timebertt do you think we should generalise this to also skip the check in case the CSIDriver is configured with `attachRequired == false`? (Not in this PR even)
Sounds good. I had the impression that the attachment tracking doesn't work at all if there are no `VolumeAttachment` objects, i.e., no CSI driver is used, and the operation just runs into the configured timeout. Is this correct?
If so, we should probably change this PR from "skip tracking NFS volumes" to "skip tracking non-CSI volumes". Another PR could then enhance the check for CSI drivers where no `VolumeAttachments` are used.
@kon-angelo @gardener/mcm-maintainers can you provide feedback on the PR and the above discussion?
> @timebertt do you think we should generalise this to also skip the check in case the CSIDriver is configured with `attachRequired == false`? (Not in this PR even)
I agree with @kon-angelo's observation that this can be handled generically. @timebertt since you have raised this PR, we can also do this in 2 steps: skip NFS volumes (as you have done in this PR) and then generalise later, since generalisation would require an additional lookup of the `CSIDriver` object and its `AttachRequired` property to decide whether the wait for detachment of such a PV can be skipped.
Aside from my previous point, I kind of understand @timebertt's suggestion. The MCM code relies on `VolumeAttachments`, which to my knowledge are only created for CSI volumes. We could probably update the logic in this PR to skip anything other than CSI volumes. (But my experience is just from going through the code, and this should be verified by one of the actual MCM maintainers.)
Though one point is that I am not particularly fond of the fact that we "skip" reporting the existence of volumes, rather than not accounting for them when draining - if that makes sense. This can be somewhat troubling to debug, I find. But the actual function doing the work, `evictPodsWithPVInternal`, which also uses `getVolIDsFromDriver`, is used in many places, and so I understand this implementation.
I just feel that at some point we will go down a debugging rabbit hole, because `getVolIDsFromDriver` also does a little more than the name suggests, which is the skipping part 🤷
Back from vacation, sorry for the delay.
> I just feel that at some point we will go down a debugging rabbit hole, because `getVolIDsFromDriver` also does a little more than the name suggests, which is the skipping part 🤷
Got your point. I will rework the PR to skip any non-CSI volumes instead of handling NFS as a special case, improve the code flow (where the filtering is done), and add unit tests.
Another PR can be opened later to consult the `CSIDriver` object for the `attachRequired == false` check.
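For reference, a minimal sketch of what such a follow-up check could look like (the helper name and placement are hypothetical; it only illustrates consulting the `CSIDriver` object's `attachRequired` field):

```go
package drain

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// attachmentTrackingNeeded is a hypothetical helper: it returns false for
// volumes where no VolumeAttachment objects are expected, i.e. non-CSI volumes
// and CSI volumes whose CSIDriver declares attachRequired: false.
func attachmentTrackingNeeded(ctx context.Context, client kubernetes.Interface, pv *corev1.PersistentVolume) (bool, error) {
	csiSource := pv.Spec.PersistentVolumeSource.CSI
	if csiSource == nil {
		// Non-CSI volume: no VolumeAttachment will ever be created for it.
		return false, nil
	}

	csiDriver, err := client.StorageV1().CSIDrivers().Get(ctx, csiSource.Driver, metav1.GetOptions{})
	if err != nil {
		return false, err
	}

	// attachRequired defaults to true when unset.
	return csiDriver.Spec.AttachRequired == nil || *csiDriver.Spec.AttachRequired, nil
}
```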
While looking into the filtering again, I noticed a misunderstanding on my side. In short, the filtering I introduced in this PR and the filtering we envisioned wouldn't solve the problem at hand.
The actual problem is that the `PodVolumeInfo` returned by `getPodVolumeInfos` has two different lists of volumes:

- `persistentVolumeList` contains the PV names of all PVCs referenced in the pod. This list is not filtered and is only used in `waitForReattach`.
- `volumeList` contains the volume IDs/handles corresponding to `persistentVolumeList`. It can also be a subset thereof, as it is filtered in `Driver.GetVolumeIDs` when mapping PV names to IDs. E.g., for OpenStack, this list only contains cinder volumes (in-tree or CSI). This list is only used in `waitForDetach`.

See `doAccountingOfPvs`: https://github.com/gardener/machine-controller-manager/blob/97cd75245dacc4e40327ded2f28a37a4a3f42cf1/pkg/util/provider/drain/drain.go#L510-L542
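For illustration, the shape described above looks roughly like this (a sketch using the field names from this discussion; the actual types in drain.go may differ):

```go
// Sketch of the problematic split before this PR: two parallel lists that are
// filtered differently.
type PodVolumeInfo struct {
	// PV names of all PVCs referenced by the pod; not filtered by the driver.
	// Only used in waitForReattach.
	persistentVolumeList []string

	// Volume IDs/handles returned by Driver.GetVolumeIDs, i.e. filtered down
	// to provider-related volumes. Only used in waitForDetach.
	volumeList []string
}
```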
So when doing the filtering in `getVolIDsFromDriver` (as originally proposed by this PR), we only filter the `volumeList` used in `waitForDetach`. However, this list never contained NFS or non-CSI volumes, as it is already filtered by the driver. The problem is that `persistentVolumeList` is not filtered and the machine controller waits for reattachment of NFS or non-CSI volumes until it runs into a timeout.
To solve the problem, I reworked the PR so that there is only a single volume list in `PodVolumeInfo` that contains both pieces of information (PV name and volume ID). A PV is only added to the list if the driver returns a volume ID for it. Otherwise, we don't track detachment (this was already the case before) and don't track reattachment (new, this is the fix).
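In Go terms, the reworked shape looks roughly like this (an illustrative sketch with made-up helper names, not the exact code of this PR):

```go
// volumeInfo pairs the PV name with the driver-provided volume ID
// (illustrative names; the actual fields in drain.go may differ).
type volumeInfo struct {
	persistentVolumeName string
	volumeID             string
}

// PodVolumeInfo now carries a single list that is used for both detach and
// reattach tracking.
type PodVolumeInfo struct {
	volumes []volumeInfo
}

// buildPodVolumeInfo (hypothetical helper) adds a PV only if the driver
// returned a volume ID for it, so provider-unrelated volumes are neither
// tracked for detachment nor for reattachment.
func buildPodVolumeInfo(pvNames []string, volumeIDsByPVName map[string]string) PodVolumeInfo {
	info := PodVolumeInfo{}
	for _, pvName := range pvNames {
		if volID, ok := volumeIDsByPVName[pvName]; ok {
			info.volumes = append(info.volumes, volumeInfo{
				persistentVolumeName: pvName,
				volumeID:             volID,
			})
		}
	}
	return info
}
```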
The PR is still missing unit tests. First, I want to reproduce the issue on a test cluster and ensure that the change fixes the issue. In the meantime, feedback on the change is highly appreciated.
I reproduced the issue described above on an OpenStack shoot using the following steps.

First, deploy local-path-provisioner, which provisions PVs that are not handled by the OpenStack provider's `GetVolumeIDs`:

```shell
kustomize build "github.com/rancher/local-path-provisioner/deploy?ref=v0.0.30" | kubectl apply -f -
```

Then, deploy a StatefulSet that uses a `local-path` PVC:
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  creationTimestamp: null
  labels:
    app: test
  name: test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: test
    spec:
      containers:
      - name: volume-test
        image: nginx:stable-alpine
        imagePullPolicy: IfNotPresent
        volumeMounts:
        - name: test
          mountPath: /data
        ports:
        - containerPort: 80
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      creationTimestamp: null
      name: test
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 128Mi
      storageClassName: local-path
```
Check the pod and PVC, then delete the machine backing the node and follow the MCM logs:

```shell
$ k get po -owide
NAME     READY   STATUS    RESTARTS   AGE   IP              NODE                                              NOMINATED NODE   READINESS GATES
test-0   1/1     Running   0          50s   100.80.121.84   shoot--garden--s-eu01-000-worker-z2-54c6c-l7q9p   <none>           <none>
$ k get pvc
NAME          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
test-test-0   Bound    pvc-283f2aef-3066-4b23-bc61-30e750ea5340   128Mi      RWO            local-path     4m30s
$ gardenctl target control-plane
$ k delete machine shoot--garden--s-eu01-000-worker-z2-54c6c-l7q9p
$ stern machine | grep -e shoot--garden--s-eu01-000-worker-z2-54c6c-l7q9p -e attach -e detach
machine-controller-manager-845b9d96f6-wl7dq machine-controller-manager-provider-openstack I1010 12:14:13.429520 1 machine_util.go:1104] Normal delete/drain has been triggerred for machine "shoot--garden--s-eu01-000-worker-z2-54c6c-l7q9p" with providerID "openstack:///RegionOne/0edc5a3c-7ea3-4fd8-9b45-a75e386f178a" and backing node "shoot--garden--s-eu01-000-worker-z2-54c6c-l7q9p" with drain-timeout:2h0m0s & maxEvictRetries:360
machine-controller-manager-845b9d96f6-wl7dq machine-controller-manager-provider-openstack W1010 12:14:13.634676 1 drain.go:834] Node name: "shoot--garden--s-eu01-000-worker-z2-54c6c-l7q9p", list of pod PVs to wait for detach: []
machine-controller-manager-845b9d96f6-wl7dq machine-controller-manager-provider-openstack I1010 12:14:13.634699 1 drain.go:914] Waiting for following volumes to reattach: [pvc-283f2aef-3066-4b23-bc61-30e750ea5340]
machine-controller-manager-845b9d96f6-wl7dq machine-controller-manager-provider-openstack W1010 12:15:43.635770 1 drain.go:932] Timeout occurred while waiting for PVs [pvc-283f2aef-3066-4b23-bc61-30e750ea5340] to reattach to a different node
machine-controller-manager-845b9d96f6-wl7dq machine-controller-manager-provider-openstack W1010 12:15:43.635857 1 drain.go:771] Timeout occurred for following volumes to reattach: [pvc-283f2aef-3066-4b23-bc61-30e750ea5340]
machine-controller-manager-845b9d96f6-wl7dq machine-controller-manager-provider-openstack I1010 12:15:43.635954 1 drain.go:781] Pod + volume detachment from node shoot--garden--s-eu01-000-worker-z2-54c6c-l7q9p + volume reattachment to another node for Pod default/test-0 took 1m30.05787152s
machine-controller-manager-845b9d96f6-wl7dq machine-controller-manager-provider-openstack I1010 12:15:43.636047 1 drain.go:200] Machine drain ended on 2024-10-10 12:15:43.636038402 +0000 UTC m=+2075.047348776 and took 1m30.163757152s for "shoot--garden--s-eu01-000-worker-z2-54c6c-l7q9p"
```
The logs show that `waitForDetach` skips the non-cinder volume, as `volumeList` is filtered correctly: `list of pod PVs to wait for detach: []`.
However, `waitForReattach` still waits for the volume to reattach, as `persistentVolumeList` is not filtered: `Waiting for following volumes to reattach: [pvc-283f2aef-3066-4b23-bc61-30e750ea5340]`.
With this, MCM runs into the configured `--machine-pv-reattach-timeout` (90s by default).
These 90 seconds of unnecessary reattachment tracking add up quickly when rolling node pools with many provider-unrelated PVs.
Now, I verified my fix with a manually built image of mcm-provider-openstack and the steps described above.
With the change, both lists for detach and reattach tracking are empty and the drain operation finishes quickly without considering the non-cinder volume (as desired):
```text
I1010 15:57:11.162420 1 machine_util.go:1130] Normal delete/drain has been triggerred for machine "shoot--garden--s-eu01-000-worker-z1-6f5b7-cbq7j" with providerID "openstack:///RegionOne/accb65ba-9cae-47b5-984d-0598daaae381" and backing node "shoot--garden--s-eu01-000-worker-z1-6f5b7-cbq7j" with drain-timeout:2h0m0s & maxEvictRetries:360
W1010 15:57:11.329585 1 drain.go:865] Node name: "shoot--garden--s-eu01-000-worker-z1-6f5b7-cbq7j", list of pod PVs to wait for detach: []
W1010 15:57:11.329639 1 drain.go:941] List of pod PVs waiting for reattachment is 0: []
I1010 15:57:11.329721 1 drain.go:810] Pod + volume detachment from node shoot--garden--s-eu01-000-worker-z1-6f5b7-cbq7j + volume reattachment to another node for Pod default/test-0 took 30.042329ms
I1010 15:57:11.329761 1 drain.go:228] Machine drain ended on 2024-10-10 15:57:11.329751778 +0000 UTC m=+486.934211889 and took 126.250956ms for "shoot--garden--s-eu01-000-worker-z1-6f5b7-cbq7j"
```
I also tested the change with a node that had 3 cinder volumes attached:
```text
I1010 15:59:46.496527 1 machine_util.go:1130] Normal delete/drain has been triggerred for machine "shoot--garden--s-eu01-000-worker-z1-6f5b7-459p7" with providerID "openstack:///RegionOne/1fa6c33c-cb16-4ae9-b70e-7387aaf938cf" and backing node "shoot--garden--s-eu01-000-worker-z1-6f5b7-459p7" with drain-timeout:2h0m0s & maxEvictRetries:360
I1010 15:59:46.684754 1 drain.go:574] PodVolumeInfos: map[garden/prometheus-aggregate-0:{[{pv-shoot--garden--s-eu01-000-5ca2886d-85d9-46d6-8ef3-175d194aaa03 668f1bc6-3c5e-4683-bd34-b26d0874ffd1}]} garden/prometheus-cache-0:{[{pv-shoot--garden--s-eu01-000-21332da3-f21d-41d3-8c06-f2fdef77c9cf b3c6dc59-ec47-4486-bc0a-844824a5081e}]} garden/vali-0:{[{pv-shoot--garden--s-eu01-000-cb1a711c-1c05-468a-8266-025acf81dff3 3aad22f5-cade-44a2-aae9-bf8305b73bc1}]}]
I1010 15:59:46.753823 1 drain.go:763] Pod eviction/deletion for Pod garden/prometheus-aggregate-0 in Node "shoot--garden--s-eu01-000-worker-z1-6f5b7-459p7" and took 68.895848ms. Now waiting for volume detachment.
I1010 15:59:46.753853 1 drain.go:869] Waiting for following volumes to detach: [{pv-shoot--garden--s-eu01-000-5ca2886d-85d9-46d6-8ef3-175d194aaa03 668f1bc6-3c5e-4683-bd34-b26d0874ffd1}]
I1010 15:59:46.909744 1 drain.go:892] No of attached volumes for node "shoot--garden--s-eu01-000-worker-z1-6f5b7-459p7" is [{kubernetes.io/csi/cinder.csi.openstack.org^b3c6dc59-ec47-4486-bc0a-844824a5081e } {kubernetes.io/csi/cinder.csi.openstack.org^668f1bc6-3c5e-4683-bd34-b26d0874ffd1 } {kubernetes.io/csi/cinder.csi.openstack.org^3aad22f5-cade-44a2-aae9-bf8305b73bc1 }]
I1010 15:59:46.909938 1 drain.go:908] Found volume "668f1bc6-3c5e-4683-bd34-b26d0874ffd1" still attached to node "shoot--garden--s-eu01-000-worker-z1-6f5b7-459p7". Will re-check in 5s
...
I1010 15:59:53.514517 1 volume_attachment.go:42] Dispatching request for PV pv-shoot--garden--s-eu01-000-5ca2886d-85d9-46d6-8ef3-175d194aaa03
I1010 15:59:56.928776 1 drain.go:892] No of attached volumes for node "shoot--garden--s-eu01-000-worker-z1-6f5b7-459p7" is [{kubernetes.io/csi/cinder.csi.openstack.org^b3c6dc59-ec47-4486-bc0a-844824a5081e } {kubernetes.io/csi/cinder.csi.openstack.org^3aad22f5-cade-44a2-aae9-bf8305b73bc1 }]
I1010 15:59:56.929045 1 drain.go:921] Detached volumes [{pv-shoot--garden--s-eu01-000-5ca2886d-85d9-46d6-8ef3-175d194aaa03 668f1bc6-3c5e-4683-bd34-b26d0874ffd1}] from node "shoot--garden--s-eu01-000-worker-z1-6f5b7-459p7"
I1010 15:59:56.929243 1 drain.go:787] Pod + volume detachment from Node garden for Pod prometheus-aggregate-0/shoot--garden--s-eu01-000-worker-z1-6f5b7-459p7 and took 10.244327791s
I1010 15:59:56.929338 1 drain.go:945] Waiting for following volumes to reattach: [{pv-shoot--garden--s-eu01-000-5ca2886d-85d9-46d6-8ef3-175d194aaa03 668f1bc6-3c5e-4683-bd34-b26d0874ffd1}]
```
Now that the fix is verified and I'm confident in the change, I only need to add unit tests.
While verifying my change, I discovered another bug (unrelated to this change, but it breaks the reattach tracking): https://github.com/gardener/machine-controller-manager/issues/945
Hey @gardener/mcm-maintainers @kon-angelo. I finished verifying this PR and adding tests. It's ready for a full review now :)
@timebertt sorry for the dumb question, but was there any additional code introduced for NFS filtering? You raised that as a point, but we can't find the same. I am assuming that the standard filtering that is currently implemented in the provider implementation for `Driver.GetVolumeIDs` already does this job, right? Or am I mistaken?
> was there any additional code introduced for NFS filtering? [...] I am assuming that the standard filtering that is currently implemented in the provider implementation for `Driver.GetVolumeIDs` already does this job, right?
Yes, the filtering in `Driver.GetVolumeIDs` does the job. Before this PR, the filtering was only applied to the input of `waitForDetach` but not to the input of `waitForReattach`.
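For context, here is a simplified sketch of the kind of filtering a provider does (not the actual mcm-provider-openstack code; the signature is reduced to plain slices instead of the driver's request/response types):

```go
package provider

import corev1 "k8s.io/api/core/v1"

const cinderCSIDriverName = "cinder.csi.openstack.org"

// getVolumeIDs sketches the provider-side filtering discussed above: only
// cinder volumes (in-tree or CSI) are mapped to volume IDs, every other PV
// (NFS, local-path, other CSI drivers, ...) is dropped from the result.
func getVolumeIDs(pvSpecs []*corev1.PersistentVolumeSpec) []string {
	var volumeIDs []string
	for _, spec := range pvSpecs {
		switch {
		case spec.Cinder != nil:
			// In-tree cinder volume.
			volumeIDs = append(volumeIDs, spec.Cinder.VolumeID)
		case spec.CSI != nil && spec.CSI.Driver == cinderCSIDriverName:
			// CSI cinder volume: the volume handle is the cinder volume ID.
			volumeIDs = append(volumeIDs, spec.CSI.VolumeHandle)
		}
	}
	return volumeIDs
}
```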
@elankath can you take another look please?
@timebertt Just to confirm: was the force-drain logic also tested? One can test this by using `k label mc <machineName> force-deletion=true`. This does deletion of volume attachments, which won't be valid for this use-case? Also - just to confirm for clarity of understanding - there is no explicit filtering for NFS volumes anywhere in MCM, right, i.e. we are depending on the provider impl doing the filtering.
Thanks!
> was the force-drain logic also tested? One can test this by using `k label mc <machineName> force-deletion=true`. This does deletion of volume attachments, which won't be valid for this use-case?
I just tested the force deletion of a machine that had three cinder volumes and one local-path volume (see steps to reproduce: https://github.com/gardener/machine-controller-manager/pull/937#issuecomment-2404928585).
MCM successfully deleted all pods and `VolumeAttachments` on the node as expected:
```text
I1018 06:33:31.228271 1 machine_util.go:1121] Force delete/drain has been triggerred for machine "shoot--garden--s-eu01-000-worker-z2-54c6c-dtstr" with providerID "openstack:///RegionOne/40370abf-174b-4872-83fa-343339a0ccaa" and backing node "shoot--garden--s-eu01-000-worker-z2-54c6c-dtstr" due to Label:true, timeout:false
I1018 06:33:31.421384 1 drain.go:490] Forceful eviction of pods on the node: "shoot--garden--s-eu01-000-worker-z2-54c6c-dtstr"
I1018 06:33:31.421736 1 drain.go:1028] Evicting pod garden/vali-0 from node "shoot--garden--s-eu01-000-worker-z2-54c6c-dtstr"
I1018 06:33:31.421798 1 drain.go:387] Attempting to force-delete the pod:"vali-0" from node "shoot--garden--s-eu01-000-worker-z2-54c6c-dtstr"
I1018 06:33:31.422110 1 drain.go:1028] Evicting pod default/test-0 from node "shoot--garden--s-eu01-000-worker-z2-54c6c-dtstr"
I1018 06:33:31.422129 1 drain.go:387] Attempting to force-delete the pod:"test-0" from node "shoot--garden--s-eu01-000-worker-z2-54c6c-dtstr"
I1018 06:33:31.489072 1 drain.go:228] Machine drain ended on 2024-10-18 06:33:31.489041721 +0000 UTC m=+515.338557327 and took 216.774911ms for "shoot--garden--s-eu01-000-worker-z2-54c6c-dtstr"
I1018 06:33:31.489190 1 machine_util.go:1185] Drain successful for machine "shoot--garden--s-eu01-000-worker-z2-54c6c-dtstr" ,providerID "openstack:///RegionOne/40370abf-174b-4872-83fa-343339a0ccaa", backing node "shoot--garden--s-eu01-000-worker-z2-54c6c-dtstr".
I1018 06:33:36.515974 1 machine.go:116] reconcileClusterMachine: Start for "shoot--garden--s-eu01-000-worker-z2-54c6c-dtstr" with phase:"Terminating", description:"Force Drain successful. Delete Volume Attachments"
I1018 06:33:36.516125 1 machine_util.go:1683] (deleteVolumeAttachmentsForNode) Deleting #3 VolumeAttachment(s) for node "shoot--garden--s-eu01-000-worker-z2-54c6c-dtstr"
I1018 06:33:36.562143 1 machine_util.go:1269] (deleteNodeVolAttachments) Successfully deleted all volume attachments for node "shoot--garden--s-eu01-000-worker-z2-54c6c-dtstr", machine "shoot--garden--s-eu01-000-worker-z2-54c6c-dtstr"
```
I looked into the code and found that the deletion of `VolumeAttachments` in case of a force deletion is not related to the drain code that I changed in this PR. It also doesn't use the driver at all and just looks at the `VolumeAttachment` objects.
Why do you think that this functionality might be affected by this PR?
> Also - just to confirm for clarity of understanding - there is no explicit filtering for NFS volumes anywhere in MCM, right, i.e. we are depending on the provider impl doing the filtering.
Yes, exactly.
Thanks for the clarification and test Tim. (The test is not related to your PR - just a basic sanity check to confirm that the force-deletion flow was not impacted by the drain changes.)
I think https://github.com/gardener/machine-controller-manager/pull/937#discussion_r1801219293 is still open. Can you take a look at my thoughts?
No, I believe there is no need to restructure. Ideally we should have had a better response instead of just a `[]string`, but that will need to wait for the MCM overhaul.
Tests passed on AWS. I'll fix https://github.com/gardener/machine-controller-manager/issues/945. Thanks for raising this issue.
/kind bug
What this PR does / why we need it:
When draining nodes, MCM already ignores provider-unrelated PVs when waiting for detachment. However, it does wait for such PVs to reattach to other nodes until running into a timeout.
This PR ensures that the same filter of the driver is used for tracking reattachments of PVs as well.
For example, when draining nodes where NFS volumes are used, MCM would wait for the reattachment of these volumes. There is no such thing as an attachment to a node for NFS volumes; they only need to be mounted (by kubelet). Accordingly, MCM always runs into the configured PV reattach timeout if there are pods with NFS PVCs on the to-be-drained node. This severely prolongs the drain operation of such nodes.
Which issue(s) this PR fixes: Fixes #
Special notes for your reviewer:
cc @xoxys
Release note: