@timebertt Thank you for your contribution.
Thank you @timebertt for your contribution. Before I can start building your PR, a member of the organization must set the required label(s) {'reviewed/ok-to-test'}. Once started, you can check the build status in the PR checks section below.
We probably need to add tests for this PR. However, we wanted to open it to get some feedback on whether this is the correct approach/place to solve the issue.
@timebertt do you think we should generalise this to also skip the check in case the CSIDriver is configured with `attachRequired == false`? (Not in this PR even)
> @timebertt do you think we should generalise this to also skip the check in case the CSIDriver is configured with `attachRequired == false`? (Not in this PR even)
Sounds good. I had the impression that the attachment tracking doesn't work at all if there are no `VolumeAttachment` objects, i.e., no CSI driver is used, and the operation just runs into the configured timeout. Is this correct?
If so, we should probably change this PR from "skip tracking NFS volumes" to "skip tracking non-CSI volumes". Another PR could then enhance the check for CSI drivers where no `VolumeAttachments` are used.
@kon-angelo @gardener/mcm-maintainers can you provide feedback on the PR and the above discussion?
> @timebertt do you think we should generalise this to also skip the check in case the CSIDriver is configured with `attachRequired == false`? (Not in this PR even)
I agree with @kon-angelo's observation that this can be handled generically. @timebertt since you have raised this PR, we can also do this in 2 steps: skip NFS volumes (as you have done in this PR) and then generalise later, since generalisation would require an additional lookup of the `CSIDriver` object and its `AttachRequired` property to decide whether the wait for detachment of such a PV can be skipped.
Aside from my previous point, I kind of understand @timebertt's suggestion. The MCM code relies on `VolumeAttachments`, which to my knowledge are only created for CSI volumes. We could probably update the logic in this PR to skip anything other than CSI volumes. (But my experience is just from going through the code, and this should be verified by one of the actual MCM maintainers.)
Though one point is that I am not particularly fond of the fact that we "skip" reporting the existence of volumes, rather than not accounting for them when draining - if that makes sense. This can be somewhat troubling to debug, I find. But the actual function doing the work, `evictPodsWithPVInternal`, which also uses `getVolIDsFromDriver`, is used in many places, and so I understand this implementation.
I just feel that at some point we will go down a debugging rabbit hole, because `getVolIDsFromDriver` also does a little more than the name suggests, which is the skipping part 🤷
Back from vacation, sorry for the delay.
> I just feel that at some point we will go down a debugging rabbit hole, because `getVolIDsFromDriver` also does a little more than the name suggests, which is the skipping part 🤷
Got your point. I will rework the PR to skip any non-CSI volumes instead of handling NFS as a special case, improve the code flow (where the filtering is done), and add unit tests.
Another PR can be opened later to consult the `CSIDriver` object for the `attachRequired == false` check.
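For reference, a minimal sketch of what such a follow-up check could look like (the helper name and placement are hypothetical; it only illustrates consulting the `CSIDriver` object's `attachRequired` field):

```go
package drain

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// attachmentTrackingNeeded is a hypothetical helper: it returns false for
// volumes where no VolumeAttachment objects are expected, i.e. non-CSI volumes
// and CSI volumes whose CSIDriver declares attachRequired: false.
func attachmentTrackingNeeded(ctx context.Context, client kubernetes.Interface, pv *corev1.PersistentVolume) (bool, error) {
	csiSource := pv.Spec.PersistentVolumeSource.CSI
	if csiSource == nil {
		// Non-CSI volume: no VolumeAttachment will ever be created for it.
		return false, nil
	}

	csiDriver, err := client.StorageV1().CSIDrivers().Get(ctx, csiSource.Driver, metav1.GetOptions{})
	if err != nil {
		return false, err
	}

	// attachRequired defaults to true when unset.
	return csiDriver.Spec.AttachRequired == nil || *csiDriver.Spec.AttachRequired, nil
}
```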
While looking into the filtering again, I noticed a misunderstanding on my side. In short, the filtering I introduced in this PR and the filtering we envisioned wouldn't solve the problem at hand.
The actual problem is that the `PodVolumeInfo` returned by `getPodVolumeInfos` has two different lists of volumes:

- `persistentVolumeList` contains the PV names of all PVCs referenced in the pod. This list is not filtered and is only used in `waitForReattach`.
- `volumeList` contains the volume IDs/handles corresponding to `persistentVolumeList`. It can also be a subset thereof, as it is filtered in `Driver.GetVolumeIDs` when mapping PV names to IDs. E.g., for OpenStack, this list only contains cinder volumes (in-tree or CSI). This list is only used in `waitForDetach`.

See `doAccountingOfPvs`: https://github.com/gardener/machine-controller-manager/blob/97cd75245dacc4e40327ded2f28a37a4a3f42cf1/pkg/util/provider/drain/drain.go#L510-L542
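For illustration, the shape described above looks roughly like this (a sketch using the field names from this discussion; the actual types in drain.go may differ):

```go
// Sketch of the problematic split before this PR: two parallel lists that are
// filtered differently.
type PodVolumeInfo struct {
	// PV names of all PVCs referenced by the pod; not filtered by the driver.
	// Only used in waitForReattach.
	persistentVolumeList []string

	// Volume IDs/handles returned by Driver.GetVolumeIDs, i.e. filtered down
	// to provider-related volumes. Only used in waitForDetach.
	volumeList []string
}
```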
So when doing the filtering in `getVolIDsFromDriver` (as originally proposed by this PR), we only filter the `volumeList` used in `waitForDetach`. However, this list never contained NFS or non-CSI volumes, as it is already filtered by the driver. The problem is that `persistentVolumeList` is not filtered and the machine controller waits for reattachment of NFS or non-CSI volumes until it runs into a timeout.
To solve the problem, I reworked the PR so that there is only a single volume list in `PodVolumeInfo` that contains both pieces of information (PV name and volume ID). A PV is only added to the list if the driver returns a volume ID for it. Otherwise, we don't track detachment (this was already the case before) and don't track reattachment (new, this is the fix).
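In Go terms, the reworked shape looks roughly like this (an illustrative sketch with made-up helper names, not the exact code of this PR):

```go
// volumeInfo pairs the PV name with the driver-provided volume ID
// (illustrative names; the actual fields in drain.go may differ).
type volumeInfo struct {
	persistentVolumeName string
	volumeID             string
}

// PodVolumeInfo now carries a single list that is used for both detach and
// reattach tracking.
type PodVolumeInfo struct {
	volumes []volumeInfo
}

// buildPodVolumeInfo (hypothetical helper) adds a PV only if the driver
// returned a volume ID for it, so provider-unrelated volumes are neither
// tracked for detachment nor for reattachment.
func buildPodVolumeInfo(pvNames []string, volumeIDsByPVName map[string]string) PodVolumeInfo {
	info := PodVolumeInfo{}
	for _, pvName := range pvNames {
		if volID, ok := volumeIDsByPVName[pvName]; ok {
			info.volumes = append(info.volumes, volumeInfo{
				persistentVolumeName: pvName,
				volumeID:             volID,
			})
		}
	}
	return info
}
```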
The PR is still missing unit tests. First, I want to reproduce the issue on a test cluster and ensure that the change fixes the issue. In the meantime, feedback on the change is highly appreciated.
I reproduced the issue described above on an OpenStack shoot using the following steps.

First, deploy local-path-provisioner, which provisions PVs that are not handled by the OpenStack provider's `GetVolumeIDs`:

```shell
kustomize build "github.com/rancher/local-path-provisioner/deploy?ref=v0.0.30" | kubectl apply -f -
```

Then, deploy a StatefulSet that uses a `local-path` PVC:
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  creationTimestamp: null
  labels:
    app: test
  name: test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: test
    spec:
      containers:
      - name: volume-test
        image: nginx:stable-alpine
        imagePullPolicy: IfNotPresent
        volumeMounts:
        - name: test
          mountPath: /data
        ports:
        - containerPort: 80
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      creationTimestamp: null
      name: test
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 128Mi
      storageClassName: local-path
```
Check the pod and PVC, then delete the machine backing the node and follow the MCM logs:

```shell
$ k get po -owide
NAME     READY   STATUS    RESTARTS   AGE   IP              NODE                                              NOMINATED NODE   READINESS GATES
test-0   1/1     Running   0          50s   100.80.121.84   shoot--garden--s-eu01-000-worker-z2-54c6c-l7q9p   <none>           <none>
$ k get pvc
NAME          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
test-test-0   Bound    pvc-283f2aef-3066-4b23-bc61-30e750ea5340   128Mi      RWO            local-path     4m30s
$ gardenctl target control-plane
$ k delete machine shoot--garden--s-eu01-000-worker-z2-54c6c-l7q9p
$ stern machine | grep -e shoot--garden--s-eu01-000-worker-z2-54c6c-l7q9p -e attach -e detach
machine-controller-manager-845b9d96f6-wl7dq machine-controller-manager-provider-openstack I1010 12:14:13.429520 1 machine_util.go:1104] Normal delete/drain has been triggerred for machine "shoot--garden--s-eu01-000-worker-z2-54c6c-l7q9p" with providerID "openstack:///RegionOne/0edc5a3c-7ea3-4fd8-9b45-a75e386f178a" and backing node "shoot--garden--s-eu01-000-worker-z2-54c6c-l7q9p" with drain-timeout:2h0m0s & maxEvictRetries:360
machine-controller-manager-845b9d96f6-wl7dq machine-controller-manager-provider-openstack W1010 12:14:13.634676 1 drain.go:834] Node name: "shoot--garden--s-eu01-000-worker-z2-54c6c-l7q9p", list of pod PVs to wait for detach: []
machine-controller-manager-845b9d96f6-wl7dq machine-controller-manager-provider-openstack I1010 12:14:13.634699 1 drain.go:914] Waiting for following volumes to reattach: [pvc-283f2aef-3066-4b23-bc61-30e750ea5340]
machine-controller-manager-845b9d96f6-wl7dq machine-controller-manager-provider-openstack W1010 12:15:43.635770 1 drain.go:932] Timeout occurred while waiting for PVs [pvc-283f2aef-3066-4b23-bc61-30e750ea5340] to reattach to a different node
machine-controller-manager-845b9d96f6-wl7dq machine-controller-manager-provider-openstack W1010 12:15:43.635857 1 drain.go:771] Timeout occurred for following volumes to reattach: [pvc-283f2aef-3066-4b23-bc61-30e750ea5340]
machine-controller-manager-845b9d96f6-wl7dq machine-controller-manager-provider-openstack I1010 12:15:43.635954 1 drain.go:781] Pod + volume detachment from node shoot--garden--s-eu01-000-worker-z2-54c6c-l7q9p + volume reattachment to another node for Pod default/test-0 took 1m30.05787152s
machine-controller-manager-845b9d96f6-wl7dq machine-controller-manager-provider-openstack I1010 12:15:43.636047 1 drain.go:200] Machine drain ended on 2024-10-10 12:15:43.636038402 +0000 UTC m=+2075.047348776 and took 1m30.163757152s for "shoot--garden--s-eu01-000-worker-z2-54c6c-l7q9p"
```
The logs show that `waitForDetach` skips the non-cinder volume, as `volumeList` is filtered correctly: `list of pod PVs to wait for detach: []`.
However, `waitForReattach` still waits for the volume to reattach, as `persistentVolumeList` is not filtered: `Waiting for following volumes to reattach: [pvc-283f2aef-3066-4b23-bc61-30e750ea5340]`.
With this, MCM runs into the configured `--machine-pv-reattach-timeout` (90s by default).
These 90 seconds of unnecessary reattachment tracking add up quickly when rolling node pools with many provider-unrelated PVs.
Now, I verified my fix with a manually built image of mcm-provider-openstack and the steps described above.
With the change, both lists for detach and reattach tracking are empty and the drain operation finishes quickly without considering the non-cinder volume (as desired):
```text
I1010 15:57:11.162420 1 machine_util.go:1130] Normal delete/drain has been triggerred for machine "shoot--garden--s-eu01-000-worker-z1-6f5b7-cbq7j" with providerID "openstack:///RegionOne/accb65ba-9cae-47b5-984d-0598daaae381" and backing node "shoot--garden--s-eu01-000-worker-z1-6f5b7-cbq7j" with drain-timeout:2h0m0s & maxEvictRetries:360
W1010 15:57:11.329585 1 drain.go:865] Node name: "shoot--garden--s-eu01-000-worker-z1-6f5b7-cbq7j", list of pod PVs to wait for detach: []
W1010 15:57:11.329639 1 drain.go:941] List of pod PVs waiting for reattachment is 0: []
I1010 15:57:11.329721 1 drain.go:810] Pod + volume detachment from node shoot--garden--s-eu01-000-worker-z1-6f5b7-cbq7j + volume reattachment to another node for Pod default/test-0 took 30.042329ms
I1010 15:57:11.329761 1 drain.go:228] Machine drain ended on 2024-10-10 15:57:11.329751778 +0000 UTC m=+486.934211889 and took 126.250956ms for "shoot--garden--s-eu01-000-worker-z1-6f5b7-cbq7j"
```
I also tested the change with a node that had 3 cinder volumes attached:
```text
I1010 15:59:46.496527 1 machine_util.go:1130] Normal delete/drain has been triggerred for machine "shoot--garden--s-eu01-000-worker-z1-6f5b7-459p7" with providerID "openstack:///RegionOne/1fa6c33c-cb16-4ae9-b70e-7387aaf938cf" and backing node "shoot--garden--s-eu01-000-worker-z1-6f5b7-459p7" with drain-timeout:2h0m0s & maxEvictRetries:360
I1010 15:59:46.684754 1 drain.go:574] PodVolumeInfos: map[garden/prometheus-aggregate-0:{[{pv-shoot--garden--s-eu01-000-5ca2886d-85d9-46d6-8ef3-175d194aaa03 668f1bc6-3c5e-4683-bd34-b26d0874ffd1}]} garden/prometheus-cache-0:{[{pv-shoot--garden--s-eu01-000-21332da3-f21d-41d3-8c06-f2fdef77c9cf b3c6dc59-ec47-4486-bc0a-844824a5081e}]} garden/vali-0:{[{pv-shoot--garden--s-eu01-000-cb1a711c-1c05-468a-8266-025acf81dff3 3aad22f5-cade-44a2-aae9-bf8305b73bc1}]}]
I1010 15:59:46.753823 1 drain.go:763] Pod eviction/deletion for Pod garden/prometheus-aggregate-0 in Node "shoot--garden--s-eu01-000-worker-z1-6f5b7-459p7" and took 68.895848ms. Now waiting for volume detachment.
I1010 15:59:46.753853 1 drain.go:869] Waiting for following volumes to detach: [{pv-shoot--garden--s-eu01-000-5ca2886d-85d9-46d6-8ef3-175d194aaa03 668f1bc6-3c5e-4683-bd34-b26d0874ffd1}]
I1010 15:59:46.909744 1 drain.go:892] No of attached volumes for node "shoot--garden--s-eu01-000-worker-z1-6f5b7-459p7" is [{kubernetes.io/csi/cinder.csi.openstack.org^b3c6dc59-ec47-4486-bc0a-844824a5081e } {kubernetes.io/csi/cinder.csi.openstack.org^668f1bc6-3c5e-4683-bd34-b26d0874ffd1 } {kubernetes.io/csi/cinder.csi.openstack.org^3aad22f5-cade-44a2-aae9-bf8305b73bc1 }]
I1010 15:59:46.909938 1 drain.go:908] Found volume "668f1bc6-3c5e-4683-bd34-b26d0874ffd1" still attached to node "shoot--garden--s-eu01-000-worker-z1-6f5b7-459p7". Will re-check in 5s
...
I1010 15:59:53.514517 1 volume_attachment.go:42] Dispatching request for PV pv-shoot--garden--s-eu01-000-5ca2886d-85d9-46d6-8ef3-175d194aaa03
I1010 15:59:56.928776 1 drain.go:892] No of attached volumes for node "shoot--garden--s-eu01-000-worker-z1-6f5b7-459p7" is [{kubernetes.io/csi/cinder.csi.openstack.org^b3c6dc59-ec47-4486-bc0a-844824a5081e } {kubernetes.io/csi/cinder.csi.openstack.org^3aad22f5-cade-44a2-aae9-bf8305b73bc1 }]
I1010 15:59:56.929045 1 drain.go:921] Detached volumes [{pv-shoot--garden--s-eu01-000-5ca2886d-85d9-46d6-8ef3-175d194aaa03 668f1bc6-3c5e-4683-bd34-b26d0874ffd1}] from node "shoot--garden--s-eu01-000-worker-z1-6f5b7-459p7"
I1010 15:59:56.929243 1 drain.go:787] Pod + volume detachment from Node garden for Pod prometheus-aggregate-0/shoot--garden--s-eu01-000-worker-z1-6f5b7-459p7 and took 10.244327791s
I1010 15:59:56.929338 1 drain.go:945] Waiting for following volumes to reattach: [{pv-shoot--garden--s-eu01-000-5ca2886d-85d9-46d6-8ef3-175d194aaa03 668f1bc6-3c5e-4683-bd34-b26d0874ffd1}]
```
Now that the fix is verified and I'm confident in the change, I only need to add unit tests.
While verifying my change, I discovered another bug (unrelated to this change, but it breaks the reattach tracking): https://github.com/gardener/machine-controller-manager/issues/945
Hey @gardener/mcm-maintainers @kon-angelo. I finished verifying this PR and adding tests. It's ready for a full review now :)
@timebertt sorry for the dumb question, but was there any additional code introduced for NFS filtering? You raised that as a point, but we can't find the same. I am assuming that the standard filtering that is currently implemented in the provider implementation for `Driver.GetVolumeIDs` already does this job, right? Or am I mistaken?
> was there any additional code introduced for NFS filtering? [...] I am assuming that the standard filtering that is currently implemented in the provider implementation for `Driver.GetVolumeIDs` already does this job, right?
Yes, the filtering in `Driver.GetVolumeIDs` does the job. Before this PR, the filtering was only applied to the input of `waitForDetach` but not to the input of `waitForReattach`.
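For context, here is a simplified sketch of the kind of filtering a provider does (not the actual mcm-provider-openstack code; the signature is reduced to plain slices instead of the driver's request/response types):

```go
package provider

import corev1 "k8s.io/api/core/v1"

const cinderCSIDriverName = "cinder.csi.openstack.org"

// getVolumeIDs sketches the provider-side filtering discussed above: only
// cinder volumes (in-tree or CSI) are mapped to volume IDs, every other PV
// (NFS, local-path, other CSI drivers, ...) is dropped from the result.
func getVolumeIDs(pvSpecs []*corev1.PersistentVolumeSpec) []string {
	var volumeIDs []string
	for _, spec := range pvSpecs {
		switch {
		case spec.Cinder != nil:
			// In-tree cinder volume.
			volumeIDs = append(volumeIDs, spec.Cinder.VolumeID)
		case spec.CSI != nil && spec.CSI.Driver == cinderCSIDriverName:
			// CSI cinder volume: the volume handle is the cinder volume ID.
			volumeIDs = append(volumeIDs, spec.CSI.VolumeHandle)
		}
	}
	return volumeIDs
}
```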
@elankath can you take another look please?
@timebertt Just to confirm: was the force-drain logic also tested? One can test this by using `k label mc <machineName> force-deletion=true`. This does deletion of volume attachments, which won't be valid for this use-case? Also - just to confirm for clarity of understanding - there is no explicit filtering for NFS volumes anywhere in MCM, right, i.e. we are depending on the provider impl doing the filtering.
Thanks!
> was the force-drain logic also tested? One can test this by using `k label mc <machineName> force-deletion=true`. This does deletion of volume attachments, which won't be valid for this use-case?
I just tested the force deletion of a machine that had three cinder volumes and one local-path volume (see steps to reproduce: https://github.com/gardener/machine-controller-manager/pull/937#issuecomment-2404928585).
MCM successfully deleted all pods and `VolumeAttachments` on the node as expected:
```text
I1018 06:33:31.228271 1 machine_util.go:1121] Force delete/drain has been triggerred for machine "shoot--garden--s-eu01-000-worker-z2-54c6c-dtstr" with providerID "openstack:///RegionOne/40370abf-174b-4872-83fa-343339a0ccaa" and backing node "shoot--garden--s-eu01-000-worker-z2-54c6c-dtstr" due to Label:true, timeout:false
I1018 06:33:31.421384 1 drain.go:490] Forceful eviction of pods on the node: "shoot--garden--s-eu01-000-worker-z2-54c6c-dtstr"
I1018 06:33:31.421736 1 drain.go:1028] Evicting pod garden/vali-0 from node "shoot--garden--s-eu01-000-worker-z2-54c6c-dtstr"
I1018 06:33:31.421798 1 drain.go:387] Attempting to force-delete the pod:"vali-0" from node "shoot--garden--s-eu01-000-worker-z2-54c6c-dtstr"
I1018 06:33:31.422110 1 drain.go:1028] Evicting pod default/test-0 from node "shoot--garden--s-eu01-000-worker-z2-54c6c-dtstr"
I1018 06:33:31.422129 1 drain.go:387] Attempting to force-delete the pod:"test-0" from node "shoot--garden--s-eu01-000-worker-z2-54c6c-dtstr"
I1018 06:33:31.489072 1 drain.go:228] Machine drain ended on 2024-10-18 06:33:31.489041721 +0000 UTC m=+515.338557327 and took 216.774911ms for "shoot--garden--s-eu01-000-worker-z2-54c6c-dtstr"
I1018 06:33:31.489190 1 machine_util.go:1185] Drain successful for machine "shoot--garden--s-eu01-000-worker-z2-54c6c-dtstr" ,providerID "openstack:///RegionOne/40370abf-174b-4872-83fa-343339a0ccaa", backing node "shoot--garden--s-eu01-000-worker-z2-54c6c-dtstr".
I1018 06:33:36.515974 1 machine.go:116] reconcileClusterMachine: Start for "shoot--garden--s-eu01-000-worker-z2-54c6c-dtstr" with phase:"Terminating", description:"Force Drain successful. Delete Volume Attachments"
I1018 06:33:36.516125 1 machine_util.go:1683] (deleteVolumeAttachmentsForNode) Deleting #3 VolumeAttachment(s) for node "shoot--garden--s-eu01-000-worker-z2-54c6c-dtstr"
I1018 06:33:36.562143 1 machine_util.go:1269] (deleteNodeVolAttachments) Successfully deleted all volume attachments for node "shoot--garden--s-eu01-000-worker-z2-54c6c-dtstr", machine "shoot--garden--s-eu01-000-worker-z2-54c6c-dtstr"
```
I looked into the code and found that the deletion of `VolumeAttachments` in case of a force deletion is not related to the drain code that I changed in this PR. It also doesn't use the driver at all and just looks at the `VolumeAttachment` objects.
Why do you think that this functionality might be affected by this PR?
> Also - just to confirm for clarity of understanding - there is no explicit filtering for NFS volumes anywhere in MCM, right, i.e. we are depending on the provider impl doing the filtering.
Yes, exactly.
Thanks for the clarification and test Tim. (The test is not related to your PR - just a basic sanity check to confirm that the force-deletion flow was not impacted by the drain changes.)
I think https://github.com/gardener/machine-controller-manager/pull/937#discussion_r1801219293 is still open. Can you take a look at my thoughts?
No, I believe there is no need to restructure. Ideally we should have had a better response instead of just a `[]string`, but that will need to wait for the MCM overhaul.
Tests passed on AWS. I'll fix https://github.com/gardener/machine-controller-manager/issues/945. Thanks for raising this issue.
/kind bug
What this PR does / why we need it:
When draining nodes, MCM already ignores provider-unrelated PVs when waiting for detachment. However, it does wait for such PVs to reattach to other nodes until running into a timeout.
This PR ensures that the same filter of the driver is used for tracking reattachments of PVs as well.
For example, when draining nodes where NFS volumes are used, MCM would wait for the reattachment of these volumes. There is no such thing as an attachment to a node for NFS volumes; they only need to be mounted (by kubelet). Accordingly, MCM always runs into the configured PV reattach timeout if there are pods with NFS PVCs on the to-be-drained node. This severely prolongs the drain operation of such nodes.
Which issue(s) this PR fixes: Fixes #
Special notes for your reviewer:
cc @xoxys
Release note: