gardener / machine-controller-manager

MCM doesn't wait for volumes to reattach #945

Closed by timebertt 1 month ago

timebertt commented 1 month ago

How to categorize this issue?

/area robustness /kind bug /priority 2

What happened:

After waiting for volumes to detach from a terminating node, MCM is supposed to wait for the volumes to reattach to another node. Instead of waiting, it immediately reports a "timeout". This is really a context cancellation, but the logs call it a "timeout":

I1010 16:22:37.863207       1 drain.go:752] Pod + volume detachment from Node garden for Pod prometheus-aggregate-0/shoot--garden--s-eu01-000-worker-z1-6f5b7-48qxs and took 10.193009281s
I1010 16:22:37.863483       1 drain.go:908] Waiting for following volumes to reattach: [pv-shoot--garden--s-eu01-000-5ca2886d-85d9-46d6-8ef3-175d194aaa03]
I1010 16:22:37.863751       1 drain.go:931] VolumeAttachment event received for PV: pv-shoot--garden--s-eu01-000-5ca2886d-85d9-46d6-8ef3-175d194aaa03
W1010 16:22:37.865246       1 drain.go:926] Timeout occurred while waiting for PVs [pv-shoot--garden--s-eu01-000-5ca2886d-85d9-46d6-8ef3-175d194aaa03] to reattach to a different node
W1010 16:22:37.865326       1 drain.go:771] Timeout occurred for following volumes to reattach: [pv-shoot--garden--s-eu01-000-5ca2886d-85d9-46d6-8ef3-175d194aaa03]

The effect is that MCM no longer waits for volumes to reattach and continues with the next pod right away.

What you expected to happen:

MCM should wait for volumes to reattach to other nodes before evicting the next pod.

How to reproduce it (as minimally and precisely as possible):

Delete a machine with persistent volumes attached to it and observe the logs.

Anything else we need to know?:

The cause of this bug is that the context created for waitForReattach is derived from ctx (ref), which is the context created for waitForDetach. However, ctx is cancelled immediately after waitForDetach returns (ref), so the context for waitForReattach is already cancelled before the wait begins.

It seems this was broken by https://github.com/gardener/machine-controller-manager/pull/920 (cc @sssash18).

Environment: