Scalability issues with MCM

rishabh-11 commented 2 weeks ago

How to categorize this issue?

/area control-plane /area scalability /kind enhancement /priority 1

What happened: In the recent live update, we saw that the worker pools for our seeds were replaced with new ones concurrently. This caused the deletion of old machine-level objects (machine deployments, machine sets, machine classes and machines) and subsequent creation of new machine objects. All the old machines across worker pools went to Terminating state simultaneously.

MCM uses a single shared queue with around 50 workers picking up items from it, so potentially long-running operations such as node drains caused massive throttling. With the workers blocked in drain operations, the create requests were stuck in the queue because no worker was available to process them.

The drain timeout was 2 hrs, but draining took more than 4 hrs because of https://github.com/gardener/machine-controller-manager/issues/785. The fix for that issue is part of MCM version 0.54.0, which had not yet reached the live landscape with the corresponding mcm-provider release.

In the recent live update:

5-6 new worker pools were introduced to replace the existing 2 worker pools.
We observed more than 100 machines in the Terminating state.
Around 50 create machine requests were stuck in the queue.
For around 4 hrs, due to long drain times and throttling, the machines were stuck in drain.

What you expected to happen: MCM should scale well beyond 100 concurrent deletion/creation requests.

hoeltcl commented 1 week ago

Referenced in PTASK0034014 as a preventive measure. Do you already have an ETA?

gardener-robot commented 1 week ago

@hoeltcl You have mentioned internal references in the public. Please check.

gardener / machine-controller-manager

Scalability issues with MCM #943