gardener / gardener

Homogeneous Kubernetes clusters at scale on any infrastructure using hosted control planes.
https://gardener.cloud
Apache License 2.0

GRM on Seeds might end up in a crash-loop when their `kube-apiserver` domain starts resolving to a different IP #9528

Closed oliver-goetz closed 1 month ago

oliver-goetz commented 7 months ago

How to categorize this issue?

/area robustness /kind bug

What happened: On Seeds which are also Shoots (i.e. where the `KUBERNETES_SERVICE_HOST` environment variable is set), the GRM deployed in the `garden` namespace can end up in a crash loop when the domain of its own `kube-apiserver` suddenly resolves to a different IP.

Egress traffic of the GRM pods on these Seeds is restricted by the `allow-to-runtime-apiserver` NetworkPolicy in the `garden` namespace. This NetworkPolicy allows egress traffic on port 443 to the endpoints of `kubernetes.default.svc` and to the IPs resolved from the `kube-apiserver` domain (ref). If the `kube-apiserver` domain starts resolving to a different IP and the old IP is no longer reachable, GRM will crash after a while because it loses access to its `kube-apiserver`. The new GRM pods try to reach the `kube-apiserver` via its new IP, but this is not allowed by the NetworkPolicy, which still contains only the old IP. GRM cannot recover until the NetworkPolicy is updated.

The NetworkPolicy controller in gardenlet is responsible for updating the `allow-to-runtime-apiserver` NetworkPolicy. When the `kube-apiserver` IP changes, gardenlet needs to restart too. Usually this is not a problem because gardenlet has a NetworkPolicy which allows all egress traffic to port 443. However, there is a scenario in which gardenlet is not able to restart properly: gardenlet requires the GRM HA webhook to be available on startup, otherwise it panics immediately:

{"level":"info","ts":"2024-04-22T17:47:57.313Z","msg":"Wait completed, proceeding to shutdown the manager"}
panic: 1 error occurred:
    * Internal error occurred: failed calling webhook "high-availability-config.resources.gardener.cloud": failed to call webhook: Post "https://gardener-resource-manager.garden.svc:443/webhooks/high-availability-config?timeout=10s": no endpoints available for service "gardener-resource-manager"

goroutine 1 [running]:
main.main()
    github.com/gardener/gardener/cmd/gardenlet/main.go:30 +0x4d

The panic happens before the `allow-to-runtime-apiserver` NetworkPolicy is updated. This is probably a race between the NetworkPolicy controller and the seed controller which the latter always wins, because DNS resolution takes some time.

Thus, there is a race between the restarts of gardenlet and GRM. If gardenlet restarts first, the GRM webhook is still running and gardenlet can update the `allow-to-runtime-apiserver` NetworkPolicy. If GRM restarts first, GRM and gardenlet get stuck in a crash loop.
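To make the failure mode concrete, the egress rule that blocks the new GRM pods looks roughly like the sketch below. Only the policy name and namespace are taken from the description above; the pod selector label and the IP are hypothetical and are meant solely to show why a changed DNS resolution stays blocked until the policy is reconciled again.

```yaml
# Sketch only - the real allow-to-runtime-apiserver policy is generated by gardenlet.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-to-runtime-apiserver
  namespace: garden
spec:
  podSelector:
    matchLabels:
      networking.resources.gardener.cloud/to-runtime-apiserver: allowed  # assumed selector label
  policyTypes:
  - Egress
  egress:
  - to:
    - ipBlock:
        cidr: 203.0.113.10/32  # example: IP the kube-apiserver domain resolved to at reconciliation time
    ports:
    - protocol: TCP
      port: 443
# If the domain later resolves to e.g. 203.0.113.20, that IP is not covered
# until gardenlet reconciles the policy again.
```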

This situation can only be resolved manually, e.g. by updating the IP in the `allow-to-runtime-apiserver` NetworkPolicy or by creating a temporary NetworkPolicy for GRM (see the sketch below).
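A temporary break-glass policy for the second manual option could look roughly like this; the policy name and the GRM pod label are assumptions, the point is only the shape: egress from the GRM pods to any IP on port 443, so GRM can reach the `kube-apiserver` under its new address until the regular reconciliation catches up.

```yaml
# Temporary break-glass policy (sketch): allow GRM egress on port 443 to any IP.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-grm-to-any-apiserver  # hypothetical name
  namespace: garden
spec:
  podSelector:
    matchLabels:
      app: gardener-resource-manager  # assumed label of the GRM pods
  policyTypes:
  - Egress
  egress:
  - to:
    - ipBlock:
        cidr: 0.0.0.0/0  # any destination IP
    ports:
    - protocol: TCP
      port: 443
```

Once gardenlet has updated `allow-to-runtime-apiserver` with the new IP, such a temporary policy can be deleted again.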

What you expected to happen: GRM should be able to handle IP changes of its kube-apiserver.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?: We could prevent this from happening by creating a dedicated NetworkPolicy for GRM which allows egress traffic on port 443 to any IP. Another option could be for gardenlet to start its NetworkPolicy controller earlier, so that it can update the network policies before it requires the GRM webhook to be available.

Environment:

vlerenc commented 6 months ago

That is super interesting, a cycle of death. Thank you for reporting. Maybe we just witnessed something similar this weekend: GRM couldn't reach its KAPI (pod egress traffic was affected, while the network was otherwise fine from the host network) and the situation did not auto-resolve.

In general: The controller that normally updates netpols (GRM) should not also update the netpol it is itself subjected to (allow-to-runtime-apiserver) or else the system cannot (self-/remote-)heal anymore.

We should change that/defuse this time bomb, maybe by subjecting it to the same general netpol that the gardenlet is also subjected to?

oliver-goetz commented 6 months ago

I mixed up the different NetworkPolicy controllers. GRM does not update its own NetworkPolicy; gardenlet does. However, gardenlet can only start if the GRM webhook is available. Thus, if gardenlet is not able to update the `allow-to-runtime-apiserver` NetworkPolicy before GRM enters the crash loop, and gardenlet then restarts for any reason, the seed can no longer get out of this situation on its own.

I updated the issue accordingly.

gardener-ci-robot commented 3 months ago

The Gardener project currently lacks enough active contributors to adequately respond to all issues. This bot triages issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`

/lifecycle stale

gardener-ci-robot commented 2 months ago

The Gardener project currently lacks enough active contributors to adequately respond to all issues. This bot triages issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle rotten`
- Close this issue with `/close`

/lifecycle rotten

gardener-ci-robot commented 1 month ago

The Gardener project currently lacks enough active contributors to adequately respond to all issues. This bot triages issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`

/close

gardener-prow[bot] commented 1 month ago

@gardener-ci-robot: Closing this issue.

In response to [this](https://github.com/gardener/gardener/issues/9528#issuecomment-2362106809):

> The Gardener project currently lacks enough active contributors to adequately respond to all issues.
> This bot triages issues according to the following rules:
>
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
>
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.