gardener / dependency-watchdog

This controller checks the status of etcd and restarts control plane components that have been in a CrashLoopBackOff state for an extended period of time.
Apache License 2.0

Dependency watch dog has no permissions to patch deployments #97

Closed MartinWeindel closed 9 months ago

MartinWeindel commented 10 months ago

How to categorize this issue?

/area control-plane /kind bug /priority 2

What happened: On a seed, the istio-ingressgateway pods were all running on the same node and became unavailable at the same time. As a result, communication between the kubelets and the kube-apiserver was disrupted, and MCM started to remove "stale" nodes. DWD could not intervene because it lacks the permission to patch deployments:

2024-01-10 09:35:56 | {"log":"2024-01-10T09:35:56Z\tINFO\tdwd.cluster-controller\tRecording external probe failure\t{\"controller\": \"cluster\", \"object\": {\"name\":\"shoot--foo--bar\"}, \"namespace\": \"\", \"name\": \"shoot--foo--bar\", \"reconcileID\": \"c908a9ac-6587-4317-aa01-74bb34cc75b8\", \"shootNamespace\": \"shoot--foo--bar\", \"err\": \"Get \\\"https://api.dh-0g73420w.dhaas-live.internal.live.k8s.ondemand.com/version?timeout=30s\\\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)\", \"failedAttempts\": 1, \"failureThreshold\": 3}"}
2024-01-10 09:36:29 | {"log":"2024-01-10T09:36:29Z\tINFO\tdwd.cluster-controller\tInternal probe is successful\t{\"controller\": \"cluster\", \"object\": {\"name\":\"shoot--foo--bar\"}, \"namespace\": \"\", \"name\": \"shoot--foo--bar\", \"reconcileID\": \"c908a9ac-6587-4317-aa01-74bb34cc75b8\", \"shootNamespace\": \"shoot--foo--bar\", \"successfulAttempts\": 1, \"successThreshold\": 1}"}
2024-01-10 09:36:59 | {"log":"2024-01-10T09:36:59Z\tINFO\tdwd.cluster-controller\tRecording external probe failure\t{\"controller\": \"cluster\", \"object\": {\"name\":\"shoot--foo--bar\"}, \"namespace\": \"\", \"name\": \"shoot--foo--bar\", \"reconcileID\": \"c908a9ac-6587-4317-aa01-74bb34cc75b8\", \"shootNamespace\": \"shoot--foo--bar\", \"err\": \"Get \\\"https://api.dh-0g73420w.dhaas-live.internal.live.k8s.ondemand.com/version?timeout=30s\\\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)\", \"failedAttempts\": 2, \"failureThreshold\": 3}"}
2024-01-10 09:37:32 | {"log":"2024-01-10T09:37:32Z\tINFO\tdwd.cluster-controller\tInternal probe is successful\t{\"controller\": \"cluster\", \"object\": {\"name\":\"shoot--foo--bar\"}, \"namespace\": \"\", \"name\": \"shoot--foo--bar\", \"reconcileID\": \"c908a9ac-6587-4317-aa01-74bb34cc75b8\", \"shootNamespace\": \"shoot--foo--bar\", \"successfulAttempts\": 1, \"successThreshold\": 1}"}
2024-01-10 09:38:02 | {"log":"2024-01-10T09:38:02Z\tINFO\tdwd.cluster-controller\tRecording external probe failure\t{\"controller\": \"cluster\", \"object\": {\"name\":\"shoot--foo--bar\"}, \"namespace\": \"\", \"name\": \"shoot--foo--bar\", \"reconcileID\": \"c908a9ac-6587-4317-aa01-74bb34cc75b8\", \"shootNamespace\": \"shoot--foo--bar\", \"err\": \"Get \\\"https://api.dh-0g73420w.dhaas-live.internal.live.k8s.ondemand.com/version?timeout=30s\\\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)\", \"failedAttempts\": 3, \"failureThreshold\": 3}"}
2024-01-10 09:38:02 | {"log":"2024-01-10T09:38:02Z\tINFO\tdwd.cluster-controller\tExternal probe is un-healthy, checking if scale down is already done or is still pending\t{\"controller\": \"cluster\", \"object\": {\"name\":\"shoot--foo--bar\"}, \"namespace\": \"\", \"name\": \"shoot--foo--bar\", \"reconcileID\": \"c908a9ac-6587-4317-aa01-74bb34cc75b8\", \"shootNamespace\": \"shoot--foo--bar\"}"}
2024-01-10 09:38:02 | {"log":"2024-01-10T09:38:02Z\tINFO\tflow\tStarting\t{\"flow\": \"scale-down-shoot--foo--bar\"}"}
2024-01-10 09:38:02 | {"log":"2024-01-10T09:38:02Z\tINFO\tdwd.cluster-controller\tResource not found. Ignoring this resource as its existence is marked as optional\t{\"controller\": \"cluster\", \"object\": {\"name\":\"shoot--foo--bar\"}, \"namespace\": \"\", \"name\": \"shoot--foo--bar\", \"reconcileID\": \"c908a9ac-6587-4317-aa01-74bb34cc75b8\", \"resNamespace\": \"shoot--foo--bar\", \"kind\": \"Deployment\", \"apiVersion\": \"apps/v1\", \"name\": \"cluster-autoscaler\", \"level\": 0}"}
2024-01-10 09:38:02 | {"log":"2024-01-10T09:38:02Z\tERROR\tdwd.cluster-controller\tFailed to update annotation to capture the current replicas before scaling it down\t{\"controller\": \"cluster\", \"object\": {\"name\":\"shoot--foo--bar\"}, \"namespace\": \"\", \"name\": \"shoot--foo--bar\", \"reconcileID\": \"c908a9ac-6587-4317-aa01-74bb34cc75b8\", \"resNamespace\": \"shoot--foo--bar\", \"kind\": \"Deployment\", \"apiVersion\": \"apps/v1\", \"name\": \"machine-controller-manager\", \"level\": 0, \"error\": \"deployments.apps \\\"machine-controller-manager\\\" is forbidden: User \\\"system:serviceaccount:garden:dependency-watchdog-prober\\\" cannot patch resource \\\"deployments\\\" in API group \\\"apps\\\" in the namespace \\\"shoot--foo--bar\\\"\"}"}
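
The ERROR line above carries the complete RBAC denial: the service account, the verb, the resource, the API group, and the namespace. The following is a minimal Python sketch for pulling these fields out of such a line when scanning many logs. It assumes the wrapped format shown above (an outer JSON object whose "log" value is tab-separated and ends in a JSON field map). The names `parse_denial` and `FORBIDDEN_RE` are illustrative and not part of DWD, and the sample line is rebuilt from the log above rather than copied byte for byte.

```python
import json
import re

# Matches the Kubernetes "forbidden" message as it appears in the log above.
FORBIDDEN_RE = re.compile(
    r'User "(?P<user>[^"]+)" cannot (?P<verb>\S+) resource "(?P<resource>[^"]+)" '
    r'in API group "(?P<api_group>[^"]*)" in the namespace "(?P<namespace>[^"]+)"'
)


def parse_denial(line):
    """Return the RBAC denial details from one wrapped log line, or None."""
    start = line.find("{")  # skip an optional "timestamp | " prefix
    if start < 0:
        return None
    outer = json.loads(line[start:])
    # Assumed layout of "log": timestamp, level, logger, message, JSON fields.
    parts = outer.get("log", "").split("\t", 4)
    if len(parts) < 5:
        return None
    fields = json.loads(parts[4])
    match = FORBIDDEN_RE.search(fields.get("error", ""))
    return match.groupdict() if match else None


# Sample rebuilt from the ERROR line above (only the relevant fields are kept).
sample_fields = {
    "kind": "Deployment",
    "name": "machine-controller-manager",
    "error": 'deployments.apps "machine-controller-manager" is forbidden: '
             'User "system:serviceaccount:garden:dependency-watchdog-prober" '
             'cannot patch resource "deployments" in API group "apps" '
             'in the namespace "shoot--foo--bar"',
}
sample_line = json.dumps({
    "log": "\t".join([
        "2024-01-10T09:38:02Z",
        "ERROR",
        "dwd.cluster-controller",
        "Failed to update annotation to capture the current replicas before scaling it down",
        json.dumps(sample_fields),
    ])
})

denial = parse_denial(sample_line)
print(denial)
```

For the sample, the extracted verb is `patch` on `deployments` in API group `apps`, which is exactly the permission the `dependency-watchdog-prober` service account is missing.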

What you expected to happen: DWD prevents MCM from removing the nodes by scaling down the machine-controller-manager deployment.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

unmarshall commented 10 months ago

@MartinWeindel Thanks for raising this issue. We have now raised https://github.com/gardener/gardener/issues/9035 to fix this. The change needs to be made in the DWD component in g/g (gardener/gardener), not in this repository.
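
To illustrate what a missing permission means here: Kubernetes RBAC is purely additive, so a request is allowed only if at least one rule bound to the service account covers its API group, resource, and verb. The Python sketch below models that check in a simplified form. The `PolicyRule` class, the `allows` helper, and both rule sets are hypothetical. They are not the actual DWD role from gardener/gardener, and the real fix is tracked in the issue linked above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyRule:
    """Simplified stand-in for one RBAC rule (hypothetical, not the k8s type)."""
    api_groups: tuple
    resources: tuple
    verbs: tuple


def allows(rules, api_group, resource, verb):
    """Return True if any rule covers the request; "*" acts as a wildcard."""
    def covers(values, wanted):
        return "*" in values or wanted in values

    return any(
        covers(rule.api_groups, api_group)
        and covers(rule.resources, resource)
        and covers(rule.verbs, verb)
        for rule in rules
    )


# Hypothetical rule set that reproduces the denial from the log: read-only access.
read_only = [PolicyRule(("apps",), ("deployments",), ("get", "list", "watch"))]

# Hypothetical rule set with an additional rule granting "patch".
with_patch = read_only + [PolicyRule(("apps",), ("deployments",), ("patch",))]

print(allows(read_only, "apps", "deployments", "patch"))
print(allows(with_patch, "apps", "deployments", "patch"))
```

The sketch ignores subresources, resource names, and the distinction between namespaced and cluster-wide bindings, all of which matter in a real cluster.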

vpnachev commented 10 months ago

/close in favor of gardener/gardener#9035