Closed (ialidzhikov closed this issue 3 years ago)
Do the apiserver_current_inflight_requests and apiserver_current_inqueue_requests metrics of the affected KAPI indicate that it's overloaded?
I don't see anything abnormal for apiserver_current_inflight_requests. I cannot find any metric named apiserver_current_inqueue_requests.
Shoot spec uses the defaults:
kubernetes:
  kubeAPIServer:
    requests:
      maxNonMutatingInflight: 400
      maxMutatingInflight: 200
Can you check for any broken APIServices or webhook configurations?
You are right; I played around with it now.
With all APIServices available, it takes less than 30s to reconcile the shoot-core ManagedResource:
{"level":"info","ts":"2021-02-15T16:44:49.451Z","logger":"controller-runtime.manager.resource-controller","msg":"Starting to reconcile ManagedResource","object":"shoot--foo--bar/shoot-core"}
{"level":"info","ts":"2021-02-15T16:45:11.250Z","logger":"controller-runtime.manager.resource-controller","msg":"Finished to reconcile ManagedResource","object":"shoot--foo--bar/shoot-core"}
With an unavailable APIService, it takes 15m, 20m, or even more:
{"level":"info","ts":"2021-02-15T16:48:24.425Z","logger":"controller-runtime.manager.resource-controller","msg":"Starting to reconcile ManagedResource","object":"shoot--foo--bar/shoot-core"}
{"level":"info","ts":"2021-02-15T17:03:22.428Z","logger":"controller-runtime.manager.resource-controller","msg":"Finished to reconcile ManagedResource","object":"shoot--foo--bar/shoot-core"}
I will move this issue under g/gardener-resource-manager.
Steps to reproduce: apply an APIService whose backing service is unavailable, for example:
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1beta1.external.metrics.k8s.io
spec:
  group: external.metrics.k8s.io
  groupPriorityMinimum: 100
  insecureSkipTLSVerify: true
  service:
    name: keda-operator-metrics-apiserver
    namespace: keda
    port: 443
  version: v1beta1
  versionPriority: 100
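When this manifest is applied without the corresponding keda-operator-metrics-apiserver service, the APIService presumably never becomes available. A minimal sketch (not from the issue; the client construction and kubeconfig path are assumptions) to list which APIServices report a failing Available condition:

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
	apiregistrationv1 "k8s.io/kube-aggregator/pkg/apis/apiregistration/v1"
	aggregatorclient "k8s.io/kube-aggregator/pkg/client/clientset_generated/clientset"
)

func main() {
	// Assumption: kubeconfig pointing at the affected cluster.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	client, err := aggregatorclient.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	apiServices, err := client.ApiregistrationV1().APIServices().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	// Print the Available condition of every registered APIService.
	for _, apiService := range apiServices.Items {
		for _, cond := range apiService.Status.Conditions {
			if cond.Type == apiregistrationv1.Available {
				fmt.Printf("%s: Available=%s (%s)\n", apiService.Name, cond.Status, cond.Reason)
			}
		}
	}
}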
@mvladev, it is probably the same issue that caused problems during the KonnectivityTunnel enablement.
Thanks @ialidzhikov for the great description (again)! Do you already have some ideas why a broken APIService causes this behaviour?
Also, maybe @timebertt can comment here as well. He tried to reproduce the same behaviour some weeks ago but wasn't able to do so. What was the difference between your setup back then and @ialidzhikov's setup now?
Hmm, that's a good question. IIRC, I tried to reproduce it with an example MR from this repo and also on a shoot cluster, but in both cases, I didn't observe any discovery failures or long reconciliation times.
But indeed, it seems like grm is doing a lot of discovery calls here.
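For context, a minimal sketch (assumed kubeconfig path; this is not grm's actual code) of how an unavailable APIService shows up in client-go discovery: a full discovery round hits each group/version once, and the broken group is reported via ErrGroupDiscoveryFailed while the rest of the results are still returned.

package main

import (
	"fmt"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		panic(err)
	}
	// One request per group/version: repeating this on every reconciliation
	// adds up quickly and is subject to client-side throttling.
	_, resourceLists, err := dc.ServerGroupsAndResources()
	if discovery.IsGroupDiscoveryFailedError(err) {
		// e.g. external.metrics.k8s.io/v1beta1: the server is currently
		// unable to handle the request
		fmt.Println("partial discovery failure:", err)
	} else if err != nil {
		panic(err)
	}
	fmt.Println("resource lists discovered:", len(resourceLists))
}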
Maybe this only happens if there is an unavailable APIService on startup, and not if an APIService becomes unavailable during runtime... Maybe that was the difference in my tests 🤔
I don't know how, but maybe https://github.com/gardener/gardener-resource-manager/pull/111/commits/f74b1da30b27f0667193cb26f900da52c244b3a7 could also be related here?
/assign @timebertt @rfranzke
/in-progress
How to categorize this issue?
/area ops-productivity
/kind bug
What happened:
I see a Shoot cluster whose SystemComponentsHealthy condition is flapping quite often between healthy and unhealthy. When I check the logs of the gardener-resource-manager, I see that the shoot-core ManagedResource takes more than 20m to reconcile:
What you expected to happen:
Reconciliation of the shoot-core ManagedResource to take several minutes at most.
How to reproduce it (as minimally and precisely as possible): Not clear for now.
In the logs of gardener-resource-manager I see messages about throttling requests:
Not sure, but from the logs it seems that gardener-resource-manager is doing quite a lot of discovery calls.
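The "Throttling request took ..." messages come from client-go's client-side rate limiter. A minimal sketch (the values shown are client-go's documented defaults; the rest is an assumption, not grm's actual configuration) of the knobs involved; many discovery requests per reconciliation exhaust this budget quickly:

package main

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	// Client-side rate limiting: these are the client-go defaults.
	// A burst of discovery calls (one per API group/version) exceeds
	// them and gets delayed, which surfaces as the throttling log lines.
	cfg.QPS = 5
	cfg.Burst = 10

	if _, err := kubernetes.NewForConfig(cfg); err != nil {
		panic(err)
	}
}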
Anything else we need to know?:
Environment:
Kubernetes version (use kubectl version): v1.18.12