Closed (ialidzhikov closed this issue 3 years ago)
Do the apiserver_current_inflight_requests and apiserver_current_inqueue_requests metrics of the affected KAPI indicate that it's overloaded?
I don't see anything abnormal for apiserver_current_inflight_requests. I cannot find any metric named apiserver_current_inqueue_requests.
Shoot spec uses the defaults:
kubernetes:
  kubeAPIServer:
    requests:
      maxNonMutatingInflight: 400
      maxMutatingInflight: 200
Can you check for any broken APIServices or webhook configurations?
You are right; I played around with it now.
With all APIServices available, it takes less than 30s to reconcile the shoot-core ManagedResource:
{"level":"info","ts":"2021-02-15T16:44:49.451Z","logger":"controller-runtime.manager.resource-controller","msg":"Starting to reconcile ManagedResource","object":"shoot--foo--bar/shoot-core"}
{"level":"info","ts":"2021-02-15T16:45:11.250Z","logger":"controller-runtime.manager.resource-controller","msg":"Finished to reconcile ManagedResource","object":"shoot--foo--bar/shoot-core"}
With an unavailable APIService, it takes 15m, 20m, or even more:
{"level":"info","ts":"2021-02-15T16:48:24.425Z","logger":"controller-runtime.manager.resource-controller","msg":"Starting to reconcile ManagedResource","object":"shoot--foo--bar/shoot-core"}
{"level":"info","ts":"2021-02-15T17:03:22.428Z","logger":"controller-runtime.manager.resource-controller","msg":"Finished to reconcile ManagedResource","object":"shoot--foo--bar/shoot-core"}
I will move this issue under g/gardener-resource-manager.
Steps to reproduce: apply an APIService whose backing service is unavailable, for example:
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1beta1.external.metrics.k8s.io
spec:
  group: external.metrics.k8s.io
  groupPriorityMinimum: 100
  insecureSkipTLSVerify: true
  service:
    name: keda-operator-metrics-apiserver
    namespace: keda
    port: 443
  version: v1beta1
  versionPriority: 100
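When this manifest is applied without the corresponding keda-operator-metrics-apiserver service, the APIService presumably never becomes available. A minimal sketch (not from the issue; the client construction and kubeconfig path are assumptions) to list which APIServices report a failing Available condition:

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
	apiregistrationv1 "k8s.io/kube-aggregator/pkg/apis/apiregistration/v1"
	aggregatorclient "k8s.io/kube-aggregator/pkg/client/clientset_generated/clientset"
)

func main() {
	// Assumption: kubeconfig pointing at the affected cluster.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	client, err := aggregatorclient.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	apiServices, err := client.ApiregistrationV1().APIServices().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	// Print the Available condition of every registered APIService.
	for _, apiService := range apiServices.Items {
		for _, cond := range apiService.Status.Conditions {
			if cond.Type == apiregistrationv1.Available {
				fmt.Printf("%s: Available=%s (%s)\n", apiService.Name, cond.Status, cond.Reason)
			}
		}
	}
}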
@mvladev, it is probably the same issue that caused problems during the KonnectivityTunnel enablement.
Thanks @ialidzhikov for the great description (again)! Do you already have some ideas why a broken APIService causes this behaviour?
Also, maybe @timebertt can comment here as well. He tried to reproduce the same behaviour some weeks ago but wasn't able to do so. What was the difference between your setup back then and @ialidzhikov's setup now?
Hmm, that's a good question. IIRC, I tried to reproduce it with an example MR from this repo and also on a shoot cluster, but in both cases, I didn't observe any discovery failures or long reconciliation times.
But indeed, it seems like grm is doing a lot of discovery calls here.
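For context, a minimal sketch (assumed kubeconfig path; this is not grm's actual code) of how an unavailable APIService shows up in client-go discovery: a full discovery round hits each group/version once, and the broken group is reported via ErrGroupDiscoveryFailed while the rest of the results are still returned.

package main

import (
	"fmt"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		panic(err)
	}
	// One request per group/version: repeating this on every reconciliation
	// adds up quickly and is subject to client-side throttling.
	_, resourceLists, err := dc.ServerGroupsAndResources()
	if discovery.IsGroupDiscoveryFailedError(err) {
		// e.g. external.metrics.k8s.io/v1beta1: the server is currently
		// unable to handle the request
		fmt.Println("partial discovery failure:", err)
	} else if err != nil {
		panic(err)
	}
	fmt.Println("resource lists discovered:", len(resourceLists))
}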
Maybe this only happens if there is an unavailable APIService on startup, and not if an APIService becomes unavailable during runtime... Maybe that was the difference in my tests 🤔
I don't know how, but maybe https://github.com/gardener/gardener-resource-manager/pull/111/commits/f74b1da30b27f0667193cb26f900da52c244b3a7 could also be related here?
/assign @timebertt @rfranzke
/in-progress
How to categorize this issue?
/area ops-productivity
/kind bug
What happened:
I see a Shoot cluster whose SystemComponentsHealthy condition is flapping quite often between healthy and unhealthy. When I check the logs of the gardener-resource-manager, I see that the shoot-core ManagedResource takes more than 20m to reconcile:
What you expected to happen:
Reconciliation of the shoot-core ManagedResource to take several minutes at most.
How to reproduce it (as minimally and precisely as possible): Not clear for now.
In the logs of gardener-resource-manager I see messages about throttling requests:
Not sure, but from the logs it seems that gardener-resource-manager is doing quite a lot of discovery calls.
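The "Throttling request took ..." messages come from client-go's client-side rate limiter. A minimal sketch (the values shown are client-go's documented defaults; the rest is an assumption, not grm's actual configuration) of the knobs involved; many discovery requests per reconciliation exhaust this budget quickly:

package main

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	// Client-side rate limiting: these are the client-go defaults.
	// A burst of discovery calls (one per API group/version) exceeds
	// them and gets delayed, which surfaces as the throttling log lines.
	cfg.QPS = 5
	cfg.Burst = 10

	if _, err := kubernetes.NewForConfig(cfg); err != nil {
		panic(err)
	}
}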
Anything else we need to know?:
Environment:
Kubernetes version (use kubectl version): v1.18.12