johnswarbrick-napier opened this issue 1 year ago
Do you use Rancher, or do you have many completed pods or jobs? Also, please check whether you are using Kyverno: https://kyverno.io/docs/troubleshooting/#api-server-is-blocked
Can you please enable the diagnostic settings on AKS for the API server and run this query to find the most frequent calls:
```kusto
AzureDiagnostics
| where Category == "kube-audit"
| extend p = parse_json(log_s)
| project TimeGenerated, Category, pod_s,
    ID = tostring(p.containerID),
    stage = tostring(p.stage),
    requestURI = tostring(p.requestURI),
    verb = tostring(p.verb),
    userAgent = tostring(p.userAgent),
    status = tostring(p.responseStatus.code),
    user = tostring(p.user.username),
    latency = datetime_diff('millisecond', todatetime(p.stageTimestamp), todatetime(p.requestReceivedTimestamp))
| where stage contains "ResponseComplete"
| summarize count() by verb, requestURI, user, userAgent
| order by count_ desc
| take 10
```
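To line the request surges up against wall-clock time rather than just totals, a variant of the same query can bucket completed requests into 5-minute bins per user agent (a sketch reusing the same `kube-audit` fields as above):

```kusto
AzureDiagnostics
| where Category == "kube-audit"
| extend p = parse_json(log_s)
| extend userAgent = tostring(p.userAgent), stage = tostring(p.stage)
| where stage contains "ResponseComplete"
// One row per 5-minute window per client, so spikes stand out by time of day
| summarize requests = count() by bin(TimeGenerated, 5m), userAgent
| order by requests desc
```

Rendering this as a time chart makes it easy to see which client is responsible for each spike.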
Hi @abarqawi, thanks for replying!
We are not using Rancher or Kyverno.
We don't have many pods or jobs.
I will try enabling the diagnostic setting and report back.
I will also try provisioning a new AKS cluster but not deploy any applications (apart from our monitoring stack) and see if the same spikes appear.
An interesting datapoint is the spikes are at different times per day, suggesting it's not a fixed schedule like a cronjob or similar:
- 30 May: spikes at 02:45, 04:49, 07:08 and 07:25
- 29 May: spikes at 09:12 and 12:42
- 28 May: spikes at 09:18, 20:11, 20:13, 20:54, 21:21 and 23:11
We see these events on freshly built AKS clusters where the applications are idle.
I suspect they are coming from the control plane, or some external management process in the Microsoft backend, rather than our applications.
@johnswarbrick-napier Can you create a support case with a sample cluster? We will take a look.
Thanks @aritraghosh
TrackingID#2305040050000288, which has been open a while already.
Diagnostics information:
Unless there is a mismatch in the times (Grafana is UTC; I assume Azure Diagnostics is also UTC?), I don't see a correlating spike in API queries that matches the latency and timeouts we experience.
But something is clearly causing them, given the frequent errors and timeouts we see.
I assume the requests are not logged by the Kubernetes API because the API has become unavailable and the request doesn't complete.
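One way to probe that assumption is to look for requests that did complete, but with server-side error codes; timed-out requests that never reach the `ResponseComplete` stage would still be invisible here. A hedged sketch reusing the fields from the earlier query:

```kusto
AzureDiagnostics
| where Category == "kube-audit"
| extend p = parse_json(log_s)
| extend status = toint(p.responseStatus.code),
         verb = tostring(p.verb),
         user = tostring(p.user.username),
         requestURI = tostring(p.requestURI)
// 5xx responses indicate the API server answered, but unhealthily
| where status >= 500
| summarize failures = count() by bin(TimeGenerated, 15m), verb, requestURI, user
| order by failures desc
```

An absence of 5xx rows during a latency spike would support the theory that the affected requests never completed at all.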
@johnswarbrick-napier what AKS Tier are you using?
https://learn.microsoft.com/en-us/azure/aks/free-standard-pricing-tiers
I would recommend Standard Tier if running Prometheus.
@johnswarbrick-napier I can see from support case 2305040050000288 that the user strimzi-cluster-operator is creating those requests. Can you check what this behavior is and how to fine-tune it? https://strimzi.io/docs/operators/latest/configuring.html
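For reference, one Strimzi knob that directly affects API call volume is the Cluster Operator's full reconciliation interval, set as an environment variable on the operator Deployment. The fragment below is illustrative only; the value shown is an example, not a recommendation:

```yaml
# Strimzi Cluster Operator Deployment (fragment) - illustrative values only
spec:
  template:
    spec:
      containers:
        - name: strimzi-cluster-operator
          env:
            # How often the operator re-reconciles every resource it manages.
            # Raising it reduces steady-state API server traffic.
            - name: STRIMZI_FULL_RECONCILIATION_INTERVAL_MS
              value: "300000"  # default is 120000 (2 minutes)
```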
Hi - we are going to try upgrading the Strimzi Kafka Operator to the latest version and then re-test.
@johnswarbrick-napier did you ever get the results of that retest?
Our issue has been resolved. We are running Keda on AKS. See here for resolution details.
Describe the bug
We are experiencing intermittent periods of timeouts and failed queries to the KubeAPI in all our AKS clusters (>40), which is triggering monitoring alerts on a regular basis.
Looking at API Server metrics from our Prometheus stack, we see huge surges in KubeAPI requests of up to 150k. Also high latency and failures.
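For context, these surges show up with standard API server metrics queried from Prometheus. The queries below are a sketch; the exact `job` label and label set depend on your scrape config and Kubernetes version:

```promql
# Request rate to the API server, broken down by verb and response code
sum(rate(apiserver_request_total{job="apiserver"}[5m])) by (verb, code)

# 99th percentile API server request latency per verb
histogram_quantile(0.99,
  sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver"}[5m])) by (verb, le))
```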
We have been unable to determine the source of these surges in KubeAPI requests, partly as we do not have access to the Kubernetes control plane logs in AKS to determine the source.
There are no obvious errors in the kube-system or any other logs currently available to us. We suspect the source of these surges in queries is not our customer workload (in many situations our clusters are idle with no active workloads), but possibly something running on the control-plane side, such as a reconciliation, upgrade or backup type activity.
Has anyone seen this before and can recommend a resolution?
To Reproduce
Steps to reproduce the behavior:

Expected behavior
To not receive monitoring alerts related to timeouts and failed queries to the KubeAPI.

Environment (please complete the following information):