Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

[BUG] AKS - KubeAPI high latency / failures during huge surges in requests #3685

Open · johnswarbrick-napier opened this issue 1 year ago

johnswarbrick-napier commented 1 year ago

Describe the bug
We are experiencing intermittent periods of timeouts and failed queries to the KubeAPI in all our AKS clusters (>40), which is triggering monitoring alerts on a regular basis.

Looking at API server metrics from our Prometheus stack, we see huge surges in KubeAPI requests of up to 150k, along with high latency and failures.

We have been unable to determine the source of these surges in KubeAPI requests, partly as we do not have access to the Kubernetes control plane logs in AKS to determine the source.

There are no obvious errors in the kube-system or any other logs currently available to us.

We suspect the source of these surges in queries is not our customer workload (in many situations our clusters are idle with no active workloads) but possibly something running on the control plane side, such as a reconciliation, upgrade or backup-type activity.

Has anyone seen this before and can recommend a resolution?

[Screenshots: Prometheus/Grafana panels showing the KubeAPI request surges, latency and failures]

To Reproduce
Steps to reproduce the behavior:

  1. Deploy a Prometheus / Grafana stack to gather KubeAPI metrics from AKS
  2. Observe the high volumes of KubeAPI queries
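For reference, the same numbers can be pulled without Grafana by querying the in-cluster Prometheus HTTP API directly. This is only a sketch: the monitoring namespace and the prometheus-operated service name are assumptions and will vary per install.

  # Port-forward the in-cluster Prometheus (namespace/service name are assumptions)
  kubectl -n monitoring port-forward svc/prometheus-operated 9090:9090 &

  # Top verb/resource combinations by API server request rate over the last 5 minutes
  curl -sG 'http://localhost:9090/api/v1/query' \
    --data-urlencode 'query=topk(10, sum(rate(apiserver_request_total[5m])) by (verb, resource))'

  # 99th percentile API server request latency, per verb
  curl -sG 'http://localhost:9090/api/v1/query' \
    --data-urlencode 'query=histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket[5m])) by (le, verb))'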

Expected behavior
To not receive monitoring alerts related to timeouts and failed queries to the KubeAPI.

Environment (please complete the following information):

abarqawi commented 1 year ago

Do you use Rancher? Or do you have completed pods or many Jobs? Also, please check whether you are using Kyverno: https://kyverno.io/docs/troubleshooting/#api-server-is-blocked
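A quick way to check for accumulated finished Pods and Jobs, which add extra list/watch load on the API server (a sketch; adjust namespaces as needed):

  # Count completed and failed Pods across all namespaces
  kubectl get pods -A --field-selector=status.phase=Succeeded --no-headers | wc -l
  kubectl get pods -A --field-selector=status.phase=Failed --no-headers | wc -l

  # Count Jobs across all namespaces
  kubectl get jobs -A --no-headers | wc -l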

Can you please enable diagnostic settings on AKS for the API server and run this query to find the most frequent calls:

AzureDiagnostics
| where Category == "kube-audit"
| extend p = parse_json(log_s)
| project TimeGenerated, Category, pod_s,
    ID = tostring(p.containerID),
    stage = tostring(p.stage),
    requestURI = tostring(p.requestURI),
    verb = tostring(p.verb),
    userAgent = tostring(p.userAgent),
    status = tostring(p.responseStatus.code),
    user = tostring(p.user.username),
    latency = datetime_diff('millisecond', todatetime(p.stageTimestamp), todatetime(p.requestReceivedTimestamp))
| where stage contains "ResponseComplete"
| summarize count() by verb, requestURI, user, userAgent
| order by count_ desc
| take 10
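If the diagnostic setting is not yet enabled, a minimal Azure CLI sketch follows; resource group, cluster and workspace names are placeholders, and the lighter kube-audit-admin category can be used instead of kube-audit to reduce log volume:

  # Send the kube-audit category to a Log Analytics workspace so the query above returns data
  az monitor diagnostic-settings create \
    --name aks-kube-audit \
    --resource "$(az aks show -g my-rg -n my-aks --query id -o tsv)" \
    --workspace "$(az monitor log-analytics workspace show -g my-rg -n my-law --query id -o tsv)" \
    --logs '[{"category":"kube-audit","enabled":true}]'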

johnswarbrick-napier commented 1 year ago

Hi @abarqawi, thanks for replying!

We are not using Rancher or Kyverno.

We don't have many pods or jobs:

[Screenshot: pod and job counts]

I will try enabling the diagnostic setting and report back.

I will also try provisioning a new AKS cluster but not deploy any applications (apart from our monitoring stack) and see if the same spikes appear.

An interesting data point is that the spikes occur at different times each day, suggesting it's not a fixed schedule like a CronJob or similar:

30 May: spikes at 02:45, 04:49, 07:08 and 07:25
29 May: spikes at 09:12 and 12:42
28 May: spikes at 09:18, 20:11, 20:13, 20:54, 21:21 and 23:11

We see these events on freshly built AKS clusters where the applications are idle.

I suspect they are coming from the control plane, or some external management process in the Microsoft backend, rather than our applications.

aritraghosh commented 1 year ago

@johnswarbrick-napier Can you create a support case with a sample cluster? We will take a look.

johnswarbrick-napier commented 1 year ago

Thanks @aritraghosh

TrackingID#2305040050000288

It has been open for a while already.

johnswarbrick-napier commented 1 year ago

Diagnostics information:

[Screenshots: Azure Diagnostics kube-audit query results]

Unless there is a mismatch in the times (Grafana is UTC; I assume Azure Diagnostics is also UTC?), I don't see a correlating spike in API queries that matches up with the latency and timeouts we experience.

But something is clearly causing them, given the frequent errors and timeouts we experience:

[Screenshot: KubeAPI errors and timeouts]

I assume the requests are not logged by the Kubernetes API because the API has become unavailable and the requests don't complete.

miwithro commented 1 year ago

@johnswarbrick-napier what AKS Tier are you using?

https://learn.microsoft.com/en-us/azure/aks/free-standard-pricing-tiers

I would recommend the Standard tier if you are running Prometheus.
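For reference, moving an existing cluster off the Free tier is a one-line Azure CLI change (a sketch; resource group and cluster names are placeholders):

  # Upgrade the control plane to the Standard tier (Uptime SLA) for more API server capacity
  az aks update --resource-group my-rg --name my-aks --tier standard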

abarqawi commented 1 year ago

@johnswarbrick-napier I can see you got an answer from support case 2305040050000288 that the user strimzi-cluster-operator is creating those requests. Can you check what this behavior is and how to fine-tune it? https://strimzi.io/docs/operators/latest/configuring.html
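One knob worth checking (a sketch, not a confirmed fix): the Strimzi Cluster Operator's full reconciliation interval, set via the STRIMZI_FULL_RECONCILIATION_INTERVAL_MS environment variable on its Deployment. The namespace and value below are illustrative:

  # Lengthen the operator's full reconciliation interval so it lists/patches resources less often
  kubectl -n kafka set env deployment/strimzi-cluster-operator \
    STRIMZI_FULL_RECONCILIATION_INTERVAL_MS=600000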

johnswarbrick-napier commented 1 year ago

Hi - we are going to try upgrading the Strimzi Kafka Operator to the latest version and then re-test.

RooMaiku commented 9 months ago

@johnswarbrick-napier did you ever get the results of that retest?

jfouche-vendavo commented 1 month ago

Our issue has been resolved. We are running KEDA on AKS. See here for resolution details.