Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

[BUG] calico-kube-controllers deployment is labeled twice with the CriticalAddonsOnly toleration #4282

Open rgarcia89 opened 1 month ago

rgarcia89 commented 1 month ago

Describe the bug
On AKS clusters with Calico enabled, a calico-system namespace is created, which contains a deployment calico-kube-controllers. That deployment currently defines the CriticalAddonsOnly toleration twice. Starting with Prometheus v2.52.0, which introduced a check for duplicate samples, this results in scrape errors.

       tolerations:
       - key: CriticalAddonsOnly # <- no 1
         operator: Exists
       - effect: NoSchedule
         key: node-role.kubernetes.io/master
       - effect: NoSchedule
         key: node-role.kubernetes.io/control-plane
       - key: CriticalAddonsOnly # <- no 2
         operator: Exists

This duplicate toleration is what triggers the problem: kube-state-metrics emits the same kube_pod_tolerations metric twice for the pod. I had originally opened an issue on the Prometheus project, since I expected this to be a Prometheus bug, which it isn't: https://github.com/prometheus/prometheus/issues/14089
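For anyone who wants to verify this on their own cluster, here is a minimal sketch using the Python kubernetes client (assumes kubeconfig access; not part of the original report) that counts duplicated tolerations on the deployment's pod template:

    from collections import Counter

    from kubernetes import client, config

    # Load credentials from the local kubeconfig (assumes cluster access).
    config.load_kube_config()

    apps = client.AppsV1Api()
    dep = apps.read_namespaced_deployment(
        name="calico-kube-controllers", namespace="calico-system"
    )

    # Normalize each toleration into a hashable tuple so duplicates can be counted.
    tolerations = dep.spec.template.spec.tolerations or []
    counts = Counter((t.key, t.operator, t.effect, t.value) for t in tolerations)

    for toleration, n in counts.items():
        if n > 1:
            print(f"duplicate toleration ({n}x): {toleration}")

If the toleration really is present twice in the pod template, the script prints one line for the duplicated (key, operator, effect, value) tuple.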

Prometheus log output

ts=2024-05-13T19:20:40.233Z caller=main.go:1372 level=info msg="Completed loading of configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml totalDuration=95.860644ms db_storage=1.142µs remote_storage=150.634µs web_handler=872ns query_engine=776ns scrape=98.941µs scrape_sd=7.197985ms notify=13.095µs notify_sd=269.119µs rules=54.251368ms tracing=6.745µs
...
ts=2024-05-13T19:21:09.190Z caller=scrape.go:1777 level=debug component="scrape manager" scrape_pool=serviceMonitor/monitoring/kube-state-metrics/0 target=https://10.244.5.6:8443/metrics msg="Duplicate sample for timestamp" series="kube_pod_tolerations{namespace=\"calico-system\",pod=\"calico-kube-controllers-75c647b46c-pg9cr\",uid=\"bf944c52-17bd-438b-bbf1-d97f8671bd6b\",key=\"CriticalAddonsOnly\",operator=\"Exists\"}"
ts=2024-05-13T19:21:09.207Z caller=scrape.go:1738 level=warn component="scrape manager" scrape_pool=serviceMonitor/monitoring/kube-state-metrics/0 target=https://10.244.5.6:8443/metrics msg="Error on ingesting samples with different value but same timestamp" num_dropped=1

Environment (please complete the following information):

felixZdi commented 1 month ago

Same for v1.28.9

rgarcia89 commented 1 month ago

@sabbour this validation could break the above-mentioned deployment: https://github.com/kubernetes/kubernetes/issues/124881

idogada-akamai commented 3 weeks ago

Any update on this?

Aaron-ML commented 3 weeks ago

Would love to see this resolved; it is creating log spam and alerts on our Prometheus stack due to the duplicate labels.

rgarcia89 commented 3 weeks ago

@Aaron-ML I am also using the kube-prometheus-stack and have downgraded prometheus to v2.51.2 until it is fixed...
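In case it helps others, a minimal values override for the kube-prometheus-stack chart that pins the Prometheus image to v2.51.2 could look roughly like this (field names assumed from the chart's values layout, double-check against your chart version):

    prometheus:
      prometheusSpec:
        image:
          registry: quay.io
          repository: prometheus/prometheus
          tag: v2.51.2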

Aaron-ML commented 3 weeks ago

> @Aaron-ML I am also using the kube-prometheus-stack and have downgraded prometheus to v2.51.2 until it is fixed...

We've mitigated it for now by temporarily removing the alert related to prometheus ingest failures. Hopefully this gets resolved soon.

rgarcia89 commented 1 week ago

@chasewilson any update available?