The recommended Alert for Pod Restart in https://github.com/Azure/prometheus-collector/blob/main/mixins/kubernetes/rules/recording_and_alerting_rules/templates/ci_recommended_alerts.json does not seem to be correct.
The query is: "sum by (namespace, pod, container, cluster) (kube_pod_container_status_restarts_total{job=\"kube-state-metrics\", namespace=\"kube-system\"}) > 0"
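The metric used in the query is a counter, so its value never decreases for the lifetime of the container. This means that once a container has restarted even once, the expression stays above 0 and the alert keeps firing indefinitely. I think the correct approach would be to use a function such as increase() over a period of time, so that only recent restarts trigger the alert.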
Also, why is the alert restricted to pods in the kube-system namespace? I believe it would be useful for it to cover all namespaces.
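
As a rough sketch of what I have in mind, the expression could look something like the one below. The 1h window and the > 0 threshold are placeholders I picked for illustration, not values taken from the repo, and I have dropped the namespace matcher on the assumption that covering all namespaces is acceptable:

```promql
# Sketch only: the [1h] window and the "> 0" threshold are assumed values.
# increase() estimates how much the counter grew within the window, so the
# result drops back to 0 once no restarts have happened for an hour.
sum by (namespace, pod, container, cluster) (
  increase(kube_pod_container_status_restarts_total{job="kube-state-metrics"}[1h])
) > 0
```

With a query along these lines, the alert should resolve on its own once the container has been stable for the whole window, instead of firing for as long as the restart count is non-zero.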