CI Recommended Alert for Pod Restart incorrect

abdullah248 commented 1 year ago

The recommended Alert for Pod Restart in https://github.com/Azure/prometheus-collector/blob/main/mixins/kubernetes/rules/recording_and_alerting_rules/templates/ci_recommended_alerts.json does not seem to be correct.

The query is: "sum by (namespace, pod, container, cluster) (kube_pod_container_status_restarts_total{job=\"kube-state-metrics\", namespace=\"kube-system\"}) > 0"

The metric used in the query is a counter, so its value only ever increases. This means that once a container has restarted even once, the expression stays above 0 and the alert fires forever. I think the correct approach would be to use a function such as increase() over a period of time, so the alert only fires on recent restarts.

Also, why is the alert restricted to pods in the kube-system namespace? I believe it would be useful across all namespaces.
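
Both suggestions above could be sketched roughly as follows. This is only an illustration, not the fix that was applied in the repository: the 1-hour window is an assumed value, and dropping the namespace matcher is the proposal from this comment rather than confirmed behavior of the template.

```promql
# Fire only if a container restarted within the last hour.
# The [1h] window is an assumption; choose it to match the alert's evaluation needs.
# The namespace="kube-system" matcher is omitted here so all namespaces are covered;
# add it back to the selector to keep the original scope.
sum by (namespace, pod, container, cluster) (
  increase(kube_pod_container_status_restarts_total{job="kube-state-metrics"}[1h])
) > 0
```

Because increase() looks only at the change inside the window, the expression drops back to 0 once a pod has been stable for longer than the window, and the alert resolves.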

gracewehner commented 1 year ago

Hi @Sohamdg081992, FYI. Could you please take a look when you get a chance? Thanks!

abdullah248 commented 1 year ago

There is also a similar error with the OOM alert.

Sohamdg081992 commented 1 year ago

I will look into this.

Sohamdg081992 commented 1 year ago

The alert has been fixed.
