Closed: faevourite closed this 1 week ago
Hey @faevourite - thanks for reaching out.
They should be there already. I just port-forwarded the metrics endpoint of a rule-evaluator from a real GKE cluster and was able to see it.
# HELP prometheus_rule_evaluation_duration_seconds The duration for a rule to execute.
# TYPE prometheus_rule_evaluation_duration_seconds summary
prometheus_rule_evaluation_duration_seconds{quantile="0.5"} 0.21801961
prometheus_rule_evaluation_duration_seconds{quantile="0.9"} 0.251744665
prometheus_rule_evaluation_duration_seconds{quantile="0.99"} 1.291879739
prometheus_rule_evaluation_duration_seconds_sum 11263.593287203035
prometheus_rule_evaluation_duration_seconds_count 119163
# HELP prometheus_rule_evaluation_failures_total The total number of rule evaluation failures.
# TYPE prometheus_rule_evaluation_failures_total counter
prometheus_rule_evaluation_failures_total{rule_group="/etc/rules/rules__default__example-rules.yaml;example"} 7
# HELP prometheus_rule_evaluations_total The total number of rule evaluations.
# TYPE prometheus_rule_evaluations_total counter
prometheus_rule_evaluations_total{rule_group="/etc/rules/rules__default__example-rules.yaml;example"} 119163
# HELP prometheus_rule_group_duration_seconds The duration of rule group evaluations.
# TYPE prometheus_rule_group_duration_seconds summary
prometheus_rule_group_duration_seconds{quantile="0.01"} 0.291422544
prometheus_rule_group_duration_seconds{quantile="0.05"} 0.291422544
prometheus_rule_group_duration_seconds{quantile="0.5"} 0.501621914
prometheus_rule_group_duration_seconds{quantile="0.9"} 0.75325986
prometheus_rule_group_duration_seconds{quantile="0.99"} 1.570991952
prometheus_rule_group_duration_seconds_sum 11266.405408267998
prometheus_rule_group_duration_seconds_count 57959
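For anyone who wants to check these numbers programmatically rather than by eye, here is a minimal sketch that parses exposition text like the output above and computes the failure ratio per rule group. It uses only the Python standard library. The function names (`parse_samples`, `failure_ratio_by_group`) are made up for illustration and are not part of GMP or Prometheus, and the parser is deliberately simplified: it ignores timestamps and does not unescape label values.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# One label pair inside the braces: key="value" (escapes are kept as-is).
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and # HELP / # TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # not a sample line this simplified parser understands
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def failure_ratio_by_group(text):
    """Map each rule_group to failures / evaluations (hypothetical helper)."""
    failures, totals = {}, {}
    for name, labels, value in parse_samples(text):
        group = labels.get("rule_group")
        if group is None:
            continue
        if name == "prometheus_rule_evaluation_failures_total":
            failures[group] = value
        elif name == "prometheus_rule_evaluations_total":
            totals[group] = value
    # Groups with zero evaluations are skipped to avoid dividing by zero.
    return {g: failures.get(g, 0.0) / t for g, t in totals.items() if t > 0}


# Two sample lines copied from the output above.
EXAMPLE = """\
# TYPE prometheus_rule_evaluation_failures_total counter
prometheus_rule_evaluation_failures_total{rule_group="/etc/rules/rules__default__example-rules.yaml;example"} 7
# TYPE prometheus_rule_evaluations_total counter
prometheus_rule_evaluations_total{rule_group="/etc/rules/rules__default__example-rules.yaml;example"} 119163
"""

if __name__ == "__main__":
    for group, ratio in failure_ratio_by_group(EXAMPLE).items():
        print(f"{group}: {ratio:.6%} of evaluations failed")
```

In practice the text would come from the port-forwarded metrics endpoint instead of a string constant. For anything beyond a quick check, a maintained parser is a safer choice than these regular expressions.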
I'll assume this is resolved for now. We can reopen if you find otherwise, @faevourite. Thanks!
Thank you! Apologies, this was indeed user error. :)
Prometheus has some built-in metrics like the counter prometheus_rule_evaluation_failures_total, which is incremented any time there's an issue evaluating a recording/alerting rule. This is a convenient alternative to watching the logs for errors. Could this metric and any others that would make sense from GMP's perspective be added to the rule-evaluator?
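To illustrate why a counter like this can replace log watching, here is a rough sketch that compares two successive readings of the counter per rule group and reports which groups had new failures. The `new_failures` helper and the dictionary shape (rule group name mapped to counter value) are assumptions made for this example, not an existing API. A drop in the value is treated as a counter reset, for example after a restart of the process exposing the metric.

```python
def new_failures(previous, current):
    """Return {rule_group: increase} for groups whose failure counter grew.

    Both arguments map a rule group name to the counter value seen in one
    scrape. This is a hypothetical helper for illustration only.
    """
    increases = {}
    for group, value in current.items():
        before = previous.get(group, 0.0)
        if value < before:
            # The counter went down, so it was reset and restarted from zero;
            # everything counted since the reset is new.
            delta = value
        else:
            delta = value - before
        if delta > 0:
            increases[group] = delta
    return increases


if __name__ == "__main__":
    earlier = {"example": 7.0}
    later = {"example": 9.0, "other": 0.0}
    print(new_failures(earlier, later))
```

With the metric scraped into a Prometheus-compatible backend, the same idea is normally expressed as an alerting rule over the counter's increase instead of custom code.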