xocasdashdash opened 3 years ago
Would you mind sharing a rule file it fails on?
I had to bisect all the rules (it's hard to know which one it's reporting on), but this file makes it fail consistently. Please note there's an empty line at the end:
groups:
  - name: "haproxy.anomaly_detection"
    rules:
      - record: haproxy:healthcheck_failure:rate5m
        expr: |
          sum without(instance, namespace, job, service, endpoint)
          (rate(haproxy_server_check_failures_total[5m]))
      - record: haproxy:healthcheck_failure:rate5m:avg_over_time_1w
        expr: avg_over_time(haproxy:healthcheck_failure:rate5m[1w])
      - record: haproxy:healthcheck_failure:rate5m:stddev_over_time_1w
        expr: stddev_over_time(haproxy:healthcheck_failure:rate5m[1w])
- name: "haproxy.api_server.rules"
rules:
- alert: HaproxyHealtCheckAnomaly
expr: |
abs((
haproxy:healthcheck_failure:rate5m-
haproxy:healthcheck_failure:rate5m:avg_over_time_1w
) / haproxy:healthcheck_failure:rate5m:stddev_over_time_1w) > 3
for: 10m
labels:
severity: debug
kind: "K8sApi"
annotations:
summary: "HAproxy is detecting more failures than usual on its health checks"
description: |
This value represents the absolute z-score. Here https://about.gitlab.com/blog/2019/07/23/anomaly-detection-using-prometheus/ you
can read more about how we're using it
runbook_url: "Check that HAProxy is communicating with the k8s server nodes"
      - alert: HaproxyApiMasterDown
        expr: haproxy_up{server=~".*master.*"} == 0
        for: 15m
        labels:
          severity: 10x5
          node: "{{ $labels.instance }}"
          kind: K8sApiMaster
        annotations:
          summary: "HAProxy master is down (instance {{ $labels.instance }})"
          description: "HAProxy down\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
      - alert: HaproxyApiMasterDown
        expr: haproxy_up{server=~".*master.*"} == 0
        for: 24h
        labels:
          severity: 10x5
          node: "{{ $labels.instance }}"
          kind: K8sApiMaster
        annotations:
          summary: "HAProxy master is down (instance {{ $labels.instance }})"
          description: "HAProxy down\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
      - alert: HaproxyApiMasterDown
        expr: count(haproxy_server_up{server=~".*master.*"} == 0) by (instance, backend) > 1
        for: 5m
        labels:
          severity: 24x7
          node: "{{ $labels.instance }}"
          kind: K8sApiMaster
        annotations:
          summary: "Multiple K8s master nodes are down (instance {{ $labels.instance }})"
          description: "HAProxy down\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
      - alert: HaproxyApiMasterDown
        expr: count(haproxy_server_up{server=~".*master.*"} == 0) by (instance, backend) > 2
        for: 2m
        labels:
          severity: 24x7
          node: "{{ $labels.instance }}"
          kind: K8sApiMaster
          inhibits: K8sApiInfra
        annotations:
          summary: "All K8s master nodes are down (instance {{ $labels.instance }})"
          description: "HAProxy down\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
      - alert: HaproxyApiInfraDown
        expr: haproxy_up{server=~".*infra.*"} == 0
        for: 15m
        labels:
          severity: 10x5
          node: "{{ $labels.instance }}"
          kind: K8sApiInfra
        annotations:
          summary: "HAProxy infra is down (server {{ $labels.server }})"
          description: "HAProxy down\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
      - alert: HaproxyHighHttp4xxErrorRateBackend
        expr: |
          sum by (backend) (rate(haproxy_server_http_responses_total{code="4xx"}[1m])) / sum by (backend) (rate(haproxy_server_http_responses_total{}[1m])) * 100 > 1
        for: 15m
        labels:
          severity: debug
        annotations:
          summary: "HAProxy high HTTP 4xx error rate backend (instance {{ $labels.instance }})"
          description: "Too many HTTP requests with status 4xx (> 1%) on backend {{ $labels.backend }}\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
      - alert: HaproxyHighHttp4xxErrorRateBackend
        expr: |
          sum by (backend) (rate(haproxy_server_http_responses_total{code="4xx"}[1m])) / sum by (backend) (rate(haproxy_server_http_responses_total{}[1m])) * 100 > 3
        for: 15m
        labels:
          severity: 10x5
        annotations:
          summary: "HAProxy high HTTP 4xx error rate backend (instance {{ $labels.instance }})"
          description: "Too many HTTP requests with status 4xx (> 3%) on backend {{ $labels.backend }}\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
      - alert: HaproxyHighHttp4xxErrorRateBackend
        expr: |
          sum by (backend) (rate(haproxy_server_http_responses_total{code="4xx"}[1m])) / sum by (backend) (rate(haproxy_server_http_responses_total{}[1m])) * 100 > 5
        for: 15m
        labels:
          severity: 24x7
        annotations:
          summary: "HAProxy high HTTP 4xx error rate backend (instance {{ $labels.instance }})"
          description: "Too many HTTP requests with status 4xx (> 5%) on backend {{ $labels.backend }}\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
      - alert: HaproxyHighHttp5xxErrorRateBackend
        expr: sum by (backend) (rate(haproxy_server_http_responses_total{code="5xx"}[1m])) / sum by (backend) (rate(haproxy_server_http_responses_total{}[1m])) * 100 > 1
        for: 15m
        labels:
          severity: debug
        annotations:
          summary: "HAProxy high HTTP 5xx error rate backend (instance {{ $labels.instance }})"
          description: "Too many HTTP requests with status 5xx (> 1%) on backend {{ $labels.backend }}\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
      - alert: HaproxyHighHttp5xxErrorRateBackend
        expr: sum by (backend) (rate(haproxy_server_http_responses_total{code="5xx"}[1m])) / sum by (backend) (rate(haproxy_server_http_responses_total{}[1m])) * 100 > 3
        for: 15m
        labels:
          severity: 10x5
        annotations:
          summary: "HAProxy high HTTP 5xx error rate backend (instance {{ $labels.instance }})"
          description: "Too many HTTP requests with status 5xx (> 3%) on backend {{ $labels.backend }}\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
      - alert: HaproxyHighHttp5xxErrorRateBackend
        expr: sum by (backend) (rate(haproxy_server_http_responses_total{code="5xx"}[1m])) / sum by (backend) (rate(haproxy_server_http_responses_total{}[1m])) * 100 > 5
        for: 15m
        labels:
          severity: 24x7
        annotations:
          summary: "HAProxy high HTTP 5xx error rate backend (instance {{ $labels.instance }})"
          description: "Too many HTTP requests with status 5xx (> 5%) on backend {{ $labels.backend }}\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
      - alert: HaproxyHighHttp4xxErrorRateServer
        expr: sum by (server) (rate(haproxy_server_http_responses_total{code="4xx"}[1m])) / sum by (backend) (rate(haproxy_server_http_responses_total{}[1m])) * 100 > 5
        for: 5m
        labels:
          severity: debug
        annotations:
          summary: "HAProxy high HTTP 4xx error rate server (instance {{ $labels.instance }})"
          description: "Too many HTTP requests with status 4xx (> 5%) on server {{ $labels.server }}\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
      - alert: HaproxyHighHttp5xxErrorRateServer
        expr: sum by (server) (rate(haproxy_server_http_responses_total{code="5xx"}[1m])) / sum by (backend) (rate(haproxy_server_http_responses_total{}[1m])) * 100 > 5
        for: 5m
        labels:
          severity: debug
        annotations:
          summary: "HAProxy high HTTP 5xx error rate server (instance {{ $labels.instance }})"
          description: "Too many HTTP requests with status 5xx (> 5%) on server {{ $labels.server }}\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
      - alert: HaproxyBackendConnectionErrors
        expr: sum by (backend) (rate(haproxy_backend_connection_errors_total[1m])) * 100 > 5
        for: 5m
        labels:
          severity: 10x5
        annotations:
          summary: "HAProxy backend connection errors (instance {{ $labels.instance }})"
          description: "Too many connection errors to {{ $labels.fqdn }}/{{ $labels.backend }} backend (> 5%). Request throughput may be too high.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
      - alert: HaproxyBackendConnectionErrors
        expr: sum by (backend) (rate(haproxy_backend_connection_errors_total[1m])) * 100 > 35
        for: 5m
        labels:
          severity: 24x7
        annotations:
          summary: "HAProxy backend connection errors (instance {{ $labels.instance }})"
          description: "Too many connection errors to {{ $labels.fqdn }}/{{ $labels.backend }} backend (> 35%). Request throughput may be too high.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
      - alert: HaproxyServerResponseErrors
        expr: sum by (server) (rate(haproxy_server_response_errors_total[1m])) * 100 > 5
        for: 5m
        labels:
          severity: debug
        annotations:
          summary: "HAProxy server response errors (instance {{ $labels.instance }})"
          description: "Too many response errors to {{ $labels.server }} server (> 5%).\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
      - alert: HaproxyServerConnectionErrors
        expr: sum by (server) (rate(haproxy_server_connection_errors_total[1m])) * 100 > 5
        for: 5m
        labels:
          severity: debug
        annotations:
          summary: "HAProxy server connection errors (instance {{ $labels.instance }})"
          description: "Too many connection errors to {{ $labels.server }} server (> 5%). Request throughput may be too high.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
      - alert: HaproxyPendingRequests
        expr: sum by (backend) (haproxy_backend_current_queue) > 0
        for: 5m
        labels:
          severity: 10x5
        annotations:
          summary: "HAProxy pending requests (instance {{ $labels.instance }})"
          description: "Some HAProxy requests are pending on {{ $labels.fqdn }}/{{ $labels.backend }} backend\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
      - alert: HaproxyRetryHigh
        expr: sum by (backend) (rate(haproxy_backend_retry_warnings_total[1m])) > 10
        for: 5m
        labels:
          severity: debug
        annotations:
          summary: "HAProxy retry high (instance {{ $labels.instance }})"
          description: "High rate of retry on {{ $labels.fqdn }}/{{ $labels.backend }} backend\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
      - alert: HaproxyBackendDown
        expr: haproxy_backend_up == 0
        for: 5m
        labels:
          severity: 10x5
        annotations:
          summary: "HAProxy backend down (instance {{ $labels.instance }})"
          description: "HAProxy backend is down\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
      - alert: HaproxyFrontendSecurityBlockedRequests
        expr: sum by (frontend) (rate(haproxy_frontend_requests_denied_total[5m])) > 10
        for: 5m
        labels:
          severity: debug
        annotations:
          summary: "HAProxy frontend security blocked requests (instance {{ $labels.instance }})"
          description: "HAProxy is blocking requests for security reasons\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
      - alert: HaproxyServerHealthcheckFailure
        expr: increase(haproxy_server_check_failures_total[15m]) > 10
        for: 5m
        labels:
          severity: 10x5
        annotations:
          summary: "HAProxy server healthcheck failure (instance {{ $labels.instance }})"
          description: "Some server healthchecks are failing on {{ $labels.server }}\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
      - alert: HaproxyServerHealthcheckFailure
        expr: increase(haproxy_server_check_failures_total[15m]) > 100
        for: 5m
        labels:
          severity: 24x7
        annotations:
          summary: "HAProxy server healthcheck failure (instance {{ $labels.instance }})"
          description: "Some server healthchecks are failing on {{ $labels.server }}\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
Just changing these 4 lines makes it work:
lines := strings.Split(content, "\n")
// Clamp lastLine so the slice below never reaches past the end of the file.
if lastLine >= len(lines) {
	lastLine = len(lines) - 1
}
for _, c := range lines[firstLine-1 : lastLine] {
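
To make the failure mode concrete, here's a self-contained sketch (illustrative, not pint's actual code) of how a reported line range that extends past the end of the file panics without the clamp:

package main

import (
	"fmt"
	"strings"
)

func main() {
	content := "line one\nline two"
	lines := strings.Split(content, "\n") // 2 entries
	firstLine, lastLine := 1, 4           // bogus range reported for the rule

	// Without clamping, lines[0:4] panics with
	// "slice bounds out of range [:4] with capacity 2".
	if lastLine > len(lines) {
		lastLine = len(lines)
	}
	fmt.Println(lines[firstLine-1 : lastLine]) // [line one line two]
}

Note that a slice's upper bound may equal len(lines), so clamping with > to len(lines) is enough; the >= clamp in the patch above also avoids the panic, but drops the file's final line when lastLine lands exactly on it.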
But it should probably be fixed in the line detection code. You can reduce the test case to a single alert that is the last one in the file, like this:
groups:
  - name: "haproxy.api_server.rules"
    rules:
      - alert: HaproxyServerHealthcheckFailure
        expr: increase(haproxy_server_check_failures_total[15m]) > 100
        for: 5m
        labels:
          severity: 24x7
        annotations:
          summary: "HAProxy server healthcheck failure (instance {{ $labels.instance }})"
          description: "Some server healthchecks are failing on {{ $labels.server }}\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
Thanks! Can you share your pint config too?
Sure!
prometheus "01-live" {
uri = "a valid prometheus url"
timeout = "60s"
}
prometheus "01-work" {
uri = "a valid prometheus url"
timeout = "60s"
}
rule {
match {
kind = "alerting"
}
# Each alert must have a 'severity' annotation that's either '24x7','10x5' or 'debug'.
label "severity" {
severity = "bug"
value = "(24x7|10x5|debug)"
required = true
}
annotation "runbook_url" {
severity = "warning"
required = true
}
}
rule {
# Disallow spaces in label/annotation keys, they're only allowed in values.
reject ".* +.*" {
label_keys = true
annotation_keys = true
}
# Disallow URLs in labels, they should go to annotations.
reject "https?://.+" {
label_keys = true
label_values = true
}
# Check how many times each alert would fire in the last 1d.
alerts {
range = "1d"
step = "1m"
resolve = "5m"
}
# Check if '{{ $value }}'/'{{ .Value }}' is used in labels
# https://www.robustperception.io/dont-put-the-value-in-alert-labels
value {}
}
It's basically a copy of the one available as an example.
It looks like the escaped newlines in description: "Some server healthchecks are failing on {{ $labels.server }}\n VALUE = {{ $value }}\n LABELS: {{ $labels }}" are turned into real newlines when parsing the YAML. So when the description value is accessed it's 3 lines rather than 1, and that's how we end up with the wrong line range for this field.
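
As an illustration of the mismatch, here's a minimal sketch using gopkg.in/yaml.v3 (assuming that's the go-yaml flavour in use):

package main

import (
	"fmt"
	"strings"

	"gopkg.in/yaml.v3"
)

func main() {
	// One physical line in the source file; the \n escapes only become
	// real newlines once the double-quoted YAML scalar is decoded.
	src := `description: "HAProxy down\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"`

	var doc map[string]string
	if err := yaml.Unmarshal([]byte(src), &doc); err != nil {
		panic(err)
	}

	fmt.Println(len(strings.Split(src, "\n")))                // 1 line in the file
	fmt.Println(len(strings.Split(doc["description"], "\n"))) // 3 lines after decoding
}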
This adds to the other problems with trying to use go-yaml to parse files while retaining file positions, and being able to use those positions to accurately point at problems.
I'll try to work around this (there are already some hacks around go-yaml position handling); if that's not possible, we'll emit a big warning that positions might be wrong when printing to the console.
Interesting, I think the warning is a good option. When would you emit this warning? Whenever you see a newline in the text, or when there's a mismatch with the line count?
When we try to read more lines than the source file has.
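
In other words, roughly this check (a hypothetical helper, assuming "strings" is imported; not pint's actual code):

// positionsSuspect reports whether the parser-reported range extends past
// the end of the source file, in which case any line numbers we print may
// be wrong. Hypothetical helper, not pint's actual code.
func positionsSuspect(content string, lastLine int) bool {
	return lastLine > len(strings.Split(content, "\n"))
}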
Added a workaround for now; the root issue still needs to be addressed.
I just got a panic on this line: https://github.com/cloudflare/pint/blob/452a61ca4aaaca44bade5302879657454d233d06/internal/reporter/console.go#L71
It seems like the check can still fail sometimes.
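
A possible shape for a more defensive guard in the reporter (hypothetical names, not the actual console.go code):

// safeLine returns the source line for a 1-based line number, or false when
// the reported position falls outside the file, so the reporter can degrade
// gracefully instead of panicking. Hypothetical helper.
func safeLine(lines []string, n int) (string, bool) {
	if n < 1 || n > len(lines) {
		return "", false
	}
	return lines[n-1], true
}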