cloudflare / pint

Prometheus rule linter/validator
https://cloudflare.github.io/pint/
Apache License 2.0

Panic, slice index out of bounds #20

Open xocasdashdash opened 3 years ago

xocasdashdash commented 3 years ago

I just got this panic:

panic: runtime error: slice bounds out of range [:243] with capacity 242

goroutine 1 [running]:
github.com/cloudflare/pint/internal/reporter.ConsoleReporter.Submit(0xbb2440, 0xc00011e010, 0xc0012e8000, 0xd3, 0x143, 0xbbddd0, 0xc00026cab0, 0x2, 0x2)
    /home/joaquin/projects/personal/github/pint/internal/reporter/console.go:71 +0x1073
main.actionLint(0xc0001a7740, 0x2, 0x2)
    /home/joaquin/projects/personal/github/pint/cmd/pint/lint.go:47 +0x56a
github.com/urfave/cli/v2.(*Command).Run(0xc00016d440, 0xc0001a7600, 0x0, 0x0)
    /home/joaquin/.asdf/installs/golang/1.16.3/packages/pkg/mod/github.com/urfave/cli/v2@v2.3.0/command.go:163 +0x4dd
github.com/urfave/cli/v2.(*App).RunContext(0xc0002321a0, 0xbbd5f0, 0xc00011a010, 0xc000124000, 0x5, 0x5, 0x0, 0x0)
    /home/joaquin/.asdf/installs/golang/1.16.3/packages/pkg/mod/github.com/urfave/cli/v2@v2.3.0/app.go:313 +0x810
github.com/urfave/cli/v2.(*App).Run(...)
    /home/joaquin/.asdf/installs/golang/1.16.3/packages/pkg/mod/github.com/urfave/cli/v2@v2.3.0/app.go:224
main.main()
    /home/joaquin/projects/personal/github/pint/cmd/pint/main.go:72 +0x106

on this line: https://github.com/cloudflare/pint/blob/452a61ca4aaaca44bade5302879657454d233d06/internal/reporter/console.go#L71

It seems like the slicing there can sometimes go out of bounds.
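
For reference, here's a minimal standalone Go program (an illustration of the same class of failure, not pint's code) that panics with the same kind of message when the upper slice bound exceeds the number of lines:

package main

import "strings"

func main() {
    content := "line1\nline2" // a file with only 2 lines
    lines := strings.Split(content, "\n")

    // Hypothetical line range that points past the end of the file.
    firstLine, lastLine := 1, 3

    // panics: slice bounds out of range [:3] with capacity 2
    _ = lines[firstLine-1 : lastLine]
}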

prymitive commented 3 years ago

Would you mind sharing a rule file it fails on?

xocasdashdash commented 3 years ago

I had to bisect all the rules (it's hard to know which one it's reporting on), but this one makes it fail consistently. Please note there's an empty line at the end:

groups:
  - name: "haproxy.anomaly_detection"
    rules:
      - record: haproxy:healthcheck_failure:rate5m
        expr: |
          sum without(instance, namespace,job, service, endpoint)
          (rate(haproxy_server_check_failures_total[5m]))
      - record: haproxy:healthcheck_failure:rate5m:avg_over_time_1w
        expr: avg_over_time(haproxy:healthcheck_failure:rate5m[1w])
      - record: haproxy:healthcheck_failure:rate5m:stddev_over_time_1w
        expr: stddev_over_time(haproxy:healthcheck_failure:rate5m[1w])
  - name: "haproxy.api_server.rules"
    rules:
      - alert: HaproxyHealtCheckAnomaly
        expr: |
          abs((
          haproxy:healthcheck_failure:rate5m-
          haproxy:healthcheck_failure:rate5m:avg_over_time_1w
          ) / haproxy:healthcheck_failure:rate5m:stddev_over_time_1w) > 3
        for: 10m
        labels: 
          severity: debug
          kind: "K8sApi"
        annotations: 
          summary: "HAproxy is detecting more failures than usual on its health checks"
          description: |
            This value represents the absolute z-score. Here https://about.gitlab.com/blog/2019/07/23/anomaly-detection-using-prometheus/ you
            can read more about how we're using it
          runbook_url: "Check that HAProxy is communicating with the k8s server nodes"
      - alert: HaproxyApiMasterDown
        expr: haproxy_up{server=~".*master.*"} == 0
        for: 15m
        labels:
          severity: 10x5
          node: "{{ $labels.instance }}"
          kind: K8sApiMaster
        annotations:
          summary: "HAProxy master is down (instance {{ $labels.instance }})"
          description: "HAProxy down\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
      - alert: HaproxyApiMasterDown
        expr: haproxy_up{server=~".*master.*"} == 0
        for: 24h
        labels:
          severity: 10x5
          node: "{{ $labels.instance }}"
          kind: K8sApiMaster
        annotations:
          summary: "HAProxy master is down (instance {{ $labels.instance }})"
          description: "HAProxy down\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
      - alert: HaproxyApiMasterDown
        expr: count(haproxy_server_up{server=~".*master.*"}==0) by(instance,backend) > 1
        for: 5m
        labels:
          severity: 24x7
          node: "{{ $labels.instance }}"
          kind: K8sApiMaster
        annotations:
          summary: "Multiple K8s master nodes are down (instance {{ $labels.instance }})"
          description: "HAProxy down\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
      - alert: HaproxyApiMasterDown
        expr: count(haproxy_server_up{server=~".*master.*"}==0) by(instance,backend) > 2
        for: 2m
        labels:
          severity: 24x7
          node: "{{ $labels.instance }}"
          kind: K8sApiMaster
          inhibits: K8sApiInfra
        annotations:
          summary: "All K8s master nodes are down (instance {{ $labels.instance }})"
          description: "HAProxy down\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
      - alert: HaproxyApiInfraDown
        expr: haproxy_up{server=~".*infra.*"} == 0
        for: 15m
        labels:
          severity: 10x5
          node: "{{ $labels.instance }}"
          kind: K8sApiInfra
        annotations:
          summary: "HAProxy infra is down (server {{ $labels.server }})"
          description: "HAProxy down\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

      - alert: HaproxyHighHttp4xxErrorRateBackend
        expr: |
          sum by (backend) (rate(haproxy_server_http_responses_total{code="4xx"}[1m])) / sum by (backend) (rate(haproxy_server_http_responses_total{}[1m])) * 100 > 1
        for: 15m
        labels:
          severity: debug
        annotations:
          summary: "HAProxy high HTTP 4xx error rate backend (instance {{ $labels.instance }})"
          description: "Too many HTTP requests with status 4xx (> 1%) on backend {{ $labels.backend }}\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

      - alert: HaproxyHighHttp4xxErrorRateBackend
        expr: |
          sum by (backend) (rate(haproxy_server_http_responses_total{code="4xx"}[1m])) / sum by (backend) (rate(haproxy_server_http_responses_total{}[1m])) * 100 > 3
        for: 15m
        labels:
          severity: 10x5
        annotations:
          summary: "HAProxy high HTTP 4xx error rate backend (instance {{ $labels.instance }})"
          description: "Too many HTTP requests with status 4xx (> 3%) on backend {{ $labels.backend }}\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
      - alert: HaproxyHighHttp4xxErrorRateBackend
        expr: |
          sum by (backend) (rate(haproxy_server_http_responses_total{code="4xx"}[1m])) / sum by (backend) (rate(haproxy_server_http_responses_total{}[1m])) * 100 > 5
        for: 15m
        labels:
          severity: 24x7
        annotations:
          summary: "HAProxy high HTTP 4xx error rate backend (instance {{ $labels.instance }})"
          description: "Too many HTTP requests with status 4xx (> 5%) on backend {{ $labels.backend }}\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

      - alert: HaproxyHighHttp5xxErrorRateBackend
        expr: sum by (backend) (rate(haproxy_server_http_responses_total{code="5xx"}[1m])) / sum by (backend) (rate(haproxy_server_http_responses_total{}[1m])) *100 > 1
        for: 15m
        labels:
          severity: debug
        annotations:
          summary: "HAProxy high HTTP 5xx error rate backend (instance {{ $labels.instance }})"
          description: "Too many HTTP requests with status 5xx (> 5%) on backend {{ $labels.backend }}\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
      - alert: HaproxyHighHttp5xxErrorRateBackend
        expr: sum by (backend) (rate(haproxy_server_http_responses_total{code="5xx"}[1m])) / sum by (backend) (rate(haproxy_server_http_responses_total{}[1m])) *100 > 3
        for: 15m
        labels:
          severity: 10x5
        annotations:
          summary: "HAProxy high HTTP 5xx error rate backend (instance {{ $labels.instance }})"
          description: "Too many HTTP requests with status 5xx (> 5%) on backend {{ $labels.backend }}\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
      - alert: HaproxyHighHttp5xxErrorRateBackend
        expr: sum by (backend) (rate(haproxy_server_http_responses_total{code="5xx"}[1m])) / sum by (backend) (rate(haproxy_server_http_responses_total{}[1m])) *100 > 5
        for: 15m
        labels:
          severity: 24x7
        annotations:
          summary: "HAProxy high HTTP 5xx error rate backend (instance {{ $labels.instance }})"
          description: "Too many HTTP requests with status 5xx (> 5%) on backend {{ $labels.backend }}\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

      - alert: HaproxyHighHttp4xxErrorRateServer
        expr: sum by (server) (rate(haproxy_server_http_responses_total{code="4xx"}[1m])) / sum by (backend) (rate(haproxy_server_http_responses_total{}[1m])) * 100 > 5
        for: 5m
        labels:
          severity: debug
        annotations:
          summary: "HAProxy high HTTP 4xx error rate server (instance {{ $labels.instance }})"
          description: "Too many HTTP requests with status 4xx (> 5%) on server {{ $labels.server }}\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

      - alert: HaproxyHighHttp5xxErrorRateServer
        expr: sum by (server) (rate(haproxy_server_http_responses_total{code="5xx"}[1m])) / sum by (backend) (rate(haproxy_server_http_responses_total{}[1m])) * 100 > 5
        for: 5m
        labels:
          severity: debug
        annotations:
          summary: "HAProxy high HTTP 5xx error rate server (instance {{ $labels.instance }})"
          description: "Too many HTTP requests with status 5xx (> 5%) on server {{ $labels.server }}\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

      - alert: HaproxyBackendConnectionErrors
        expr: sum by (backend)(rate(haproxy_backend_connection_errors_total[1m])) * 100 > 5
        for: 5m
        labels:
          severity: 10x5
        annotations:
          summary: "HAProxy backend connection errors (instance {{ $labels.instance }})"
          description: "Too many connection errors to {{ $labels.fqdn }}/{{ $labels.backend }} backend (> 5%). Request throughput may be to high.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

      - alert: HaproxyBackendConnectionErrors
        expr: sum by (backend) (rate(haproxy_backend_connection_errors_total[1m])) * 100 > 35
        for: 5m
        labels:
          severity: 24x7
        annotations:
          summary: "HAProxy backend connection errors (instance {{ $labels.instance }})"
          description: "Too many connection errors to {{ $labels.fqdn }}/{{ $labels.backend }} backend (> 5%). Request throughput may be to high.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

      - alert: HaproxyServerResponseErrors
        expr: sum by (server)(rate(haproxy_server_response_errors_total[1m])) * 100 > 5
        for: 5m
        labels:
          severity: debug
        annotations:
          summary: "HAProxy server response errors (instance {{ $labels.instance }})"
          description: "Too many response errors to {{ $labels.server }} server (> 5%).\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

      - alert: HaproxyServerConnectionErrors
        expr: sum by (server)(rate(haproxy_server_connection_errors_total[1m])) * 100 > 5
        for: 5m
        labels:
          severity: debug
        annotations:
          summary: "HAProxy server connection errors (instance {{ $labels.instance }})"
          description: "Too many connection errors to {{ $labels.server }} server (> 5%). Request throughput may be to high.\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

      - alert: HaproxyPendingRequests
        expr: sum by (backend) (haproxy_backend_current_queue) > 0
        for: 5m
        labels:
          severity: 10x5
        annotations:
          summary: "HAProxy pending requests (instance {{ $labels.instance }})"
          description: "Some HAProxy requests are pending on {{ $labels.fqdn }}/{{ $labels.backend }} backend\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"      

      - alert: HaproxyRetryHigh
        expr: sum by (backend)(rate(haproxy_backend_retry_warnings_total[1m])) > 10
        for: 5m
        labels:
          severity: debug
        annotations:
          summary: "HAProxy retry high (instance {{ $labels.instance }})"
          description: "High rate of retry on {{ $labels.fqdn }}/{{ $labels.backend }} backend\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

      - alert: HaproxyBackendDown
        expr: haproxy_backend_up == 0
        for: 5m
        labels:
          severity: 10x5
        annotations:
          summary: "HAProxy backend down (instance {{ $labels.instance }})"
          description: "HAProxy backend is down\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

      - alert: HaproxyFrontendSecurityBlockedRequests
        expr: sum by (frontend)(rate(haproxy_frontend_requests_denied_total[5m])) > 10
        for: 5m
        labels:
          severity: debug
        annotations:
          summary: "HAProxy frontend security blocked requests (instance {{ $labels.instance }})"
          description: "HAProxy is blocking requests for security reason\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

      - alert: HaproxyServerHealthcheckFailure
        expr: increase(haproxy_server_check_failures_total[15m]) > 10
        for: 5m
        labels:
          severity: 10x5
        annotations:
          summary: "HAProxy server healthcheck failure (instance {{ $labels.instance }})"
          description: "Some server healthcheck are failing on {{ $labels.server }}\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
      - alert: HaproxyServerHealthcheckFailure
        expr: increase(haproxy_server_check_failures_total[15m]) > 100
        for: 5m
        labels:
          severity: 24x7
        annotations:
          summary: "HAProxy server healthcheck failure (instance {{ $labels.instance }})"
          description: "Some server healthcheck are failing on {{ $labels.server }}\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
xocasdashdash commented 3 years ago

Just changing these 4 lines makes it work:

lines := strings.Split(content, "\n")
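// Clamp lastLine so the slice below can never go past the end of the file.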
if lastLine >= len(lines) {
    lastLine = len(lines) - 1
}
for _, c := range lines[firstLine-1 : lastLine] {

But it should probably be fixed in the line detection code. You can reduce the test case to a single alert that is the last one in the file, like this:

groups:
  - name: "haproxy.api_server.rules"
    rules:
      - alert: HaproxyServerHealthcheckFailure
        expr: increase(haproxy_server_check_failures_total[15m]) > 100
        for: 5m
        labels:
          severity: 24x7
        annotations:
          summary: "HAProxy server healthcheck failure (instance {{ $labels.instance }})"
          description: "Some server healthcheck are failing on {{ $labels.server }}\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
prymitive commented 3 years ago

Thanks! Can you share your pint config too?

xocasdashdash commented 3 years ago

Sure!

prometheus "01-live" {
  uri     = "a valid prometheus url"
  timeout = "60s"
}
prometheus "01-work" {
  uri     = "a valid prometheus url"
  timeout = "60s"
}
rule {
  match {
    kind = "alerting"
  }
  # Each alert must have a 'severity' label that's either '24x7', '10x5' or 'debug'.
  label "severity" {
    severity = "bug"
    value    = "(24x7|10x5|debug)"
    required = true
  }
  annotation "runbook_url" {
    severity = "warning"
    required = true
  }
}

rule {
  # Disallow spaces in label/annotation keys, they're only allowed in values.
  reject ".* +.*" {
    label_keys      = true
    annotation_keys = true
  }

  # Disallow URLs in labels, they should go to annotations.
  reject "https?://.+" {
    label_keys   = true
    label_values = true
  }
  # Check how many times each alert would fire in the last 1d.
  alerts {
    range   = "1d"
    step    = "1m"
    resolve = "5m"
  }
  # Check if '{{ $value }}'/'{{ .Value }}' is used in labels
  # https://www.robustperception.io/dont-put-the-value-in-alert-labels
  value {}
}

It's basically a copy of the one available as an example.

prymitive commented 3 years ago

It looks like the escaped new lines in description: "Some server healthcheck are failing on {{ $labels.server }}\n VALUE = {{ $value }}\n LABELS: {{ $labels }}" are turned into real new lines when the YAML is parsed. So when the description value is accessed it's 3 lines rather than 1, and that's how we end up with the wrong line range for this field. This adds to other problems with trying to use go-yaml to parse files while retaining file positions and being able to use them to accurately point at problems. I'll try to work around this (there are already some hacks around go-yaml position handling); if that's not possible we'll emit a big warning that the position might be wrong when printing to the console.
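
A minimal sketch of the effect (assuming gopkg.in/yaml.v3 purely for illustration, not pint's actual parsing code):

package main

import (
    "fmt"
    "strings"

    "gopkg.in/yaml.v3"
)

func main() {
    // One physical line in the source file, using escaped "\n" sequences.
    src := `description: "first\n  second\n  third"`

    var doc map[string]string
    if err := yaml.Unmarshal([]byte(src), &doc); err != nil {
        panic(err)
    }

    // After decoding, the escapes are real newlines: the value spans 3 lines
    // even though the source only has 1, so any line range derived from the
    // decoded value no longer matches the file on disk.
    fmt.Println(strings.Count(doc["description"], "\n") + 1) // prints 3
}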

xocasdashdash commented 3 years ago

Interesting, I think the warning is a good option. When do you think you'd emit this warning? Whenever you see a newline in the text? Or when there's a mismatch with the line count?

prymitive commented 3 years ago

When we try to read more lines than the source file has.
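
A rough sketch of that condition, reusing the content/firstLine/lastLine names from the snippet above (an assumption, not the actual implementation):

lines := strings.Split(content, "\n")
if lastLine > len(lines) {
    // The reported range runs past the end of the file on disk, so the
    // printed position may be wrong: warn and clamp instead of panicking.
    log.Printf("WARN: lines %d-%d are outside of the file (%d lines), reported positions might be inaccurate", firstLine, lastLine, len(lines))
    lastLine = len(lines)
}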

prymitive commented 3 years ago

Added a workaround for now; the root issue still needs to be addressed.