canonical / grafana-agent-k8s-operator

https://charmhub.io/grafana-agent-k8s
Apache License 2.0

Add more alert rules for memory #185

Closed rgildein closed 1 year ago

rgildein commented 1 year ago

Context

Moving memory NRPE checks from charm-nrpe.

Testing Instructions

Tested with

rule_files:
  - memory.rules

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      - series: 'node_memory_MemFree_bytes{instance="test-model_1234_test-app_test-app/0"}'
        values: '450 380 250 92 50 31 29 5 5 5 48 150 380 450 450'
      - series: 'node_memory_Cached_bytes{instance="test-model_1234_test-app_test-app/0"}'
        values: '0x15'
      - series: 'node_memory_Buffers_bytes{instance="test-model_1234_test-app_test-app/0"}'
        values: '0x15'
      - series: 'node_memory_SReclaimable_bytes{instance="test-model_1234_test-app_test-app/0"}'
        values: '0x15'
      - series: 'node_memory_MemTotal_bytes{instance="test-model_1234_test-app_test-app/0"}'
        values: '512x15'
      - series: 'node_memory_SwapFree_bytes{instance="test-model_1234_test-app_test-app/0"}'
        values: '128x5 40 40 40 40 40 128x5'
      - series: 'node_memory_SwapCached_bytes{instance="test-model_1234_test-app_test-app/0"}'
        values: '0x15'
      - series: 'node_memory_SwapTotal_bytes{instance="test-model_1234_test-app_test-app/0"}'
        values: '128x15'
    promql_expr_test:
      - expr: node_memory_MemUsed_percentage
        eval_time: 4m
        exp_samples:
          - labels: 'node_memory_MemUsed_percentage{instance="test-model_1234_test-app_test-app/0"}'
            value: 90.234375
      - expr: node_memory_MemUsed_percentage
        eval_time: 5m
        exp_samples:
          - labels: 'node_memory_MemUsed_percentage{instance="test-model_1234_test-app_test-app/0"}'
            value: 93.9453125
      - expr: node_memory_MemUsed_percentage
        eval_time: 6m
        exp_samples:
          - labels: 'node_memory_MemUsed_percentage{instance="test-model_1234_test-app_test-app/0"}'
            value: 94.3359375
      - expr: node_memory_MemUsed_percentage
        eval_time: 7m
        exp_samples:
          - labels: 'node_memory_MemUsed_percentage{instance="test-model_1234_test-app_test-app/0"}'
            value: 99.0234375
      - expr: node_memory_MemUsed_percentage
        eval_time: 8m
        exp_samples:
          - labels: 'node_memory_MemUsed_percentage{instance="test-model_1234_test-app_test-app/0"}'
            value: 99.0234375
      - expr: avg_over_time(node_memory_MemUsed_percentage[1m])
        eval_time: 4m
        exp_samples:
          - labels: '{instance="test-model_1234_test-app_test-app/0"}'
            value: 86.1328125
      - expr: avg_over_time(node_memory_MemUsed_percentage[1m])
        eval_time: 5m
        exp_samples:
          - labels: '{instance="test-model_1234_test-app_test-app/0"}'
            value: 92.08984375
      - expr: avg_over_time(node_memory_MemUsed_percentage[1m])
        eval_time: 6m
        exp_samples:
          - labels: '{instance="test-model_1234_test-app_test-app/0"}'
            value: 94.140625
      - expr: avg_over_time(node_memory_MemUsed_percentage[1m])
        eval_time: 7m
        exp_samples:
          - labels: '{instance="test-model_1234_test-app_test-app/0"}'
            value: 96.6796875
      - expr: avg_over_time(node_memory_MemUsed_percentage[1m])
        eval_time: 8m
        exp_samples:
          - labels: '{instance="test-model_1234_test-app_test-app/0"}'
            value: 99.0234375
      - expr: avg_over_time(node_memory_MemUsed_percentage[1m])
        eval_time: 9m
        exp_samples:
          - labels: '{instance="test-model_1234_test-app_test-app/0"}'
            value: 99.0234375
      - expr: avg_over_time(node_memory_SwapUsed_percentage[1m])
        eval_time: 8m
        exp_samples:
          - labels: '{instance="test-model_1234_test-app_test-app/0"}'
            value: 68.75
      - expr: avg_over_time(node_memory_SwapUsed_percentage[1m])
        eval_time: 9m
        exp_samples:
          - labels: '{instance="test-model_1234_test-app_test-app/0"}'
            value: 68.75
    alert_rule_test:
      - eval_time: 9m
        alertname: HostMemoryFull
        exp_alerts:
          - exp_labels:
              severity: critical
              instance: test-model_1234_test-app_test-app/0
            exp_annotations:
              summary: "Host memory usage reached 99% load (instance test-model_1234_test-app_test-app/0)"
              description: >-
                Host memory usage reached 99%
                  VALUE = 99.0234375
                  LABELS = map[instance:test-model_1234_test-app_test-app/0]
      - eval_time: 9m
        alertname: HostSwapFull
        exp_alerts:
          - exp_labels:
              severity: critical
              instance: test-model_1234_test-app_test-app/0
            exp_annotations:
              summary: "Host memory and swap usage reached 90% and 50% load (instance test-model_1234_test-app_test-app/0)"
              description: >-
                Host memory and swap usage reached 90% and 50% load
                  VALUE = 99.0234375
                  LABELS = map[instance:test-model_1234_test-app_test-app/0]

  # test for prediction of memory usage
  - interval: 1m
    input_series:
      - series: 'node_memory_MemFree_bytes{instance="test-model_1234_test-app_test-app/0"}'
        values: '95-2.5x36 5x24'
      - series: 'node_memory_Cached_bytes{instance="test-model_1234_test-app_test-app/0"}'
        values: '0x60'
      - series: 'node_memory_Buffers_bytes{instance="test-model_1234_test-app_test-app/0"}'
        values: '0x60'
      - series: 'node_memory_SReclaimable_bytes{instance="test-model_1234_test-app_test-app/0"}'
        values: '0x60'
      - series: 'node_memory_MemTotal_bytes{instance="test-model_1234_test-app_test-app/0"}'
        values: '100x60'
    promql_expr_test:
      - expr: node_memory_MemUsed_percentage
        eval_time: 30m  # 95 - 75 = 20
        exp_samples:
          - labels: 'node_memory_MemUsed_percentage{instance="test-model_1234_test-app_test-app/0"}'
            value: 80
      - expr: node_memory_MemUsed_percentage
        eval_time: 34m  # 95 - 85 = 10
        exp_samples:
          - labels: 'node_memory_MemUsed_percentage{instance="test-model_1234_test-app_test-app/0"}'
            value: 90
      - expr: node_memory_MemUsed_percentage
        eval_time: 36m  # 95 - 90 = 5
        exp_samples:
          - labels: 'node_memory_MemUsed_percentage{instance="test-model_1234_test-app_test-app/0"}'
            value: 95
      - expr: predict_linear(node_memory_MemUsed_percentage[30m], 5*60)
        eval_time: 30m
        exp_samples:
          - labels: '{instance="test-model_1234_test-app_test-app/0"}'
            value: 92.5
      - expr: predict_linear(node_memory_MemUsed_percentage[30m], 5*60)
        eval_time: 31m
        exp_samples:
          - labels: '{instance="test-model_1234_test-app_test-app/0"}'
            value: 95
      - expr: predict_linear(node_memory_MemUsed_percentage[30m], 5*60)
        eval_time: 32m
        exp_samples:
          - labels: '{instance="test-model_1234_test-app_test-app/0"}'
            value: 97.5
    alert_rule_test:
      - eval_time: 30m
        alertname: HostMemoryFillsUp
        exp_alerts: []  # no alert
      - eval_time: 31m
        alertname: HostMemoryFillsUp
        exp_alerts:
          - exp_labels:
              severity: warning
              instance: test-model_1234_test-app_test-app/0
            exp_annotations:
              summary: "[Prediction] Host memory usage will increase to 95% in the near future (instance test-model_1234_test-app_test-app/0)"
              description: >-
                Host can potentially reach 95% memory utilization and risk an OOM kill.
                  VALUE = 95
                  LABELS = map[instance:test-model_1234_test-app_test-app/0]
                The 5-minute-ahead prediction is made as a linear regression from the last 30 minutes of data.
      - eval_time: 32m
        alertname: HostMemoryFillsUp
        exp_alerts:
          - exp_labels:
              severity: warning
              instance: test-model_1234_test-app_test-app/0
            exp_annotations:
              summary: "[Prediction] Host memory usage will increase to 98% in the near future (instance test-model_1234_test-app_test-app/0)"
              description: >-
                Host can potentially reach 98% memory utilization and risk an OOM kill.
                  VALUE = 97.5
                  LABELS = map[instance:test-model_1234_test-app_test-app/0]
                The 5-minute-ahead prediction is made as a linear regression from the last 30 minutes of data.
      - eval_time: 33m
        alertname: HostMemoryFillsUp
        exp_alerts:
          - exp_labels:
              severity: warning
              instance: test-model_1234_test-app_test-app/0
            exp_annotations:
              summary: "[Prediction] Host memory usage will increase to 100% in the near future (instance test-model_1234_test-app_test-app/0)"
              description: >-
                Host can potentially reach 100% memory utilization and risk an OOM kill.
                  VALUE = 100
                  LABELS = map[instance:test-model_1234_test-app_test-app/0]
                The 5-minute-ahead prediction is made as a linear regression from the last 30 minutes of data.
      - eval_time: 34m
        alertname: HostMemoryFillsUp
        exp_alerts:
          - exp_labels:
              severity: warning
              instance: test-model_1234_test-app_test-app/0
            exp_annotations:
              summary: "[Prediction] Host memory usage will increase to 102% in the near future (instance test-model_1234_test-app_test-app/0)"
              description: >-
                Host can potentially reach 102% memory utilization and risk an OOM kill.
                  VALUE = 102.5
                  LABELS = map[instance:test-model_1234_test-app_test-app/0]
                The 5-minute-ahead prediction is made as a linear regression from the last 30 minutes of data.
      - eval_time: 35m
        alertname: HostMemoryFillsUp
        exp_alerts: []  # no alert
      - eval_time: 40m
        alertname: HostMemoryFillsUp
        exp_alerts: []  # no alert during stable high memory load

  - interval: 1m
    input_series:
      - series: 'node_vmstat_pgmajfault{instance="test-model_1234_test-app_test-app/0"}'
        values: '50000+75000x10'
    alert_rule_test:
      - eval_time: 5m
        alertname: HostMemoryUnderMemoryPressure
        exp_alerts:
          - exp_labels:
              severity: warning
              instance: test-model_1234_test-app_test-app/0
            exp_annotations:
              summary: Host memory under memory pressure (instance test-model_1234_test-app_test-app/0)
              description: >-
                The node is under heavy memory pressure. High rate of major page faults.
                  VALUE = 1250
                  LABELS = map[instance:test-model_1234_test-app_test-app/0]

and with promtool:

x1:➜  prometheus_alert_rules git:(nrpe/memory-aler-rules) ✗ promtool test rules ./test_memory.yaml
Unit Testing:  ./test_memory.yaml
  SUCCESS
                                                                                                  [0.13s]
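For context, the recording rules the tests exercise live in memory.rules, which is not included in this issue. A plausible sketch, reverse-engineered from the expected sample values (the rule names match the tests; the expressions and the alert threshold are assumptions):

```yaml
# Inferred sketch of memory.rules — the real file is in the linked PR.
groups:
  - name: memory
    rules:
      - record: node_memory_MemUsed_percentage
        expr: >-
          100 * (1 - (node_memory_MemFree_bytes + node_memory_Cached_bytes +
          node_memory_Buffers_bytes + node_memory_SReclaimable_bytes) /
          node_memory_MemTotal_bytes)
      - record: node_memory_SwapUsed_percentage
        expr: >-
          100 * (1 - (node_memory_SwapFree_bytes + node_memory_SwapCached_bytes) /
          node_memory_SwapTotal_bytes)
      - alert: HostMemoryFull
        expr: avg_over_time(node_memory_MemUsed_percentage[1m]) >= 99
        labels:
          severity: critical
        annotations:
          summary: "Host memory usage reached 99% load (instance {{ $labels.instance }})"
```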

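As a sanity check, the expected sample values can be reproduced by hand. A short Python sketch (not part of the PR; the MemUsed formula is inferred from the input series and expected samples):

```python
# MemUsed% = 100 * (1 - (MemFree + Cached + Buffers + SReclaimable) / MemTotal),
# with Cached/Buffers/SReclaimable all zero in the test series.
mem_total = 512.0
mem_free_4m = 50.0  # 5th value of '450 380 250 92 50 ...'
mem_used_pct = 100 * (1 - mem_free_4m / mem_total)
print(mem_used_pct)  # 90.234375, the expected sample at eval_time: 4m

# SwapUsed% while SwapFree is 40 out of 128 (SwapCached is 0):
swap_used_pct = 100 * (1 - 40 / 128)
print(swap_used_pct)  # 68.75

# Prediction block: '95-2.5x36' makes MemUsed% climb 2.5/min, so the
# 5-minute-ahead predict_linear at t=31m is 82.5 + 2.5*5:
pred_31m = (100 - (95 - 2.5 * 31)) + 2.5 * 5
print(pred_31m)  # 95.0, matching the expected HostMemoryFillsUp sample
```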
Release Notes

rgildein commented 1 year ago

I want to ask if we can add unit tests for alert rules to the repo and integrate them with CI? It would be useful for future improvements and maintenance.

rbarry82 commented 1 year ago

> I want to ask if we can add unit tests for alert rules to the repo and integrate it with CI? it will be useful for future improvement and maintenance.

What kind of unit tests did you have in mind? We have some in the Prometheus/Loki repos (and pretty much all of the unit tests are also integrated with CI), but without knowing the general outline of what your idea was, it's hard to say conclusively whether it's already done, possible to implement, or a non-starter.

rgildein commented 1 year ago

> What kind of unit tests did you have in mind? We have some in the Prometheus/Loki repos (and pretty much all of the unit tests are also integrated with CI), but without knowing the general outline of what your idea was, it's hard to say conclusively whether it's already done, possible to implement, or a non-starter.

I used promtool to run unit tests for the alert rules, which I provided in the description. I was following the official documentation for unit testing rules.

rbarry82 commented 1 year ago

> I used promtool to run unit tests for alert rules, which I provided in the description. I was following official documentation for unit testing rules.

Without seeing what the promtool "unit tests" do, it's really hard to say what this does.

cos-tool can validate whether rules are valid or not (which appears to be what this is doing), and this is done in other repos. But when we speak about unit tests in the context of charms, we usually mean something with much more specificity. We do also validate rules as part of some of these, if that's all you're looking for, but they're also (additionally) validated at runtime, with appropriate messages sent back in relation data if they aren't.

rgildein commented 1 year ago

> I used promtool to run unit tests for alert rules, which I provided in the description. I was following official documentation for unit testing rules.

> Without seeing what the promtool "unit tests" do, it's really hard to say what this does.

> cos-tool can validate whether rules are valid or not (which appears to be what this is doing), and this is done in other repos. But when we speak about unit tests in the context of charms, we usually mean something with much more specificity. We do also validate rules as parts of some of these, if that's all your looking for, but they're also (additionally) validated at runtime with appropriate messages sent back in relation data if they aren't.

The "unit testing" with promtool is not only for validating the format of the rules, but more for validating when those alerts actually fire: you define test metrics and then check the expected outputs of queries, or expect an alert to fire. For me this is useful, since I'm new to PromQL and it's also hard to create an environment with high network usage, errors, etc.

In the unit tests you mention we are only checking (if I'm right) that the rules exist and how those rule files are processed, but not the actual alerting. That's why I think it would be nice to somehow integrate these "unit" tests (I do not like the name), which can be run with promtool (it is part of the Prometheus snap).
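For the CI part, a hedged sketch of what a workflow step could look like (the step name and file path are assumptions; promtool can come from the Prometheus snap or a release tarball):

```yaml
# Hypothetical GitHub Actions step running the promtool rule tests.
- name: Unit test alert rules
  run: promtool test rules tests/alert-rules/tests.yaml
```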

rbarry82 commented 1 year ago

> I used promtool to run unit tests for alert rules, which I provided in the description. I was following official documentation for unit testing rules.

> Without seeing what the promtool "unit tests" do, it's really hard to say what this does. cos-tool can validate whether rules are valid or not (which appears to be what this is doing), and this is done in other repos. But when we speak about unit tests in the context of charms, we usually mean something with much more specificity. We do also validate rules as parts of some of these, if that's all your looking for, but they're also (additionally) validated at runtime with appropriate messages sent back in relation data if they aren't.

> The "unit testing" w/ promtool is not only for validating format of rules, but more for validating when those alerts are actually firing. By defining the test metrics and then checking expected outputs of queries or expecting fire from alert. For me these is useful since I'm new in PromQL and also it's hard to create environment with high network usage / errors / etc.

> In the unit tests you mention we are only checking (if I'm right) rules existence and how those rules files are proceed, but not actual alerting. That's why I think it would be nice to somehow integrate these "unit" (I do not like the name) tests, which can be tested w/ promtool (is part of Prometheus snap).

I'd suggest that this is more functional/integration testing than unit testing (as much as it kind of straddles the middle ground).

But as above, the "right" way to do this in the charming ecosystem is actually by adding an integration test which would deploy prometheus-k8s and grafana-agent (as a subordinate to some machine charm) as actual charms, relate them, and ensure that the rules loaded successfully and/or that the metrics which they would alert on are present. Triggering the alert itself after that would take tweaking, but isn't strictly necessary.

From a charming POV, promtool isn't particularly relevant or useful. We'd be interested in seeing whether the annotated alerts/metrics (with juju topology inserted by cos-tool) for some appropriate client are present/firing.

I do nominally see a use in what you're proposing, but more from a "pure" Prometheus POV than a charmed one.

rgildein commented 1 year ago

> I'd suggest that this is more functional/integration testing than unit testing (as much as it kind of straddles the middle ground).

> But as above, the "right" way to do this in the charming ecosystem is actually by adding an integration test which would deploy prometheus-k8s and grafana-agent (as a subordinate to some machine charm) as actual charms, relate them, and ensure that the rules loaded successfully and/or that the metrics which they would alert on are present. Triggering the alert itself after that would take tweaking, but isn't strictly necessary.

> From a charming POV, promtool isn't particularly relevant or useful. We'd be interested in seeing whether the annotated alerts/metrics (with juju topology inserted by cos-tool) for some appropriate client are present/firing.

> I do nominally see a use in what you're proposing, but more from a "pure" Prometheus POV than a charmed one.

I completely agree with this: these are not really unit tests. I was thinking we could have one functional test that actually runs promtool on a grafana-agent unit, but that would be more confusing to see in the functional tests.

My second suggestion is to provide a simple md file describing how these tests work and how they can be used, along with the actual yaml file containing those tests. Something like this:

tests
├── alert-rules
│   ├── README.md
│   └── tests.yaml
├── integration
...

What do you say?
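The README.md in that layout could be minimal; a sketch with assumed contents:

```markdown
# Alert rule tests

These tests follow the upstream Prometheus "Unit testing rules" documentation
and run the alert rules against synthetic input series.

To run them locally (promtool ships with Prometheus):

    promtool test rules tests/alert-rules/tests.yaml
```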