canonical / grafana-agent-k8s-operator

https://charmhub.io/grafana-agent-k8s
Apache License 2.0

Add other useful alert rule based on NRPE #197

Closed rgildein closed 1 year ago

rgildein commented 1 year ago

These alert rules require the following collectors to be enabled:

    enable_collectors:
      - logind
      - systemd

Context

Moving the rest of the NRPE checks from charm-nrpe.

Testing Instructions

Tested with

rule_files:
  - useful.rules

evaluation_interval: 1m

tests:
  # systemd scopes
  - interval: 1m
    input_series:
      # represent all *.scope units in failed state
      - series: node_systemd_unit_state{instance="test-model_1234_test-app_test-app/0", name="failed-units.scope", state="failed"}
        values: '0x5 20 30 40 50 60'
      # represent all *.scope units in active state
      - series: node_systemd_unit_state{instance="test-model_1234_test-app_test-app/0", name="active-units.scope", state="active"}
        values: '100x5 80 70 60 50 40'
      # represent all !*.scope units
      - series: node_systemd_unit_state{instance="test-model_1234_test-app_test-app/0", name="all-other-units.socket", state="failed"}
        values: '200x10'
    alert_rule_test:
      - eval_time: 5m
        alertname: HostSystemdFailedScopes
        exp_alerts: []  # no alert
      - eval_time: 6m
        alertname: HostSystemdFailedScopes
        exp_alerts: []  # no alert
      - eval_time: 7m
        alertname: HostSystemdFailedScopes
        exp_alerts: []  # no alert
      - eval_time: 8m
        alertname: HostSystemdFailedScopes
        exp_alerts: []  # no alert
      - eval_time: 9m
        alertname: HostSystemdFailedScopes
        exp_alerts:
          - exp_labels:
              severity: critical
              instance: test-model_1234_test-app_test-app/0
            exp_annotations:
              summary:  Host has 50 systemd scopes in failed state (instance test-model_1234_test-app_test-app/0)
              description: >-
                Host has 50 systemd scopes in failed state.
                  VALUE = 50
                  LABELS = map[instance:test-model_1234_test-app_test-app/0]
      - eval_time: 10m
        alertname: HostSystemdFailedScopes
        exp_alerts:
          - exp_labels:
              severity: critical
              instance: test-model_1234_test-app_test-app/0
            exp_annotations:
              summary:  Host has 60 systemd scopes in failed state (instance test-model_1234_test-app_test-app/0)
              description: >-
                Host has 60 systemd scopes in failed state.
                  VALUE = 60
                  LABELS = map[instance:test-model_1234_test-app_test-app/0]

  # logged users
  - interval: 1m
    input_series:
      - series: 'node_logind_sessions{instance="test-model_1234_test-app_test-app/0", class="user", remote="true", type="mir"}'
        values: '0x10'
      - series: 'node_logind_sessions{instance="test-model_1234_test-app_test-app/0", class="user", remote="true", type="tty"}'
        values: '10x5 100x5'
      - series: 'node_logind_sessions{instance="test-model_1234_test-app_test-app/0", class="user", remote="false", seat="seat0", type="tty"}'
        values: '0x8 20x2'
      - series: 'node_logind_sessions{instance="test-model_1234_test-app_test-app/0", class="other", remote="false", seat="seat0", type="tty"}'
        values: '100x10'
    alert_rule_test:
      - eval_time: 3m
        alertname: HostLoggedInUsers
        exp_alerts: []  # no alert
      - eval_time: 6m
        alertname: HostLoggedInUsers
        exp_alerts: []  # no alert
      - eval_time: 7m
        alertname: HostLoggedInUsers
        exp_alerts: []  # no alert
      - eval_time: 8m
        alertname: HostLoggedInUsers
        exp_alerts:
          - exp_labels:
              severity: warning
              instance: test-model_1234_test-app_test-app/0
            exp_annotations:
              summary: Host has 100 users logged-in (instance test-model_1234_test-app_test-app/0)
              description: >-
                Host has 100 users logged-in.
                  VALUE = 100
                  LABELS = map[instance:test-model_1234_test-app_test-app/0]
      - eval_time: 9m
        alertname: HostLoggedInUsers
        exp_alerts:
          - exp_labels:
              severity: warning
              instance: test-model_1234_test-app_test-app/0
            exp_annotations:
              summary: Host has 120 users logged-in (instance test-model_1234_test-app_test-app/0)
              description: >-
                Host has 120 users logged-in.
                  VALUE = 120
                  LABELS = map[instance:test-model_1234_test-app_test-app/0]

and promtool

x1:➜  prometheus_alert_rules git:(nrpe/rest-alert-rules) ✗ promtool test rules ./test_useful.yaml
Unit Testing:  ./test_useful.yaml
  SUCCESS
                                                                                                  [0.06s]
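
For reviewers checking the eval_time values against the input series, here is a minimal Python sketch of how the promtool expanding notation in `values` unrolls into one sample per interval. The helper name `expand_values` is hypothetical and not part of promtool; it handles only plain numbers and the `axn`, `a+bxn` and `a-bxn` forms, not `_` or `stale`.

```python
import re

# Matches 'a+bxn', 'a-bxn' and the shorthand 'axn' (same as 'a+0xn').
_EXPANDING = re.compile(r"^(-?\d+(?:\.\d+)?)(?:([+-])(\d+(?:\.\d+)?))?x(\d+)$")


def expand_values(values: str) -> list[float]:
    """Unroll a promtool `values` string into a flat list of samples.

    Hypothetical helper for illustration: 'axn' yields n + 1 samples,
    the initial value followed by n further samples.
    """
    samples: list[float] = []
    for token in values.split():
        match = _EXPANDING.match(token)
        if match is None:
            # Plain literal such as '20'.
            samples.append(float(token))
            continue
        start, sign, step, count = match.groups()
        delta = float(step) if step else 0.0
        if sign == "-":
            delta = -delta
        samples.extend(float(start) + i * delta for i in range(int(count) + 1))
    return samples


# Series taken from the tests above; with `interval: 1m`, index i is minute i.
failed_scopes = expand_values("0x5 20 30 40 50 60")
remote_tty = expand_values("10x5 100x5")
local_tty = expand_values("0x8 20x2")

for minute in (8, 9, 10):
    print(f"{minute}m: failed scopes = {failed_scopes[minute]}")

for minute in (8, 9):
    print(f"{minute}m: remote + local tty sessions = {remote_tty[minute] + local_tty[minute]}")
```

Under this reading, the failed-scopes series holds 50 at 9m and 60 at 10m, and the two non-zero `class="user"` session series add up to 100 at 8m and 120 at 9m, which matches the values in the expected annotations.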

Release Notes

simskij commented 1 year ago

Ready to merge once https://github.com/canonical/grafana-agent-k8s-operator/pull/202 makes it into main. Also seems like it will require a rebase at that point, @rgildein. 👍🏼

rgildein commented 1 year ago

@simskij rebased