Closed rgildein closed 1 year ago
These alert rules require the following collectors to be enabled:
enable_collectors:
  - logind
  - systemd
Moves the remaining NRPE checks over from charm-nrpe.
Tested with
rule_files:
  - useful.rules

evaluation_interval: 1m

tests:
  # systemd scopes
  - interval: 1m
    input_series:
      # represent all *.scope units in failed state
      - series: node_systemd_unit_state{instance="test-model_1234_test-app_test-app/0", name="failed-units.scope", state="failed"}
        values: '0x5 20 30 40 50 60'
      # represent all *.scope units in active state
      - series: node_systemd_unit_state{instance="test-model_1234_test-app_test-app/0", name="active-units.scope", state="active"}
        values: '100x5 80 70 60 50 40'
      # represent all !*.scope units
      - series: node_systemd_unit_state{instance="test-model_1234_test-app_test-app/0", name="all-other-units.socket", state="failed"}
        values: '200x10'
    alert_rule_test:
      - eval_time: 5m
        alertname: HostSystemdFailedScopes
        exp_alerts: []  # no alert
      - eval_time: 6m
        alertname: HostSystemdFailedScopes
        exp_alerts: []  # no alert
      - eval_time: 7m
        alertname: HostSystemdFailedScopes
        exp_alerts: []  # no alert
      - eval_time: 8m
        alertname: HostSystemdFailedScopes
        exp_alerts: []  # no alert
      - eval_time: 9m
        alertname: HostSystemdFailedScopes
        exp_alerts:
          - exp_labels:
              severity: critical
              instance: test-model_1234_test-app_test-app/0
            exp_annotations:
              summary: Host has 50 systemd scopes in failed state (instance test-model_1234_test-app_test-app/0)
              description: >-
                Host has 50 systemd scopes in failed state.
                VALUE = 50
                LABELS = map[instance:test-model_1234_test-app_test-app/0]
      - eval_time: 10m
        alertname: HostSystemdFailedScopes
        exp_alerts:
          - exp_labels:
              severity: critical
              instance: test-model_1234_test-app_test-app/0
            exp_annotations:
              summary: Host has 60 systemd scopes in failed state (instance test-model_1234_test-app_test-app/0)
              description: >-
                Host has 60 systemd scopes in failed state.
                VALUE = 60
                LABELS = map[instance:test-model_1234_test-app_test-app/0]

  # logged users
  - interval: 1m
    input_series:
      - series: 'node_logind_sessions{instance="test-model_1234_test-app_test-app/0", class="user", remote="true", type="mir"}'
        values: '0x10'
      - series: 'node_logind_sessions{instance="test-model_1234_test-app_test-app/0", class="user", remote="true", type="tty"}'
        values: '10x5 100x5'
      - series: 'node_logind_sessions{instance="test-model_1234_test-app_test-app/0", class="user", remote="false", seat="seat0", type="tty"}'
        values: '0x8 20x2'
      - series: 'node_logind_sessions{instance="test-model_1234_test-app_test-app/0", class="other", remote="false", seat="seat0", type="tty"}'
        values: '100x10'
    alert_rule_test:
      - eval_time: 3m
        alertname: HostLoggedInUsers
        exp_alerts: []  # no alert
      - eval_time: 6m
        alertname: HostLoggedInUsers
        exp_alerts: []  # no alert
      - eval_time: 7m
        alertname: HostLoggedInUsers
        exp_alerts: []  # no alert
      - eval_time: 8m
        alertname: HostLoggedInUsers
        exp_alerts:
          - exp_labels:
              severity: warning
              instance: test-model_1234_test-app_test-app/0
            exp_annotations:
              summary: Host has 100 users logged-in (instance test-model_1234_test-app_test-app/0)
              description: >-
                Host has 100 users logged-in.
                VALUE = 100
                LABELS = map[instance:test-model_1234_test-app_test-app/0]
      - eval_time: 9m
        alertname: HostLoggedInUsers
        exp_alerts:
          - exp_labels:
              severity: warning
              instance: test-model_1234_test-app_test-app/0
            exp_annotations:
              summary: Host has 120 users logged-in (instance test-model_1234_test-app_test-app/0)
              description: >-
                Host has 120 users logged-in.
                VALUE = 120
                LABELS = map[instance:test-model_1234_test-app_test-app/0]
and promtool
prometheus_alert_rules git:(nrpe/rest-alert-rules) $ promtool test rules ./test_useful.yaml
Unit Testing:  ./test_useful.yaml
  SUCCESS
[0.06s]
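For readers who do not have the rule file open, the expectations in the test above are consistent with rules shaped roughly like the sketch below. It is inferred from the test data only and is not the content of `useful.rules` in this PR: the group name, the `>= 50` and `>= 100` thresholds, the `for: 2m` duration, and the `name=~".+\\.scope"` matcher are all assumptions chosen to match the listed firing times.

```yaml
groups:
  - name: host-useful-sketch  # hypothetical group name
    rules:
      - alert: HostSystemdFailedScopes
        # Assumed: count only *.scope units in the failed state, per instance.
        # The threshold of 50 is a guess that fits "no alert at 8m (40), alert at 9m (50)".
        expr: sum by (instance) (node_systemd_unit_state{name=~".+\\.scope", state="failed"}) >= 50
        labels:
          severity: critical
        annotations:
          summary: Host has {{ $value }} systemd scopes in failed state (instance {{ $labels.instance }})
          description: >-
            Host has {{ $value }} systemd scopes in failed state.
            VALUE = {{ $value }}
            LABELS = {{ $labels }}

      - alert: HostLoggedInUsers
        # Assumed: only class="user" sessions count, summed per instance.
        # A 2m hold fits "value reaches 100 at 6m, alert first fires at 8m".
        expr: sum by (instance) (node_logind_sessions{class="user"}) >= 100
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: Host has {{ $value }} users logged-in (instance {{ $labels.instance }})
          description: >-
            Host has {{ $value }} users logged-in.
            VALUE = {{ $value }}
            LABELS = {{ $labels }}
```

Other combinations would fit the same test data, for example a lower scope threshold paired with a `for` clause, so the actual rule file remains the authority on the real values.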
Ready to merge once https://github.com/canonical/grafana-agent-k8s-operator/pull/202 makes it into main. It also looks like this will need a rebase at that point, @rgildein.
@simskij rebased