Closed jneo8 closed 1 year ago
Add nvme prometheus alert rules
Tested with
    rule_files:
      - ./nvme.rule
    evaluation_interval: 1m
    tests:
      - interval: 1m
        input_series:
          - series: node_hwmon_temp_alarm{chip="nvme_nvme1", sensor="temp1"}
            values: 1
        alert_rule_test:
          - eval_time: 1m
            alertname: NvmeHwmonTempAlarm
            exp_alerts:
              - exp_labels:
                  alertname: NvmeHwmonTempAlarm
                  chip: "nvme_nvme1"
                  severity: warning
                  sensor: "temp1"
                exp_annotations:
                  summary: Chip nvme_nvme1 throw a temperature alarm 1
      - interval: 1m
        input_series:
          - series: node_filesystem_avail_bytes{device="/dev/nvme1n1p2",fstype="ext4",mountpoint="/"}
            values: 19
          - series: node_filesystem_size_bytes{device="/dev/nvme1n1p2",fstype="ext4",mountpoint="/"}
            values: 100
        alert_rule_test:
          - eval_time: 5m
            alertname: FileSystemPercentUsedWarn
            exp_alerts:
              - exp_labels:
                  alertname: FileSystemPercentUsedWarn
                  severity: warning
                  device: "/dev/nvme1n1p2"
                  fstype: "ext4"
                  mountpoint: "/"
                exp_annotations:
                  summary: Available disk on / is too low
                  description: Available disk percentage on mountpoint(/) 19 is < 20%
      - interval: 1m
        input_series:
          - series: node_filesystem_avail_bytes{device="/dev/nvme1n1p2",fstype="ext4",mountpoint="/boot/efi"}
            values: 5
          - series: node_filesystem_size_bytes{device="/dev/nvme1n1p2",fstype="ext4",mountpoint="/boot/efi"}
            values: 100
        alert_rule_test:
          - eval_time: 5m
            alertname: FileSystemPercentUsedCrit
            exp_alerts:
              - exp_labels:
                  alertname: FileSystemPercentUsedCrit
                  severity: critical
                  device: "/dev/nvme1n1p2"
                  fstype: "ext4"
                  mountpoint: "/boot/efi"
                exp_annotations:
                  summary: Available disk on /boot/efi is too low
                  description: Available disk percentage on mountpoint(/boot/efi) 5 is < 10%
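The `./nvme.rule` file itself is not shown in this PR. For readers who want to follow the test, here is a rough sketch of rules that would be consistent with the expected labels and annotations above. The group name, the `chip=~"nvme.*"` matcher, the absence of a `for:` clause, and the exact expressions are all assumptions made for illustration, not the rules from the PR.

```yaml
# Hypothetical nvme.rule, reconstructed from the test expectations only.
groups:
  - name: nvme-alerts  # assumed group name
    rules:
      - alert: NvmeHwmonTempAlarm
        # Assumption: only hwmon chips whose name starts with "nvme" are matched.
        expr: node_hwmon_temp_alarm{chip=~"nvme.*"} == 1
        labels:
          severity: warning
        annotations:
          summary: "Chip {{ $labels.chip }} throw a temperature alarm {{ $value }}"

      - alert: FileSystemPercentUsedWarn
        # Percentage of available space; fires below 20%.
        expr: 100 * node_filesystem_avail_bytes / node_filesystem_size_bytes < 20
        labels:
          severity: warning
        annotations:
          summary: "Available disk on {{ $labels.mountpoint }} is too low"
          description: "Available disk percentage on mountpoint({{ $labels.mountpoint }}) {{ $value }} is < 20%"

      - alert: FileSystemPercentUsedCrit
        # Same expression with a stricter threshold; fires below 10%.
        expr: 100 * node_filesystem_avail_bytes / node_filesystem_size_bytes < 10
        labels:
          severity: critical
        annotations:
          summary: "Available disk on {{ $labels.mountpoint }} is too low"
          description: "Available disk percentage on mountpoint({{ $labels.mountpoint }}) {{ $value }} is < 10%"
```

A test file like the one above is normally run with `promtool test rules`, passing the path to the test file. Note that the filesystem expressions in this sketch do not filter on NVMe devices at all, so they would fire for any filesystem.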
I am going to close this PR and create two separate PRs, one for hwmon and one for the filesystem. These metrics are not specific to NVMe; they relate to NVMe only when the chip is an NVMe device.