Closed eugene-chow closed 5 years ago
Look into regulating the alerts so that you don't get alert fatigue
We are currently using two methods to regulate the alerts. The first is to configure the `for` clause in Prometheus alerting rules, which defines how long a condition must hold before an alert transitions from pending to firing.
The next approach is to configure Alertmanager routing, where three options control how notifications are batched and throttled: `group_wait`, `group_interval` and `repeat_interval`.
Here's our current Alertmanager configuration:
```yaml
global:
  slack_api_url: 'https://hooks.slack.com/services/<token>'
route:
  receiver: 'slack-notifications'
  group_by: [alertname, datacenter, app]
  group_wait: 5m
  group_interval: 10m
receivers:
- name: 'slack-notifications'
  slack_configs:
  - channel: '#alerts'
    send_resolved: true
    api_url: https://hooks.slack.com/services/<token>
    username: '{{ template "slack.default.username" . }}'
    color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
    title: '{{ template "slack.default.title" . }}'
    title_link: '{{ template "slack.default.titlelink" . }}'
    text: >-
      {{ range .Alerts }}
        *Alert:* {{ .Labels.alertname }} - `{{ .Labels.severity }}`
        *Description:* {{ .Annotations.message }}
        *Runbook:* <{{ .Annotations.runbook_url }}|:spiral_note_pad:>
        *Details:*
        {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
        {{ end }}
      {{ end }}
```
The intervals will be adjusted to optimum values as we continue to monitor the cluster.
> Checks that should be implemented to better ensure the reliability and uptime of your infra.
The Prometheus Operator stack ships with pre-configured rules that monitor the Kubernetes cluster itself, such as pods, nodes, etc. The stack also includes Grafana with dashboards for better visualisation of the Prometheus metrics. However, to monitor external sources, additional configuration is required for Prometheus to scrape their metrics.
Our configuration to monitor OPNsense can be found here. The main alerts on OPNsense cover instance status, CPU usage, memory usage and disk capacity.
```yaml
- alert: InstanceDown
  expr: up == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Instance {{ $labels.instance }} down"
    message: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."
- alert: CPUThresholdExceeded
  expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{job="firewall-node-exporter",mode="idle"}[5m])) * 100) > 90
  for: 3m
  labels:
    severity: warning
  annotations:
    summary: "Instance {{ $labels.instance }} CPU usage is dangerously high"
    message: "This device's CPU usage has exceeded the threshold of 90% with a value of {{ $value }} for 3 minutes."
- alert: MemoryUsageWarning
  expr: ((node_memory_size_bytes - (node_memory_free_bytes + node_memory_cache_bytes + node_memory_buffer_bytes)) / node_memory_size_bytes) * 100 > 80
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Instance {{ $labels.instance }} Memory usage is high"
    message: "This device's memory usage has exceeded the threshold of 80% with a value of {{ $value }} for 5 minutes."
- alert: DiskSpaceWarning
  expr: 100 - (node_filesystem_free_bytes * 100 / node_filesystem_size_bytes) > 75
  for: 60m
  labels:
    severity: warning
  annotations:
    summary: "Instance {{ $labels.instance }} Disk usage is high"
    message: "This device's disk usage has exceeded the threshold of 75% with a value of {{ $value }}."
```
> List the monitoring checks to implement in both external and internal monitoring tools
For internal checks, we focus on these aspects for the Kubernetes cluster, VMs and physical nodes.
Most monitoring is done with Prometheus, while the physical nodes are monitored through the Proxmox GUI.
With regard to external monitoring, we exposed an Nginx web server and used an external monitoring tool to monitor its health. If the web server goes down, the external monitoring tool will notify us and we can investigate further to determine whether the whole DC is down.
Among the external monitoring tools, we feel that Freshping is the most user-friendly and easy to use, and it lets us select the test location. Since Singapore is available as a test location, the checks are not blocked by our firewall settings. The downside of Freshping is that we currently can't monitor our sites over HTTPS due to its SSL certificate checks.
Here's a comparison of external monitoring tools:

| Monitoring Tool (Free) | Pros | Cons |
|---|---|---|
| Uptime Robot | Simple UI; easy to set up; prompt email notifications | Limited features and graphical UI |
| Freshping | User-friendly UI; check interval of 1 or 5 minutes; able to choose the test location for checks; 90 days of uptime history; can create a public status page for monitored sites; integrates with Slack, etc. | Unable to ignore SSL certificate checks |
| Cula.io | User-friendly UI; content checks available; integrates with Slack, etc. | Limited test locations; check interval fixed at 2 minutes |
| StatusCake | Easy to set up; integrates with multiple platforms such as Telegram, Ops Genie, Slack, etc. | Limited features in the free version; confusing UI |

| Monitoring Tool (Paid) | Pros | Cons |
|---|---|---|
| Pingdom | User-friendly UI; visitor insights; multiple features such as speed tests; can schedule planned maintenance to pause monitoring; wide range of check intervals | Test locations are only categorised by region |
| Uptimia | Wide range of test locations; generates reports to track the overall health of monitors; root-cause analysis available | Minimum of 10 test locations required |
| Supermonitoring | Easy to set up; integrates with the Google Analytics panel | Limited features; unable to select test location |
| Monitis | Flexible custom dashboards; multiple chart types available; charts exportable in PDF and CSV format | Intermediate-level knowledge required |
| Binary Canary | Escalation profiles available on monitor failure; easy to set up | Limited charts for monitoring; test locations only in the USA and Ireland |
> Tie up Prometheus with the Incident Response tool and other monitoring tools (big hint here) to ensure that incidents are responded to in a timely fashion.
We have tied Prometheus up with OpsGenie as our incident response tool. For this test, we deployed an Nginx web server and also monitored it with an external monitoring tool. We modified the deployment file to crash the pods, and the external monitoring tool promptly picked up that the web server was down. Within a few minutes, Prometheus fired the alert and it was sent to OpsGenie, which was also integrated with Slack.

Subsequently, we brought the Nginx web server up again and the recovery was promptly reflected in the external monitoring tool, as well as in OpsGenie and Slack. With regard to this, may I ask if I should continue to send alerts to Slack with Alertmanager, or should I just let OpsGenie forward the alerts to Slack?
Sorry for the late reply. Good work there ;) There you have it: 360-degree monitoring.
> The main alerts on OPNsense are status of instance, CPU usage, Memory usage and Disk capacity.

Did you verify that the alerts work by running, let's say, the `stress` tool?
> physical nodes are monitored with Proxmox GUI

Install `node_exporter` in Proxmox. Proxmox is a critical resource, so it's not advisable to rely on eyeballing to monitor it.
> The downside of Freshping is that currently, we can't monitor our sites with the HTTPS protocol due to the SSL certificate checks.

You can get free certs from Let's Encrypt that should be recognised by Freshping.
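For reference, once a Let's Encrypt certificate has been issued (e.g. via certbot), serving the site over HTTPS is a small Nginx change. This is a minimal sketch; the domain `example.com` and the certificate paths are assumptions based on certbot's default layout:

```nginx
server {
    listen 443 ssl;
    server_name example.com;  # hypothetical domain

    # Default paths used by certbot for Let's Encrypt certificates
    ssl_certificate     /etc/letsencrypt/live/example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/example.com/privkey.pem;

    location / {
        root /usr/share/nginx/html;
    }
}
```

With a publicly trusted certificate in place, Freshping's HTTPS checks should pass without needing to ignore certificate errors.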
> deployed an Nginx web server and also monitored it with an external monitoring tool

This webserver is a pod in the k8s cluster? It's OK to do that, but it must have several replicas. Be aware that putting it there comes with the risk of it going down during k8s upgrades, which will result in a barrage of CRITICAL alerts and OpsGenie calling everyone in the team and filing a ticket.
This can be avoided by configuring a maintenance window when performing an upgrade or a task that may take down this "CRITICAL" piece of the infra.
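To sketch the Alertmanager-side alternative: an ad-hoc silence (created via the web UI or `amtool`) covers a one-off upgrade, while recent Alertmanager releases (v0.24+) also support recurring mute windows in the config itself. The interval name, the `app` label and the times below are illustrative assumptions:

```yaml
route:
  receiver: 'slack-notifications'
  routes:
  - match:
      app: nginx                          # hypothetical label on the webserver's alerts
    receiver: 'slack-notifications'
    mute_time_intervals: ['upgrade-window']

time_intervals:                           # Alertmanager v0.24+ syntax
- name: upgrade-window
  time_intervals:
  - weekdays: ['saturday']
    times:
    - start_time: '02:00'
      end_time: '04:00'
```

While the interval is active, matching notifications are suppressed but the alerts remain visible in the Alertmanager UI.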
> should continue to send alerts to Slack with Alertmanager, or should I just let OpsGenie forward the alert to Slack?

It depends. One approach can be:
- Send WARNING alerts only to Slack
- Send CRITICAL alerts to OpsGenie, which then sends to Slack

This way, for WARNING alerts you won't have OpsGenie calling you incessantly during the wee hours; OpsGenie will only bug you if the alert is CRITICAL. Configuring the monitoring system takes quite some fine-tuning to get it right.
> Did you verify that the alerts work by running, let's say, the stress tool?

I did test the alert by simulating high CPU usage with the commands `yes > /dev/null` and `for i in 1 2; do while : ; do : ; done & done`, where `i` is the number of cores to simulate. As suggested, I have also used the `stress` tool and the alerts are all working fine.
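The two ad-hoc loops can also be wrapped into a small POSIX-shell sketch that burns a configurable number of cores for a fixed duration and then exits cleanly (assumes coreutils `timeout` is available; the `CORES`/`DURATION` defaults are illustrative):

```shell
# Burn CORES CPU cores for DURATION seconds, then exit cleanly.
# Defaults are small for demonstration; set DURATION to e.g. 240
# to outlast the CPU alert's "for: 3m" clause.
CORES="${CORES:-2}"
DURATION="${DURATION:-3}"

i=0
while [ "$i" -lt "$CORES" ]; do
  # timeout(1) kills each busy loop after DURATION seconds
  timeout "$DURATION" sh -c 'while :; do :; done' &
  i=$((i + 1))
done

wait  # returns once all busy loops have been killed by timeout
echo "load test finished: burned $CORES core(s) for ${DURATION}s"
```

Unlike the bare `for` loop, this cleans up its background jobs on its own, so there is no risk of leaving runaway busy loops behind after the test.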
> Install node_exporter in Proxmox. Proxmox is a critical resource so it's not advisable to rely on eyeballing to monitor it.

Noted on this. `node_exporter` has been installed on all the Proxmox nodes, and its metrics are exposed to Prometheus in the k8s cluster.
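For anyone repeating this, a minimal systemd unit is enough to keep `node_exporter` running on each Proxmox node. This is a sketch; the binary path and the dedicated user are assumptions:

```ini
# /etc/systemd/system/node_exporter.service
[Unit]
Description=Prometheus Node Exporter
After=network-online.target

[Service]
User=node_exporter                      # hypothetical dedicated user
ExecStart=/usr/local/bin/node_exporter  # assumed install path
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

After `systemctl enable --now node_exporter`, Prometheus in the k8s cluster can scrape each node on port 9100, the exporter's default.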
> This webserver is a pod in the k8s cluster? It's ok to do that but it must have several replicas. Be aware that putting it there comes with a risk of it going down during k8s upgrades which will result in a barrage of CRITICAL alerts and OpsGenie calling everyone in the team and filing a ticket. This can be avoided by configuring a maintenance window when performing an upgrade or a task that may take down this "CRITICAL" piece of the infra.

Yes, the webserver is a pod in the k8s cluster. We will increase the number of replicas of the webserver. On OpsGenie, it seems that only the enterprise plan has the maintenance function. In the meantime, I am also looking into other incident management tools.
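For the replica change, the relevant part of the Deployment manifest is just the `replicas` field. A sketch, with the deployment name and labels assumed:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx          # hypothetical deployment name
spec:
  replicas: 3          # several replicas so a single pod going down is not an outage
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:stable
        ports:
        - containerPort: 80
```

With multiple replicas, a rolling k8s upgrade can drain nodes one at a time without taking the webserver fully down.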
> It depends. One approach can be: send WARNING alerts only to Slack; send CRITICAL alerts to OpsGenie, which then sends to Slack.

Noted on this. Configured as suggested.
```yaml
global:
  slack_api_url: 'https://hooks.slack.com/services/<token>'
route:
  receiver: 'slack-notifications'
  group_by: [alertname, datacenter, app]
  group_wait: 5m
  group_interval: 10m
  routes:
  - match:
      severity: critical
    receiver: opsgenie-notifications
receivers:
- name: 'slack-notifications'
  slack_configs:
  - channel: '#alerts'
    send_resolved: true
    api_url: https://hooks.slack.com/services/<token>
    username: '{{ template "slack.default.username" . }}'
    color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
    title: '{{ template "slack.default.title" . }}'
    title_link: '{{ template "slack.default.titlelink" . }}'
    text: >-
      {{ range .Alerts }}
        *Alert:* {{ .Labels.alertname }} - `{{ .Labels.severity }}`
        *Description:* {{ .Annotations.message }}
        *Runbook:* <{{ .Annotations.runbook_url }}|:spiral_note_pad:>
        *Details:*
        {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
        {{ end }}
      {{ end }}
- name: 'opsgenie-notifications'
  opsgenie_configs:
  - api_key: <token>
    teams: JayCloud
```
Good work. Thanks!
This is an on-going task until the infra is deemed to be sufficiently well-monitored.
Tasks
Checks that should be implemented to better ensure the reliability and uptime of your infra.