CloudCommandos / infra


Identify services and endpoints to monitor #3

Closed · eugene-chow closed this issue 5 years ago

eugene-chow commented 5 years ago

This is an on-going task until the infra is deemed to be sufficiently well-monitored.

Tasks

  - Checks that should be implemented to better ensure the reliability and uptime of your infra.
  - Look into regulating the alerts so that you don't get alert fatigue.
  - List the monitoring checks to implement in both external and internal monitoring tools.
  - Tie up Prometheus with the Incident Response tool and other monitoring tools (big hint here) to ensure that incidents are responded to in a timely fashion.

youngee91 commented 5 years ago

Look into regulating the alerts so that you don't get alert fatigue

We are currently using two methods to regulate the alerts. The first approach is to configure the Prometheus alerting rules, where the for clause defines how long a condition must hold before an alert changes from pending to firing.
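For illustration, here is a minimal, hedged rule sketch (the alert name, expression and threshold are placeholders, not our actual rules; our real OPNsense rules are shown further below):

groups:
  - name: example.rules            # placeholder group name
    rules:
      - alert: HighLoadExample     # placeholder alert
        expr: node_load1 > 4       # placeholder expression and threshold
        for: 5m                    # condition must hold for 5m (pending) before the alert becomes firing
        labels:
          severity: warning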

The next approach is to configure Alertmanager's notification grouping. There are three options available: group_by, group_wait and group_interval.

Here's our current Alertmanager configuration:

global:
  slack_api_url: 'https://hooks.slack.com/services/<token>'
route:
  receiver: 'slack-notifications'
  group_by: [alertname, datacenter, app]
  group_wait: 5m        # wait this long to buffer alerts of the same group before sending the first notification
  group_interval: 10m   # wait this long before notifying about new alerts added to an already-notified group
receivers:
- name: 'slack-notifications'
  slack_configs:
    - channel: '#alerts'
      send_resolved: true
      api_url: https://hooks.slack.com/services/<token>
      username: '{{ template "slack.default.username" . }}'
      color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
      title: '{{ template "slack.default.title" . }}'
      title_link: '{{ template "slack.default.titlelink" . }}'
      text: >-
        {{ range .Alerts }}
          *Alert:* {{ .Labels.alertname }} - `{{ .Labels.severity }}`
          *Description:* {{ .Annotations.message }}
          *Runbook:* <{{ .Annotations.runbook_url }}|:spiral_note_pad:>
          *Details:*
          {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
          {{ end }}
        {{ end }}
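Another knob that can help with alert fatigue (not set in the config above, so the value here is just a placeholder) is repeat_interval, which controls how long Alertmanager waits before re-sending a notification for an alert that is still firing:

route:
  receiver: 'slack-notifications'
  repeat_interval: 4h    # placeholder: re-notify at most every 4 hours while the alert keeps firing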

The intervals will be adjusted to optimal values as we continue to monitor the cluster.

Checks that should be implemented to better ensure the reliability and uptime of your infra.

The Prometheus Operator comes with pre-configured rules for monitoring the Kubernetes cluster, such as pods, nodes etc. The stack also includes Grafana with dashboards for better visualisation of the Prometheus metrics. However, to monitor external sources, additional configuration is required for Prometheus to scrape their metrics.
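As a rough sketch of what such additional configuration can look like (assuming the scrape job is injected through the Prometheus Operator's additionalScrapeConfigs mechanism; the target address is a placeholder):

- job_name: 'firewall-node-exporter'   # matches the job label used in the alert rules below
  scrape_interval: 30s                 # placeholder interval
  static_configs:
    - targets: ['192.0.2.1:9100']      # placeholder OPNsense node_exporter address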

Our configuration to monitor OPNsense can be found here. The main alerts on OPNsense are status of instance, CPU usage, Memory usage and Disk capacity.

  - alert: InstanceDown
    expr: up == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      message: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minutes."
  - alert: CPUThresholdExceeded
    expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{job="firewall-node-exporter",mode="idle"}[5m])) * 100) > 90
    for: 3m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} CPU usage is dangerously high"
      message: "This device's CPU usage has exceeded the thresold of 90% with a value of {{ $value }} for 3 minutes."
  - alert: MemoryUsageWarning
    expr:  ((node_memory_size_bytes - (node_memory_free_bytes + node_memory_cache_bytes + node_memory_buffer_bytes) ) / node_memory_size_bytes) * 100  > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} Memory usage is high"
      message: "This device's Memory usage has exceeded the thresold of 80% with a value of {{ $value }} for 5 minutes."
  - alert: DiskSpaceWarning
    expr:  100 -  ((node_filesystem_free_bytes  * 100 / node_filesystem_size_bytes))  > 75
    for: 60m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} Disk usage is high"
      message: "This device's Disk usage has exceeded the thresold of 75% with a value of {{ $value }}."
youngee91 commented 5 years ago

List the monitoring checks to implement in both external and internal monitoring tools

For internal checks, we focus on the Kubernetes cluster, the VMs and the physical nodes.

Most of the monitoring is done with Prometheus, while the physical nodes are monitored with the Proxmox GUI.

With regards to external monitoring, we exposed an Nginx web server and used an external monitoring tool to monitor its health. If the web server goes down, the external monitor notifies us and we can investigate further, for example whether the whole DC is down.

Among the external monitoring tools, we feel that Freshping is the most user-friendly and easy to use, and it lets us choose the test location. Since Singapore is available as a test location, the checks are not blocked by our firewall rules. The downside of Freshping is that currently, we can't monitor our sites with the HTTPS protocol due to the SSL certificate checks.

Here's a comparison of external monitoring tools.

Monitoring Tools (Free)

Uptime Robot
  Pros: Simple UI; easy to set up; prompt email notifications.
  Cons: Limited features and graphical UI.

Freshping
  Pros: User-friendly UI; check interval of 1 or 5 minutes; able to choose the test location; 90 days of historical uptime; able to create a public status page for monitored sites; integrates with Slack etc.
  Cons: Unable to ignore SSL certificate checks.

Cula.io
  Pros: User-friendly UI; content check available; integrates with Slack etc.
  Cons: Limited test locations; fixed check interval of 2 minutes.

StatusCake
  Pros: Easy to set up; integrates with multiple platforms such as Telegram, OpsGenie, Slack etc.
  Cons: Limited features in the free version; confusing UI.

Monitoring Tools (Paid)

Pingdom
  Pros: User-friendly UI; visitor insights; multiple features such as speed tests; able to schedule planned maintenance to pause monitoring; wide range of check intervals available.
  Cons: Test locations are categorised by region.

Uptimia
  Pros: Wide range of test locations available; generates reports to track the overall health of monitors; root cause analysis available.
  Cons: A minimum of 10 test locations is required.

Supermonitoring
  Pros: Easy to set up; able to integrate with the Google Analytics panel.
  Cons: Limited features; unable to select the test location.

Monitis
  Pros: Flexible, build-your-own dashboards; multiple graphical charts available; able to export charts in PDF and CSV format.
  Cons: Intermediate-level knowledge required.

Binary Canary
  Pros: Escalation profiles available in the event of monitor failure; easy to set up.
  Cons: Limited graphical charts for monitoring; test locations are only available in the USA and Ireland.
youngee91 commented 5 years ago

Tie up Prometheus with the Incident Response tool and other monitoring tools (big hint here) to ensure that incidents are responded to in a timely fashion.

We have tied up Prometheus with OpsGenie as our incident response tool. For this test, we deployed an Nginx web server and also monitored it with an external monitoring tool. We modified the deployment to make the pods crash, and the external monitoring tool promptly picked up that the web server was down. Within a few minutes, Prometheus fired the alert and it was sent to OpsGenie, which is also integrated with Slack.

Subsequently, we brought the Nginx web server up again and the recovery was promptly reflected on the external monitoring tool, as well as in OpsGenie and Slack. With regards to this, may I ask whether I should continue to send alerts to Slack with Alertmanager, or should I just let OpsGenie forward the alerts to Slack?

eugene-chow commented 5 years ago

Sorry for the late reply. Good work there ;) Now you have 360-degree monitoring.

The main alerts on OPNsense are status of instance, CPU usage, Memory usage and Disk capacity.

Did you verify that the alerts work by running, let's say, the stress tool?

physical nodes are monitored with Proxmox GUI

Install node_exporter in Proxmox. Proxmox is a critical resource so it's not advisable to rely on eyeballing to monitor it.

The downside of Freshping is that currently, we can't monitor our sites with the HTTPS protocol due to the SSL certificate checks.

You can get free certs from Let's Encrypt that should be recognised by Freshping.
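For example, if cert-manager is running in the cluster (just one possible way to automate this, not necessarily what you have set up), a Let's Encrypt ClusterIssuer could look roughly like this, with the email address as a placeholder:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory   # Let's Encrypt production endpoint
    email: ops@example.com                                   # placeholder contact email
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    solvers:
      - http01:
          ingress:
            class: nginx                                     # assumes an nginx ingress controller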

deployed an Nginx web server and also monitored it with an external monitoring tool

This webserver is a pod in the k8s cluster? It's ok to do that but it must have several replicas. Be aware that putting it there comes with a risk of it going down during k8s upgrades which will result in a barrage of CRITICAL alerts and OpsGenie calling everyone in the team and filing a ticket.

This can be avoided by configuring a maintenance window when performing an upgrade or a task that may take down this "CRITICAL" piece of the infra.

should continue to send alerts to Slack with Alertmanager, or should I just let OpsGenie forward the alert to Slack?

It depends. One approach can be:

  1. Send WARNING alerts only to Slack
  2. Send CRITICAL alerts to OpsGenie which then sends to Slack

For WARNING alerts, you don't want OpsGenie to call you incessantly during the wee hours. OpsGenie will only bug you if the alert is CRITICAL. Configuring the monitoring system takes quite some fine-tuning to get it right.

youngee91 commented 5 years ago

Did you verify that the alerts work by running, let's say, the stress tool?

I did test the alerts by simulating high CPU usage with the commands yes > /dev/null and for i in 1 2; do while : ; do : ; done & done, where the number of values in the loop list is the number of cores to load. As suggested, I have also used the stress tool, and the alerts are all working fine.

Install node_exporter in Proxmox. Proxmox is a critical resource so it's not advisable to rely on eyeballing to monitor it.

Noted on this. node_exporter has been installed on all the Proxmox nodes, and its metrics are exposed to Prometheus in the k8s cluster.
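As a rough sketch of one way to wire external node_exporters into the in-cluster Prometheus Operator (the namespace, labels and IP addresses are placeholders, and our actual setup may differ, e.g. it could use additionalScrapeConfigs instead):

apiVersion: v1
kind: Service
metadata:
  name: proxmox-node-exporter
  namespace: monitoring                # placeholder namespace
  labels:
    app: proxmox-node-exporter
spec:
  clusterIP: None                      # headless Service with no selector, pointing at external hosts
  ports:
    - name: metrics
      port: 9100
---
apiVersion: v1
kind: Endpoints
metadata:
  name: proxmox-node-exporter          # must match the Service name
  namespace: monitoring
subsets:
  - addresses:
      - ip: 192.0.2.11                 # placeholder Proxmox node IPs
      - ip: 192.0.2.12
    ports:
      - name: metrics
        port: 9100
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: proxmox-node-exporter
  namespace: monitoring
  labels:
    release: prometheus-operator       # placeholder; must match the serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: proxmox-node-exporter
  endpoints:
    - port: metrics
      interval: 30s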

This webserver is a pod in the k8s cluster? It's ok to do that but it must have several replicas. Be aware that putting it there comes with a risk of it going down during k8s upgrades which will result in a barrage of CRITICAL alerts and OpsGenie calling everyone in the team and filing a ticket. This can be avoided by configuring a maintenance window when performing an upgrade or a task that may take down this "CRITICAL" piece of the infra.

Yes, the webserver is a pod in the k8s cluster. We will increase the number of replicas of the webserver; a rough sketch is shown below. On OpsGenie, it seems like only the Enterprise plan has the maintenance window function. In the meantime, I am also looking into other incident management tools.
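For reference, a minimal sketch of what the webserver Deployment with several replicas could look like (the name, image tag and replica count are placeholders, not our exact manifest):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-monitor-target           # placeholder name
spec:
  replicas: 3                          # several replicas so a single pod failure doesn't take the site down
  selector:
    matchLabels:
      app: nginx-monitor-target
  template:
    metadata:
      labels:
        app: nginx-monitor-target
    spec:
      containers:
        - name: nginx
          image: nginx:1.17            # placeholder pinned version
          ports:
            - containerPort: 80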

It depends. One approach can be: (1) send WARNING alerts only to Slack; (2) send CRITICAL alerts to OpsGenie which then sends to Slack.

Noted on this. Configured as suggested.

global:
  slack_api_url: 'https://hooks.slack.com/services/<token>'
route:
  receiver: 'slack-notifications'
  group_by: [alertname, datacenter, app]
  group_wait: 5m
  group_interval: 10m
  routes:
    # CRITICAL alerts are routed to OpsGenie, which then forwards them to Slack
    - match:
        severity: critical
      receiver: opsgenie-notifications
receivers:
- name: 'slack-notifications'
  slack_configs:
    - channel: '#alerts'
      send_resolved: true
      api_url: https://hooks.slack.com/services/<token>
      username: '{{ template "slack.default.username" . }}'
      color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
      title: '{{ template "slack.default.title" . }}'
      title_link: '{{ template "slack.default.titlelink" . }}'
      text: >-
        {{ range .Alerts }}
          *Alert:* {{ .Labels.alertname }} - `{{ .Labels.severity }}`
          *Description:* {{ .Annotations.message }}
          *Runbook:* <{{ .Annotations.runbook_url }}|:spiral_note_pad:>
          *Details:*
          {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
          {{ end }}
        {{ end }}
- name: 'opsgenie-notifications'
  opsgenie_configs:
    - api_key: <token>
      teams: JayCloud
eugene-chow commented 5 years ago

Good work. Thanks!