CloudCommandos / infra


Identify services and endpoints to monitor #3

Closed · eugene-chow closed this issue 5 years ago

eugene-chow commented 5 years ago

This is an on-going task until the infra is deemed to be sufficiently well-monitored.

Tasks

  - Checks that should be implemented to better ensure the reliability and uptime of your infra.
  - Look into regulating the alerts so that you don't get alert fatigue.
  - List the monitoring checks to implement in both external and internal monitoring tools.
  - Tie up Prometheus with the Incident Response tool and other monitoring tools (big hint here) to ensure that incidents are responded to in a timely fashion.

youngee91 commented 5 years ago

Look into regulating the alerts so that you don't get alert fatigue

We are currently using two methods to regulate the alerts. The first approach is to configure the Prometheus alerting rules, where the for clause defines how long a condition must hold before an alert changes from pending to firing.
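For illustration, here is a minimal, hedged rule sketch (the alert name, expression and threshold are placeholders, not our actual rules; our real OPNsense rules are shown further below):

groups:
  - name: example.rules            # placeholder group name
    rules:
      - alert: HighLoadExample     # placeholder alert
        expr: node_load1 > 4       # placeholder expression and threshold
        for: 5m                    # condition must hold for 5m (pending) before the alert becomes firing
        labels:
          severity: warning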

The next approach is to configure Alertmanager's notification grouping. There are three options available: group_by, group_wait and group_interval.

Here's our current Alertmanager configuration:

global:
  slack_api_url: 'https://hooks.slack.com/services/<token>'
route:
  receiver: 'slack-notifications'
  group_by: [alertname, datacenter, app]
  group_wait: 5m        # wait this long to buffer alerts of the same group before sending the first notification
  group_interval: 10m   # wait this long before notifying about new alerts added to an already-notified group
receivers:
- name: 'slack-notifications'
  slack_configs:
    - channel: '#alerts'
      send_resolved: true
      api_url: https://hooks.slack.com/services/<token>
      username: '{{ template "slack.default.username" . }}'
      color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
      title: '{{ template "slack.default.title" . }}'
      title_link: '{{ template "slack.default.titlelink" . }}'
      text: >-
        {{ range .Alerts }}
          *Alert:* {{ .Labels.alertname }} - `{{ .Labels.severity }}`
          *Description:* {{ .Annotations.message }}
          *Runbook:* <{{ .Annotations.runbook_url }}|:spiral_note_pad:>
          *Details:*
          {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
          {{ end }}
        {{ end }}
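Another knob that can help with alert fatigue (not set in the config above, so the value here is just a placeholder) is repeat_interval, which controls how long Alertmanager waits before re-sending a notification for an alert that is still firing:

route:
  receiver: 'slack-notifications'
  repeat_interval: 4h    # placeholder: re-notify at most every 4 hours while the alert keeps firing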

The intervals will be adjusted to optimal values as we continue to monitor the cluster.

Checks that should be implemented to better ensure the reliability and uptime of your infra.

The Prometheus Operator comes with pre-configured rules for monitoring the Kubernetes cluster, such as pods, nodes etc. The stack also includes Grafana with dashboards for better visualisation of the Prometheus metrics. However, to monitor external sources, additional configuration is required for Prometheus to scrape their metrics.
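As a rough sketch of what such additional configuration can look like (assuming the scrape job is injected through the Prometheus Operator's additionalScrapeConfigs mechanism; the target address is a placeholder):

- job_name: 'firewall-node-exporter'   # matches the job label used in the alert rules below
  scrape_interval: 30s                 # placeholder interval
  static_configs:
    - targets: ['192.0.2.1:9100']      # placeholder OPNsense node_exporter address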

Our configuration to monitor OPNsense can be found here. The main alerts on OPNsense are status of instance, CPU usage, Memory usage and Disk capacity.

  - alert: InstanceDown
    expr: up == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      message: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minutes."
  - alert: CPUThresholdExceeded
    expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{job="firewall-node-exporter",mode="idle"}[5m])) * 100) > 90
    for: 3m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} CPU usage is dangerously high"
      message: "This device's CPU usage has exceeded the thresold of 90% with a value of {{ $value }} for 3 minutes."
  - alert: MemoryUsageWarning
    expr:  ((node_memory_size_bytes - (node_memory_free_bytes + node_memory_cache_bytes + node_memory_buffer_bytes) ) / node_memory_size_bytes) * 100  > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} Memory usage is high"
      message: "This device's Memory usage has exceeded the thresold of 80% with a value of {{ $value }} for 5 minutes."
  - alert: DiskSpaceWarning
    expr:  100 -  ((node_filesystem_free_bytes  * 100 / node_filesystem_size_bytes))  > 75
    for: 60m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} Disk usage is high"
      message: "This device's Disk usage has exceeded the thresold of 75% with a value of {{ $value }}."
youngee91 commented 5 years ago

List the monitoring checks to implement in both external and internal monitoring tools

For internal checks, we focus on the Kubernetes cluster, the VMs and the physical nodes.

Most of the monitoring is done with Prometheus, while the physical nodes are monitored with the Proxmox GUI.

With regards to external monitoring, we exposed an Nginx web server and used an external monitoring tool to monitor its health. If the web server goes down, the external monitor notifies us and we can investigate further, for example whether the whole DC is down.

Among the external monitoring tools, we feel that Freshping is the most user-friendly and easy to use, and it lets us choose the test location. Since Singapore is available as a test location, the checks are not blocked by our firewall rules. The downside of Freshping is that currently, we can't monitor our sites with the HTTPS protocol due to the SSL certificate checks.

Here's a comparison of external monitoring tools.

Monitoring Tools (Free)

Uptime Robot
  Pros: Simple UI; easy to set up; prompt email notifications.
  Cons: Limited features and graphical UI.

Freshping
  Pros: User-friendly UI; check interval of 1 or 5 minutes; able to choose the test location; 90 days of historical uptime; able to create a public status page for monitored sites; integrates with Slack etc.
  Cons: Unable to ignore SSL certificate checks.

Cula.io
  Pros: User-friendly UI; content check available; integrates with Slack etc.
  Cons: Limited test locations; fixed check interval of 2 minutes.

StatusCake
  Pros: Easy to set up; integrates with multiple platforms such as Telegram, OpsGenie, Slack etc.
  Cons: Limited features in the free version; confusing UI.

Monitoring Tools (Paid)

Pingdom
  Pros: User-friendly UI; visitor insights; multiple features such as speed tests; able to schedule planned maintenance to pause monitoring; wide range of check intervals available.
  Cons: Test locations are categorised by region.

Uptimia
  Pros: Wide range of test locations available; generates reports to track the overall health of monitors; root cause analysis available.
  Cons: A minimum of 10 test locations is required.

Supermonitoring
  Pros: Easy to set up; able to integrate with the Google Analytics panel.
  Cons: Limited features; unable to select the test location.

Monitis
  Pros: Flexible, build-your-own dashboards; multiple graphical charts available; able to export charts in PDF and CSV format.
  Cons: Intermediate-level knowledge required.

Binary Canary
  Pros: Escalation profiles available in the event of monitor failure; easy to set up.
  Cons: Limited graphical charts for monitoring; test locations are only available in the USA and Ireland.
youngee91 commented 5 years ago

Tie up Prometheus with the Incident Response tool and other monitoring tools (big hint here) to ensure that incidents are responded to in a timely fashion.

We have tied up Prometheus with OpsGenie as our incident response tool. For this test, we deployed an Nginx web server and also monitored it with an external monitoring tool. We modified the deployment to make the pods crash, and the external monitoring tool promptly picked up that the web server was down. Within a few minutes, Prometheus fired the alert and it was sent to OpsGenie, which is also integrated with Slack.

Subsequently, we brought the Nginx web server up again and the recovery was promptly reflected on the external monitoring tool, as well as in OpsGenie and Slack. With regards to this, may I ask whether I should continue to send alerts to Slack with Alertmanager, or should I just let OpsGenie forward the alerts to Slack?

eugene-chow commented 5 years ago

Sorry for the late reply. Good work there ;) Now you have 360-degree monitoring.

The main alerts on OPNsense are status of instance, CPU usage, Memory usage and Disk capacity.

Did you verify that the alerts work by running, let's say, the stress tool?

physical nodes are monitored with Proxmox GUI

Install node_exporter in Proxmox. Proxmox is a critical resource so it's not advisable to rely on eyeballing to monitor it.

The downside of Freshping is that currently, we can't monitor our sites with the HTTPS protocol due to the SSL certificate checks.

You can get free certs from Let's Encrypt that should be recognised by Freshping.
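For example, if cert-manager is running in the cluster (just one possible way to automate this, not necessarily what you have set up), a Let's Encrypt ClusterIssuer could look roughly like this, with the email address as a placeholder:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory   # Let's Encrypt production endpoint
    email: ops@example.com                                   # placeholder contact email
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    solvers:
      - http01:
          ingress:
            class: nginx                                     # assumes an nginx ingress controller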

deployed an Nginx web server and also monitored it with an external monitoring tool

This webserver is a pod in the k8s cluster? It's ok to do that but it must have several replicas. Be aware that putting it there comes with a risk of it going down during k8s upgrades which will result in a barrage of CRITICAL alerts and OpsGenie calling everyone in the team and filing a ticket.

This can be avoided by configuring a maintenance window when performing an upgrade or a task that may take down this "CRITICAL" piece of the infra.

should continue to send alerts to Slack with Alertmanager, or should I just let OpsGenie forward the alert to Slack?

It depends. One approach can be:

  1. Send WARNING alerts only to Slack
  2. Send CRITICAL alerts to OpsGenie which then sends to Slack

For WARNING alerts, you don't want OpsGenie to call you incessantly during the wee hours. OpsGenie will only bug you if the alert is CRITICAL. Configuring the monitoring system takes quite some fine-tuning to get it right.

youngee91 commented 5 years ago

Did you verify that the alerts work by running, let's say, the stress tool?

I did test the alerts by simulating high CPU usage with the commands yes > /dev/null and for i in 1 2; do while : ; do : ; done & done, where the number of values in the loop list is the number of cores to load. As suggested, I have also used the stress tool, and the alerts are all working fine.

Install node_exporter in Proxmox. Proxmox is a critical resource so it's not advisable to rely on eyeballing to monitor it.

Noted on this. node_exporter has been installed on all the Proxmox nodes, and its metrics are exposed to Prometheus in the k8s cluster.
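As a rough sketch of one way to wire external node_exporters into the in-cluster Prometheus Operator (the namespace, labels and IP addresses are placeholders, and our actual setup may differ, e.g. it could use additionalScrapeConfigs instead):

apiVersion: v1
kind: Service
metadata:
  name: proxmox-node-exporter
  namespace: monitoring                # placeholder namespace
  labels:
    app: proxmox-node-exporter
spec:
  clusterIP: None                      # headless Service with no selector, pointing at external hosts
  ports:
    - name: metrics
      port: 9100
---
apiVersion: v1
kind: Endpoints
metadata:
  name: proxmox-node-exporter          # must match the Service name
  namespace: monitoring
subsets:
  - addresses:
      - ip: 192.0.2.11                 # placeholder Proxmox node IPs
      - ip: 192.0.2.12
    ports:
      - name: metrics
        port: 9100
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: proxmox-node-exporter
  namespace: monitoring
  labels:
    release: prometheus-operator       # placeholder; must match the serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: proxmox-node-exporter
  endpoints:
    - port: metrics
      interval: 30s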

This webserver is a pod in the k8s cluster? It's ok to do that but it must have several replicas. Be aware that putting it there comes with a risk of it going down during k8s upgrades which will result in a barrage of CRITICAL alerts and OpsGenie calling everyone in the team and filing a ticket. This can be avoided by configuring a maintenance window when performing an upgrade or a task that may take down this "CRITICAL" piece of the infra.

Yes, the webserver is a pod in the k8s cluster. We will increase the number of replicas of the webserver; a rough sketch is shown below. On OpsGenie, it seems like only the Enterprise plan has the maintenance window function. In the meantime, I am also looking into other incident management tools.
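For reference, a minimal sketch of what the webserver Deployment with several replicas could look like (the name, image tag and replica count are placeholders, not our exact manifest):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-monitor-target           # placeholder name
spec:
  replicas: 3                          # several replicas so a single pod failure doesn't take the site down
  selector:
    matchLabels:
      app: nginx-monitor-target
  template:
    metadata:
      labels:
        app: nginx-monitor-target
    spec:
      containers:
        - name: nginx
          image: nginx:1.17            # placeholder pinned version
          ports:
            - containerPort: 80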

It depends. One approach can be: (1) send WARNING alerts only to Slack; (2) send CRITICAL alerts to OpsGenie which then sends to Slack.

Noted on this. Configured as suggested.

global:
  slack_api_url: 'https://hooks.slack.com/services/<token>'
route:
  receiver: 'slack-notifications'
  group_by: [alertname, datacenter, app]
  group_wait: 5m
  group_interval: 10m
  routes:
    # CRITICAL alerts are routed to OpsGenie, which then forwards them to Slack
    - match:
        severity: critical
      receiver: opsgenie-notifications
receivers:
- name: 'slack-notifications'
  slack_configs:
    - channel: '#alerts'
      send_resolved: true
      api_url: https://hooks.slack.com/services/<token>
      username: '{{ template "slack.default.username" . }}'
      color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
      title: '{{ template "slack.default.title" . }}'
      title_link: '{{ template "slack.default.titlelink" . }}'
      text: >-
        {{ range .Alerts }}
          *Alert:* {{ .Labels.alertname }} - `{{ .Labels.severity }}`
          *Description:* {{ .Annotations.message }}
          *Runbook:* <{{ .Annotations.runbook_url }}|:spiral_note_pad:>
          *Details:*
          {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
          {{ end }}
        {{ end }}
- name: 'opsgenie-notifications'
  opsgenie_configs:
    - api_key: <token>
      teams: JayCloud
eugene-chow commented 5 years ago

Good work. Thanks!