Explore external monitoring and incident-response tools

eugene-chow commented 5 years ago

Internal monitoring via Prometheus will work fine except for the scenario, hopefully hypothetical, where the entire DC goes down. To be alerted to such a disaster, you need external monitoring. See this discussion.

Monitoring can be categorised in two ways: push-based and pull-based. The former approach is more common and is adopted by Nagios, Sensu and the TICK stack. Prometheus and Grafana belong to the latter group. Most likely, you'll need both kinds of monitoring.

With monitoring in place, you will also need an incident-response tool to manage on-call scheduling, alerts and incident tickets.

Tasks

- [ ] Explore SaaS options for pull-based and push-based monitoring
  - [ ] Share why both approaches are necessary
  - [ ] Document the pros and cons of the products reviewed
- [x] Explore both free and paid incident response tools
  - [x] Document the pros and cons of the products reviewed

youngee91 commented 5 years ago

Explore both free and paid incident response tools

Below is a comparison of various incident management tools. Most of them are paid. For a small team, the free version of OpsGenie should have sufficient features to handle the workload. NgDesk is fairly new and there is limited information about it. As of now, I couldn't find any way to integrate NgDesk with Prometheus.

| Incident Management Tool | Pros | Cons |
| --- | --- | --- |
| OpsGenie | Free version available; listed as one of the receivers for Alertmanager config; interface is easy to navigate; alerts sent to Slack can be acknowledged etc. by users; mobile app available | Free version has limited features; slightly difficult to control the schedule |
| PagerDuty | Extensive list of integrations available; listed as one of the receivers for Alertmanager config; detailed information available, such as report analysis; mobile app available | No free version available; interface is slightly complicated to navigate |
| VictorOps | Listed as one of the receivers for Alertmanager config; various reports, such as post-incident review and incident frequency, are available in the Enterprise version; routing keys route alerts to the team best suited to resolve them; mobile app available | No free version available |
| xMatters | Free version available; mobile app available | Additional steps are required to add Prometheus integration; interface is difficult to navigate; limited graphical visualisation |

Explore SaaS options for pull-based and push-based monitoring

Still in the midst of understanding this. Do correct me if I am wrong: with both push and pull monitoring in place, does push monitoring compensate by alerting in real time, since pull monitoring scrapes at a regular interval and might miss an issue that happens between scrapes?

eugene-chow commented 5 years ago

I'm curious about xMatters. I wanted to explore it but never received the registration email (?!). Please share with me on Wed.

Pull-based monitoring scrapes an interface that is known to be up. If the service is taken down for maintenance, alerts will be generated. That's OK as long as the alerts don't trickle upwards to your management, which can happen in larger teams and which you definitely don't want. This can be mitigated by configuring a maintenance period.
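To illustrate the maintenance-period idea, here is a minimal sketch of how a pull-based checker could suppress alerts during planned downtime. `MaintenanceWindow` and `should_alert` are hypothetical names made up for this example, not part of Prometheus or any SaaS product.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class MaintenanceWindow:
    """A planned period during which failed checks should not raise alerts."""
    start: datetime
    end: datetime

    def covers(self, moment: datetime) -> bool:
        # The start is inclusive and the end is exclusive.
        return self.start <= moment < self.end


def should_alert(probe_ok: bool, checked_at: datetime,
                 windows: list[MaintenanceWindow]) -> bool:
    """Alert only when the probe failed outside every maintenance window."""
    if probe_ok:
        return False
    return not any(window.covers(checked_at) for window in windows)
```

A failed scrape inside a configured window is then ignored, while the same failure outside the window still alerts.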

Push-based monitoring sends a "ping" to the cronjob monitoring service every couple of minutes. It's like a heartbeat. Maintenance or not, this job should never be stopped. Ideally, the ping should be sent from the internal monitoring server (e.g. Prometheus) to the cronjob monitor. If the heartbeat stops, chances are: (a) the job got stopped, (b) the VM that hosts the cronjob died, (c) the internal monitoring agent died, (d) the network failed. You'll want the cronjob monitor to alert for either (c) or (d).
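On the receiving side, the cronjob monitor only has to notice when the heartbeat goes quiet. A rough sketch of that check, assuming a two-minute interval plus a grace period (both values, and the `heartbeat_overdue` name, are assumptions for illustration):

```python
from datetime import datetime, timedelta


def heartbeat_overdue(last_ping: datetime, now: datetime,
                      interval: timedelta = timedelta(minutes=2),
                      grace: timedelta = timedelta(minutes=1)) -> bool:
    """Return True when no ping has arrived within interval + grace."""
    return now - last_ping > interval + grace
```

The grace period is there so that one slightly late ping doesn't page anyone. The monitor cannot tell which of (a) to (d) happened, only that the pings stopped.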

Another good reason to use push-based monitoring in this scenario is that company IT policy doesn't permit a SaaS monitor to ping a service that's deep in the cluster. Under normal circumstances, the monitoring server should not be pingable from the outside. The best way to work around this is for the monitoring server to ping the external service instead. Outbound requests like this are usually permitted by IT policy.
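The outbound ping itself can be very small. A sketch using only the Python standard library, where `HEARTBEAT_URL` is a placeholder (a real cronjob monitor would issue its own per-check URL) and `send_heartbeat` is a hypothetical helper:

```python
import urllib.error
import urllib.request

# Placeholder endpoint, an assumption for this example only.
HEARTBEAT_URL = "https://cron-monitor.example.com/ping/your-check-id"


def send_heartbeat(url: str = HEARTBEAT_URL, timeout: float = 10.0,
                   opener=urllib.request.urlopen) -> bool:
    """Make one outbound GET and report whether the monitor accepted it."""
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False
```

Run from cron on the monitoring server every couple of minutes, this only needs outbound HTTPS, so nothing inside the cluster has to be exposed. The `opener` parameter exists only so the function can be exercised without network access.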

TL;DR: There's no single correct answer. It depends on the org's IT needs.

CloudCommandos / infra

Explore external monitoring and incident-response tools #2