CloudCommandos / infra

Explore external monitoring and incident-response tools #2

Open eugene-chow opened 5 years ago

eugene-chow commented 5 years ago

Internal monitoring via Prometheus will work fine except for the scenario, hopefully hypothetical, where the entire DC goes down. To be alerted to such a disaster, you need external monitoring. See this discussion.

Monitoring can be categorised in two ways: push-based and pull-based. The former approach is more common and is adopted by Nagios, Sensu and the TICK stack. Prometheus and Grafana belong to the latter group. Most likely, you'll need both kinds of monitoring.

With monitoring in place, you will need to have an incident-response tool to manage on-call scheduling, alerts and incident tickets.

Tasks

youngee91 commented 5 years ago

Explore both free and paid incident response tools

Below is a comparison of various incident management tools. Most of them are paid. For a small team, the free version of OpsGenie should have sufficient features to handle the workload. NgDesk is fairly new and has limited information available. As of now, I couldn't find any way to integrate NgDesk with Prometheus.

| Incident Management Tool | Pros | Cons |
| --- | --- | --- |
| OpsGenie | Free version available; listed as one of the receivers for Alertmanager config; interface is easy to navigate; alerts sent to Slack can be acknowledged etc. by users; mobile app available | Free version has limited features; slightly difficult to control the schedule |
| PagerDuty | Extensive list of integrations available; listed as one of the receivers for Alertmanager config; detailed information available, such as report analysis; mobile app available | No free version available; interface is slightly complicated to navigate |
| VictorOps | Listed as one of the receivers for Alertmanager config; various reports, such as post-incident review and incident frequency, are available in the Enterprise version; routing keys route alerts to the team best suited to resolve them; mobile app available | No free version available |
| xMatters | Free version available; mobile app available | Additional steps are required to add Prometheus integration; interface is difficult to navigate; limited graphical visualisation |

Explore SaaS options for pull-based and push-based monitoring

Still in the midst of understanding this. Do correct me if I am wrong: when using both push and pull monitoring, the push monitoring compensates by providing real-time alerts, since pull monitoring scrapes at a regular interval and might miss an issue that occurs between scrapes?

eugene-chow commented 5 years ago

I'm curious about xMatters. I wanted to explore it but never received the registration email (?!). Please share with me on Wed.

Pull-based monitoring scrapes an interface that is expected to be up. If the service is taken down for maintenance, alerts will be generated. That's ok if the alert doesn't trickle upwards to your management, but in larger teams it may, and you definitely don't want that. This can be mitigated by configuring a maintenance period.
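As a rough sketch of the maintenance-period idea: alerts for a down target are suppressed while a maintenance window is active. The window list and the function names here are made up for illustration; in a real setup a tool such as Alertmanager would typically handle this through silences rather than custom code.

```python
from datetime import datetime, timezone

# Hypothetical maintenance windows as (start, end) pairs in UTC.
MAINTENANCE_WINDOWS = [
    (
        datetime(2024, 1, 10, 2, 0, tzinfo=timezone.utc),
        datetime(2024, 1, 10, 4, 0, tzinfo=timezone.utc),
    ),
]


def in_maintenance(now, windows=MAINTENANCE_WINDOWS):
    """Return True if `now` falls inside any window (start inclusive, end exclusive)."""
    return any(start <= now < end for start, end in windows)


def should_escalate(target_up, now, windows=MAINTENANCE_WINDOWS):
    """Escalate only when the scrape target is down outside a maintenance window."""
    return (not target_up) and not in_maintenance(now, windows)
```

With this, a failed scrape at 03:00 UTC on the day of the window is ignored, while the same failure at 05:00 UTC is escalated.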

Push-based monitoring sends a "ping" to the cronjob monitoring service every couple of minutes. It's like a heartbeat. Maintenance or not, this job should never be stopped. Ideally, the heartbeat is sent from the internal monitoring server (e.g. Prometheus) to the cronjob monitor. If the heartbeat stops, chances are: (a) the job got stopped, (b) the VM that hosts the cronjob died, (c) the internal monitoring agent died, (d) the network failed. You'll want the cronjob monitor to alert for either (c) or (d).
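The monitor's side of this heartbeat can be sketched as a simple "dead man's switch" check. This is only an illustration of the logic, not how any particular cronjob monitoring service implements it; the function name, the 2-minute interval and the 1-minute grace period are assumptions.

```python
from datetime import datetime, timedelta, timezone


def heartbeat_missed(
    last_ping,
    now,
    interval=timedelta(minutes=2),  # assumed heartbeat period
    grace=timedelta(minutes=1),     # assumed tolerance for a late ping
):
    """Return True if no ping has arrived within interval + grace."""
    return now - last_ping > interval + grace
```

The monitor cannot tell which of causes (a) to (d) stopped the heartbeat; it only knows the ping is overdue, which is exactly why it still works when the whole DC is unreachable.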

Another good reason to use push-based monitoring in this scenario is that company IT policy often doesn't permit a SaaS monitor to ping a service that's deep in the cluster. Under normal circumstances, the monitoring server should not be pingable from the outside. The best way to work around this is for the monitoring server to ping the external service, which is usually permitted by IT policy.
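A minimal sketch of the outbound ping, using only the Python standard library. The URL is a placeholder, not a real endpoint, and `send_heartbeat` is a hypothetical helper; the actual ping URL and any authentication depend on the external service you pick.

```python
import urllib.error
import urllib.request

# Placeholder URL; a real cronjob monitor would issue its own ping URL.
HEARTBEAT_URL = "https://monitor.example.com/ping/my-check-id"


def send_heartbeat(url=HEARTBEAT_URL, opener=urllib.request.urlopen, timeout=10):
    """Make one outbound request to the external monitor.

    Returns True on a 2xx response, False on any network or HTTP error.
    `opener` is injectable so the function can be exercised without a network.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        return False
```

Only an outbound HTTPS connection from the monitoring server is needed, so nothing inside the cluster has to be exposed. The script would be scheduled to run every couple of minutes.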

TLDR: There's no single correct answer. The correct answer depends on the org's IT needs.