2i2c-org / infrastructure

Infrastructure for configuring and deploying our community JupyterHubs.
https://infrastructure.2i2c.org

Define escalation practices when there are hub outages #1118

Open choldgraf opened 2 years ago

choldgraf commented 2 years ago

Context

Hubs will experience outages of different magnitudes, and these should trigger varying degrees of response from our team. We want to find a balance between sustainable practices for our team, and ensuring that our communities don't feel too much pain from outages.

We have an Incident Commander-style process for handling roles, communication, etc. during incidents. However, we have not yet defined a process for escalating alerts and notifications to specific people when two conditions are true:

Proposal

We should define some kind of Pager-style mechanism that can actively ping certain team members during incidents where their time is needed. We should define this process in a way that:

A rough approach is to define an on-call engineer who makes themselves available to be actively pinged in the event that an incident is declared. This role would then cycle through our engineering team over time, so that no single team member must respond to incidents too often.
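To make the rotation idea concrete, here is a minimal sketch assuming a hypothetical roster and a simple weekly cadence (none of the names or policy details below come from this issue):

```python
# Illustrative only: a weekly on-call rotation picked by ISO week number.
# The roster and cadence here are hypothetical placeholders.
from datetime import date
from typing import Optional

ENGINEERS = ["engineer-a", "engineer-b", "engineer-c", "engineer-d"]

def current_oncall(today: Optional[date] = None) -> str:
    """Return the engineer on call for the ISO week containing `today`."""
    today = today or date.today()
    week = today.isocalendar()[1]
    return ENGINEERS[week % len(ENGINEERS)]

print(f"On call this week: {current_oncall()}")
```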

References

sgibson91 commented 2 years ago

I think the discussion @yuvipanda and I had in Slack around https://www.pagerduty.com/ and https://www.atlassian.com/software/opsgenie will be relevant here. The idea is that for major outages there should be an engineer on call to respond, so that the support steward isn't the only one who shoulders responsibility for updating the client and exploring the issue.

damianavila commented 2 years ago

> support steward isn't the only one who shoulders responsibility for updating the client and exploring the issue.

IMHO, the support steward should not "feel" responsible for exploring the issue (although they might be involved in updating the client). I guess this distinction is actually part of the discussion in #1068.

sgibson91 commented 2 years ago

@damianavila sure, but I've certainly been posting in the Slack channel about outages and just had to do my best until someone came online. I think having someone to page will help that feeling of "I can't do anything more right now".

choldgraf commented 2 years ago

Given that we have a PR open to define an incident commander and a more complex response process:

Should we re-scope this issue to explicitly be about "pager-style" escalation practices? E.g., some system to ping a specific person via a non-Slack method if a particular problem emerges?

Or, should we consider https://github.com/2i2c-org/team-compass/pull/422 to be enough, close this, iterate with that system for a bit, and then decide if we need something like a dedicated Pager?

yuvipanda commented 2 years ago

> Should we re-scope this issue to explicitly be about "pager-style" escalation practices?

Yes! I think this is important as otherwise the 'currently awake' people can feel pretty overwhelmed sometimes.

I think specifically for outages, as a short-term, non-scalable measure, I'm always happy to be alerted via non-Slack methods (I think most people have my phone number). I think that's an important senior engineer responsibility.

damianavila commented 2 years ago

> I think that's an important senior engineer responsibility.

I totally appreciate that @yuvipanda, BUT we need to find a way/process so that we do not need to ping you on your personal phone number. So +1 on repurposing this issue to be about the "pager-style" tool.

choldgraf commented 2 years ago

OK, I've re-worked the top comment in this one to focus more on Pager-style updates. Also added some links!

yuvipanda commented 2 years ago

@damianavila totally agree this isn't sustainable long term! I just wanted to volunteer for that right now, since outages and escalations will continue to happen while we figure out the process.

yuvipanda commented 1 year ago

After #1687, @jmunroe and I are spending effort investigating using PagerDuty primarily for incident response (not using any of the automated alerting features).

Stage 1: Incident Response

Stage 2: Escalation

yuvipanda commented 1 year ago

I'm also looking at OpsGenie - in particular, they have a 'quiet hours' feature that seems more intuitive than PagerDuty's scheduling. However, the OpsGenie Slack integration doesn't let you trigger a new incident from Slack, which is boo :(
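For illustration only, a 'quiet hours' rule boils down to checking an engineer's local time against a declared do-not-page window. The helper below is a hypothetical sketch, not the OpsGenie or PagerDuty API:

```python
# Hypothetical sketch of a "quiet hours" check; not based on any real
# OpsGenie or PagerDuty API.
from datetime import datetime, time
from zoneinfo import ZoneInfo

def can_page(tz: str, quiet_start: time, quiet_end: time) -> bool:
    """Return True if the engineer's local time is outside their quiet hours."""
    now = datetime.now(ZoneInfo(tz)).time()
    if quiet_start <= quiet_end:
        in_quiet = quiet_start <= now < quiet_end
    else:  # window wraps past midnight, e.g. 22:00-07:00
        in_quiet = now >= quiet_start or now < quiet_end
    return not in_quiet

# Example: don't page someone in Toronto between 22:00 and 07:00 local time.
print(can_page("America/Toronto", time(22, 0), time(7, 0)))
```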

yuvipanda commented 1 year ago

It's absolutely important that engineers can control how and when they are notified - we don't want this to become a traditional 'on-call' situation.

yuvipanda commented 1 year ago

@jmunroe and I spent a bunch of time talking about this, and in particular role-playing how this should have played out with today's UToronto outage.

A proposed workflow is that, after the incident is created, we want PagerDuty to evaluate the current local timezone and stated preferences of all engineers, and then send everyone who wants to be notified at that point a notification via a method of their own choosing (SMS, app, phone call). Engineers can then acknowledge the alert if they are able to provide assistance, or do nothing if they can't. After an hour (or some other configurable time period!), if nobody has acknowledged the alert, it'll automatically escalate to an (opt-in) group of second-tier folks who can respond.

We're going to play with PagerDuty rules to try to make this possible.
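As a rough sketch of the acknowledge-or-escalate behaviour described above (the tier membership, timings, and acknowledgement check are hypothetical placeholders, not actual PagerDuty configuration):

```python
# Hypothetical sketch of the proposed flow: notify opted-in engineers first,
# wait for an acknowledgement, then escalate to a second tier after a timeout.
import time

FIRST_TIER = ["engineer-a", "engineer-b"]   # opted in to immediate pages
SECOND_TIER = ["engineer-c"]                # opt-in escalation group
ESCALATION_TIMEOUT = 60 * 60                # one hour, configurable

def notify(people):
    for person in people:
        print(f"Paging {person} via their preferred method (SMS/app/call)")

def anyone_acknowledged() -> bool:
    # Placeholder: a real system would ask the incident tooling.
    return False

def run_escalation():
    notify(FIRST_TIER)
    deadline = time.time() + ESCALATION_TIMEOUT
    while time.time() < deadline:
        if anyone_acknowledged():
            return  # someone is on it, stop here
        time.sleep(60)  # re-check once a minute
    notify(SECOND_TIER)  # nobody acknowledged within the window, escalate
```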

However, even without any of this, it still provides value by streamlining the 'incident response' process and removing a medium (GitHub) that is currently a bit of a bottleneck - and instead moving it all to Slack.

yuvipanda commented 1 year ago

Removing myself as I'm not currently working on it. #1804 is related though, and I am working on that.

yuvipanda commented 1 year ago

However, to prevent the perfect from being the enemy of the good, please do consider that anyone on the team can always reach out to me at any time to escalate an outage.

damianavila commented 11 months ago

Related: https://github.com/2i2c-org/team-compass/issues/763