choldgraf opened this issue 2 years ago
I think the discussion in Slack @yuvipanda and I had around https://www.pagerduty.com/ and https://www.atlassian.com/software/opsgenie will be relevant here. The idea is that for major outages, there should be an on-call engineer to respond, so that the support steward isn't the only one who shoulders responsibility for updating the client and exploring the issue.
support steward isn't the only one who shoulders responsibility for updating the client and exploring the issue.
IMHO, the support steward should not "feel" responsible for exploring the issue (although they might be involved in updating the client). I guess this distinction is actually part of the discussion in #1068.
@damianavila sure, but I've certainly been posting in the Slack channel about outages and just had to do my best until someone came online. I think having someone to page would help with that feeling of "I can't do anything more right now".
Given that we have a PR open to define an incident commander and a more complex response process:
Should we re-scope this issue to explicitly be about "pager-style" escalation practices? E.g., some system to ping a specific person via a non-Slack method if a particular problem emerges?
Or, should we consider https://github.com/2i2c-org/team-compass/pull/422 to be enough, close this, iterate with that system for a bit, and then decide whether we need something like a dedicated pager?
Should we re-scope this issue to explicitly be about "pager-style" escalation practices?
Yes! I think this is important as otherwise the 'currently awake' people can feel pretty overwhelmed sometimes.
I think specifically for outages, as a short-term, non-scalable measure, I'm always happy to be alerted via non-Slack methods (I think most people have my phone number). I think that's an important senior engineer responsibility.
I think that's an important senior engineer responsibility.
I totally appreciate that @yuvipanda, BUT we need to find a way/process so we do not need to ping you on your personal phone number. So +1 on repurposing this one to be about the "pager-style" tool.
OK, I've re-worked the top comment in this one to focus more on Pager-style updates. Also added some links!
@damianavila totally agree this isn't sustainable long term! I just wanted to volunteer for that right now, since outages and escalations will continue to happen while we figure out the process.
After #1687, @jmunroe and I are investigating using PagerDuty primarily for incident response (not using any of the automated alerting features).
I'm also looking at OpsGenie - in particular, they have a 'quiet hours' feature that seems more intuitive than PagerDuty's scheduling. However, the OpsGenie Slack integration doesn't let you trigger a new incident from Slack, which is boo :(
It's absolutely important that engineers can control how and when they are notified - we don't want this to become a traditional 'on-call' situation.
@jmunroe and I spent a bunch of time talking about this, in particular role-playing how it should have played out with today's UToronto outage.
A proposed workflow: after an incident is created, PagerDuty evaluates the current local time zone and stated preferences of all engineers, and sends everyone who wants to be notified at that point a notification via the method of their choosing (SMS, app, phone call). Engineers can then acknowledge the alert if they are able to help, or do nothing if they can't. If nobody has acknowledged the alert after an hour (or some other configurable period), it automatically escalates to an opt-in group of second-tier folks who can respond.
We're going to play with PagerDuty rules to try to make this possible.
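To make the two-tier idea above a bit more concrete, here is a minimal sketch (not something we've set up or tested) of creating such an escalation policy via the PagerDuty REST API v2 using Python's `requests`. The API token, user IDs, and policy name are placeholders; how each engineer is actually contacted (SMS, app, phone call) lives in that user's own PagerDuty notification rules, and the time-zone / quiet-hours preferences discussed above are not captured by the escalation policy itself.

```python
# Sketch only: a two-tier escalation policy where everyone in the first tier
# is notified when an incident opens, and an opt-in second tier is paged if
# nobody acknowledges within 60 minutes. All IDs and names are hypothetical.
import requests

PAGERDUTY_API = "https://api.pagerduty.com/escalation_policies"
API_TOKEN = "REPLACE_ME"  # hypothetical read-write API token

FIRST_TIER = ["PUSER01", "PUSER02", "PUSER03"]  # engineers who opted in to be paged first
SECOND_TIER = ["PUSER10", "PUSER11"]            # opt-in second-tier responders

policy = {
    "escalation_policy": {
        "type": "escalation_policy",
        "name": "Hub incident response (sketch)",
        "escalation_rules": [
            {
                # Everyone in the first tier is notified immediately; if nobody
                # acknowledges within 60 minutes, escalate to the next rule.
                "escalation_delay_in_minutes": 60,
                "targets": [{"id": uid, "type": "user_reference"} for uid in FIRST_TIER],
            },
            {
                # Second tier is only paged if the first tier does not acknowledge.
                "escalation_delay_in_minutes": 60,
                "targets": [{"id": uid, "type": "user_reference"} for uid in SECOND_TIER],
            },
        ],
        "num_loops": 1,  # don't automatically loop back to the first tier
    }
}

response = requests.post(
    PAGERDUTY_API,
    json=policy,
    headers={
        "Authorization": f"Token token={API_TOKEN}",
        "Content-Type": "application/json",
    },
)
response.raise_for_status()
print(response.json()["escalation_policy"]["id"])
```

If PagerDuty's escalation rules behave roughly this way in practice, the "anyone can acknowledge, nobody is singled out" behaviour comes for free: all first-tier targets are notified at once, and escalation only happens if no one acknowledges within the delay.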
However, even without any of this, it still provides value by streamlining the 'incident response' process and removing a medium (GitHub) that is currently a bit of a bottleneck - instead moving it all to Slack.
Removing myself as I'm not currently working on it. #1804 is related though, and I am working on that.
However, to prevent the perfect from being the enemy of the good, please do consider that anyone on the team can always reach out to me at any time to escalate an outage.
Context
Hubs will experience outages of different magnitudes, and these should trigger varying degrees of response from our team. We want to find a balance between sustainable practices for our team, and ensuring that our communities don't feel too much pain from outages.
We have an Incident Commander-style process for handling the roles / communication / etc. during incidents. However, we have not yet defined a process for escalating alerts and notifications to specific people when two conditions are true:
Proposal
We should define some kind of Pager-style mechanism that can actively ping certain team members during incidents where their time is needed. We should define this process in a way that:
A rough approach is to define an on-call engineer who makes themselves available to be actively pinged in the event that an incident is declared. This role would then cycle through our engineering team over time, so that no single team member must respond to incidents too often. A minimal sketch of what such a rotation could look like is below.
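As an illustration only, here is a tiny Python sketch of the rotation idea. The team names, start date, and weekly cadence are all assumptions; in practice the rotation would be configured in whatever paging tool we adopt (e.g. a PagerDuty or OpsGenie schedule) rather than in a script.

```python
# Minimal sketch of a weekly on-call rotation cycling through the engineering
# team. Names, start date, and the one-week cadence are hypothetical.
from datetime import date

ENGINEERS = ["engineer-a", "engineer-b", "engineer-c", "engineer-d"]  # hypothetical
ROTATION_START = date(2022, 10, 3)  # an arbitrary Monday anchoring the cycle


def on_call_engineer(today: date) -> str:
    """Return which engineer is on call for the week containing `today`."""
    weeks_elapsed = (today - ROTATION_START).days // 7
    return ENGINEERS[weeks_elapsed % len(ENGINEERS)]


print(on_call_engineer(date.today()))
```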
References