Transition notification of content release failures to PagerDuty

timcosgrove commented 2 years ago

Description

Currently, when an hourly content release fails during business hours (8am-5pm EST), a message is sent to the DSVA Slack channel #vfs-platform-builds with @here, alerting everyone in the channel. We would like to transition content release failures to a more formal PagerDuty policy, which would look something like this:

upon content release failure:

Primary CMS Platform on-call engineer is notified via PagerDuty. A message should also go to Slack to report the failure and indicate the primary on-call engineer has been alerted.
If the primary CMS Platform on-call does not acknowledge the failure within X minutes (amount to be discussed, probably 10 minutes), the secondary on-call engineer is alerted.
If the secondary CMS Platform on-call engineer does not acknowledge within 10 minutes of their notification, a message goes to a wider group of people. This could be similar to what we have now, i.e. @here in #vfs-platform-builds. It should be sufficiently broad that someone is guaranteed to respond to this 3rd notification.
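
To make the proposed timing concrete, here is a minimal sketch of the three-step escalation described above. It is illustrative only: the 10-minute values are the "probably 10 minutes" placeholder from this issue, not a decided number, and the names `EscalationStep`, `PROPOSED_POLICY`, and `notified_targets` are hypothetical. In practice PagerDuty's escalation policy would do this for us.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    target: str
    delay_minutes: int  # minutes after the failure at which this target is notified


# Assumed timings: the issue leaves "X minutes" open, so 10 is a placeholder.
PROPOSED_POLICY = [
    EscalationStep("primary on-call", 0),
    EscalationStep("secondary on-call", 10),
    EscalationStep("@here in #vfs-platform-builds", 20),
]


def notified_targets(elapsed_minutes, acknowledged_at=None, policy=PROPOSED_POLICY):
    """Return the targets notified so far; escalation stops once someone acknowledges."""
    targets = []
    for step in policy:
        if step.delay_minutes > elapsed_minutes:
            break  # this step (and all later ones) has not come due yet
        # The first step always fires; later steps fire only if nobody
        # acknowledged before the step came due.
        if (
            step.delay_minutes > 0
            and acknowledged_at is not None
            and acknowledged_at <= step.delay_minutes
        ):
            break
        targets.append(step.target)
    return targets
```

For example, `notified_targets(25, acknowledged_at=4)` returns only the primary on-call, because the acknowledgement arrives before the secondary step comes due.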

Acceptance Criteria

[ ] hourly content release failure notifications alert CMS Platform on-call engineers via PagerDuty
[ ] the alert notifies the primary CMS Platform on-call engineer first
[ ] if there is no acknowledgement after a defined amount of time, the secondary on-call engineer is notified
[ ] if there is no acknowledgement of the second notification after a defined amount of time, a wider group is notified of the failure (#vfs-platform-builds)

Implementation

verify with Ops there is an existing PagerDuty to Slack pattern (confirmed)
ensure notification goes to correct service within PagerDuty
notify relevant stakeholders of new change

CMS Team

Please leave only the team that will do this work selected. If you're not sure, it's fine to leave both selected.

[x] Platform CMS Team
[ ] Sitewide CMS Team

mchelen-gov commented 2 years ago

note: "oncall engineers" = CMS Platform oncall rotation

jkalexander7 commented 2 years ago

Hey team! Please add your planning poker estimate with ZenHub @amponce @cweagans @ElijahLynn @indytechcook @ndouglas @olivereri @timcosgrove

ElijahLynn commented 2 years ago

Per refinement just now, we agreed to go straight to PagerDuty (not to Datadog, then PD).
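
For reference, a rough sketch of what "straight to PagerDuty" could look like from the workflow, using the PagerDuty Events API v2 (`https://events.pagerduty.com/v2/enqueue`). The routing key, the `content-release-<run id>` dedup key format, the summary text, and the function names are assumptions for illustration, not anything we have agreed on.

```python
import json
import urllib.request

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def build_trigger_event(routing_key, run_id, run_url):
    """Build an Events API v2 'trigger' body for a failed content release run."""
    return {
        "routing_key": routing_key,  # integration key of the PagerDuty service
        "event_action": "trigger",
        # Same dedup_key for the same run, so retries do not open new incidents.
        "dedup_key": f"content-release-{run_id}",  # hypothetical key format
        "payload": {
            "summary": "Hourly content release failed",
            "source": "github-actions",
            "severity": "critical",
        },
        "links": [{"href": run_url, "text": "Failed workflow run"}],
    }


def build_request(event):
    """Wrap the event in a POST request; pass it to urllib.request.urlopen to send."""
    return urllib.request.Request(
        PAGERDUTY_EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Nothing is sent until the request is passed to `urllib.request.urlopen`, so the payload can be inspected or logged first.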

cweagans commented 2 years ago

https://github.com/marketplace/actions/pagerduty-alert

ElijahLynn commented 2 years ago

Current escalation policy to Tier 2 on-call is 15 minutes:

https://dsva.pagerduty.com/escalation_policies#PW3ZKRA

ElijahLynn commented 2 years ago

Here is an example of an incident in PagerDuty. There is an "acknowledged" message that comes from the on-call person (though anyone can ack it). The first message was updated to "resolved" after a bit, but initially said "unresolved" or "incident". This all happened right away when the first-tier on-call was notified.

https://dsva.slack.com/archives/CT4GZBM8F/p1639571500388000

ElijahLynn commented 2 years ago

We may need to work with the Infra team on some of the admin tasks within PagerDuty needed to integrate with Slack. That is what I had to do the last time we set it up.

timcosgrove commented 2 years ago

Implementation

Failure notification

Previously, failure notification consisted of GitHub Actions (GHA) posting a message directly to the #vfs-platform-builds channel on Slack. This was changed to post an event to Datadog via its Events API: https://docs.datadoghq.com/api/latest/events/

Here is an example event: https://app.datadoghq.com/event/event?id=6356879810337751513

Since that event may no longer be retained, this is a screenshot of the event information:

The event provides a link back to the build. Events from the same content-build run are aggregated so that they will not trigger separate incidents.
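
As a rough illustration of the event described above, here is a sketch of a body that could be posted to the Datadog v1 Events API (`POST /api/v1/events` with a `DD-API-KEY` header). The title, text, tag, and `aggregation_key` format are assumptions for illustration and may not match what the workflow actually sends; the point is that every event from one run shares an aggregation key.

```python
import json
import urllib.request

# Assumed US1 site, matching the app.datadoghq.com links in this comment.
DATADOG_EVENTS_URL = "https://api.datadoghq.com/api/v1/events"


def build_failure_event(run_id, run_url):
    """Build an error event that links back to the failed build."""
    return {
        "title": "Content release failed",  # hypothetical title
        "text": f"Hourly content release failed. Build: {run_url}",
        "alert_type": "error",  # the monitor looks for error/failure events
        # Events sharing this key are grouped together by Datadog.
        "aggregation_key": f"content-release-{run_id}",  # hypothetical format
        "tags": ["workflow:content-release"],  # hypothetical tag
    }


def build_request(event, api_key):
    """Wrap the event in a POST request; pass it to urllib.request.urlopen to send."""
    return urllib.request.Request(
        DATADOG_EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json", "DD-API-KEY": api_key},
        method="POST",
    )
```

Two events built for the same run ID carry the same `aggregation_key`, which is what keeps repeated failures within one run from triggering separate incidents.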

Datadog monitor

A Datadog monitor is set up to look for these events. If the monitor sees any error/failure events, it notifies the PagerDuty service @CMS_Engineers_Critical. This monitor can be seen here: https://app.datadoghq.com/monitors/62075516

The monitor passes the event information along to PagerDuty and creates the incident with High priority.

PagerDuty setup

Once it receives the message, PagerDuty notifies the primary engineer on call for the CMS Engineers Critical / Noncritical escalation policy: https://dsva.pagerduty.com/escalation_policies#PW3ZKRA

The primary engineer is notified through mechanisms of their choice. If the primary engineer does not acknowledge within 15 minutes, the secondary engineer on call will be notified.

At the time of incident creation, PagerDuty also sends a message to Slack. Currently this is sent to the #cms-notifications channel. This is a new change: PagerDuty Slack notifications previously went to #cms-team, but the channel collectively preferred that the noise go elsewhere.

The integration of Slack & PagerDuty gives anyone who desires it insight into the status of the incident - whether it's yet to be acknowledged, or is acknowledged, or is resolved. Engineers with PagerDuty access can also manage the alert directly from Slack.

Other changes

Notifications of content release run starts now go to #cms-notifications.

Notifications of broken link issues now go to #cms-notifications.

There is some discussion about the most appropriate place to collect these, so the location may change.

department-of-veterans-affairs / va.gov-cms