note: "oncall engineers" = CMS Platform oncall rotation
Per refinement just now, we agreed to go straight to PagerDuty (not to Datadog, then PD).
The current escalation policy escalates to the Tier 2 on-call after 15 minutes:
Here is an example of an incident in PagerDuty. There is an "acknowledged" message that comes from the on-call person (though the incident can be acknowledged by anyone). The first message was later updated to "resolved", but initially said "unresolved" or "incident". All of this happened right away when the first-tier on-call was notified.
https://dsva.slack.com/archives/CT4GZBM8F/p1639571500388000
We may need to work with the Infra team to do some of the admin tasks needed to integrate Slack with PagerDuty. That is what I had to do the last time we set this up.
Previously, failure notification consisted of GitHub Actions (GHA) posting a message directly to the #vfs-platform-builds channel on Slack. This was changed to post an event to Datadog via its Events API: https://docs.datadoghq.com/api/latest/events/
Here is an example event: https://app.datadoghq.com/event/event?id=6356879810337751513
Since that event may no longer be retained, here is a screenshot of the event information:
The event provides a link back to the build. Events from the same content-build run are aggregated so that they do not trigger separate incidents (a sketch of posting such an event follows).
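For illustration, here is a minimal sketch of posting such an event with Python against the documented v1 Events endpoint. The title, text, tags, and aggregation key below are hypothetical stand-ins, not the exact values the GHA workflow sends.

```python
import os

import requests

# Post a content-build failure event to the Datadog Events API.
# https://docs.datadoghq.com/api/latest/events/
resp = requests.post(
    "https://api.datadoghq.com/api/v1/events",
    headers={"DD-API-KEY": os.environ["DD_API_KEY"]},
    json={
        "title": "content-build failed",                      # illustrative title
        "text": "Content release failed. See the build log.", # would include the build link
        "alert_type": "error",
        # Events that share an aggregation_key are grouped, so one
        # content-build run does not open multiple incidents.
        "aggregation_key": "content-build-<run-id>",          # hypothetical key
        "tags": ["project:content-build", "env:prod"],        # hypothetical tags
    },
    timeout=10,
)
resp.raise_for_status()
```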
A Datadog monitor is set up to watch for these events. If the monitor sees any error/failure events, it notifies the PagerDuty service @CMS_Engineers_Critical. The monitor can be seen here: https://app.datadoghq.com/monitors/62075516
The monitor passes the event information along to PagerDuty and sets the incident to High priority (a sketch of such a monitor definition follows).
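As a rough sketch of what that monitor amounts to, here is how a similar event monitor could be created through the Datadog Monitors API. The query, name, and tag values are assumptions; the real monitor is the one linked above. The @pagerduty-CMS_Engineers_Critical mention in the message is what routes the alert to the PagerDuty service, and the monitor message is forwarded as the incident body.

```python
import os

import requests

# Create an event monitor that fires on any content-build error event.
resp = requests.post(
    "https://api.datadoghq.com/api/v1/monitor",
    headers={
        "DD-API-KEY": os.environ["DD_API_KEY"],
        "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
    },
    json={
        "type": "event-v2 alert",
        "name": "content-build failure",  # hypothetical monitor name
        # Fire when at least one matching error event arrives in the window.
        "query": 'events("status:error tags:project:content-build")'
                 '.rollup("count").last("5m") > 0',
        # The @pagerduty-<service> mention routes the notification to PagerDuty.
        "message": "Content release failed. @pagerduty-CMS_Engineers_Critical",
        "priority": 1,  # P1 / High priority
    },
    timeout=10,
)
resp.raise_for_status()
```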
Once it receives the message, PagerDuty notifies the primary engineer on call for the CMS Engineers Critical / Noncritical escalation policy: https://dsva.pagerduty.com/escalation_policies#PW3ZKRA
The primary engineer is notified through the mechanisms of their choice. If the primary engineer does not acknowledge within 15 minutes, the secondary engineer on call is notified (a sketch of such a policy follows).
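For reference, here is a minimal sketch of how a two-tier policy like that could be expressed via the PagerDuty REST API. The policy name matches the one described above, but the schedule IDs (PPRIMARY, PSECONDARY) are hypothetical placeholders; the real policy is the one linked above.

```python
import os

import requests

# Define an escalation policy: primary on-call first, then the secondary
# after 15 unacknowledged minutes.
resp = requests.post(
    "https://api.pagerduty.com/escalation_policies",
    headers={
        "Authorization": f"Token token={os.environ['PAGERDUTY_API_KEY']}",
        "Content-Type": "application/json",
        "Accept": "application/vnd.pagerduty+json;version=2",
    },
    json={
        "escalation_policy": {
            "type": "escalation_policy",
            "name": "CMS Engineers Critical / Noncritical",
            "escalation_rules": [
                {
                    # Rule 1: page the primary on-call schedule; escalate to
                    # the next rule if unacknowledged after 15 minutes.
                    "escalation_delay_in_minutes": 15,
                    "targets": [{"id": "PPRIMARY", "type": "schedule_reference"}],
                },
                {
                    # Rule 2: page the secondary on-call schedule.
                    "escalation_delay_in_minutes": 15,
                    "targets": [{"id": "PSECONDARY", "type": "schedule_reference"}],
                },
            ],
        }
    },
    timeout=10,
)
resp.raise_for_status()
```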
At the time of incident creation, PagerDuty also sends a message to Slack. Currently this goes to the #cms-notifications channel. This is a recent change: PagerDuty Slack notifications previously went to #cms-team, but that channel collectively agreed it was preferable for the noise to live elsewhere.
The Slack and PagerDuty integration gives anyone who wants it insight into the status of an incident: whether it has yet to be acknowledged, is acknowledged, or is resolved. Engineers with PagerDuty access can also manage the alert directly from Slack.
Notifications of content release run starts now go to #cms-notifications.
Notifications of broken link issues now go to #cms-notifications.
There is some discussion about the most appropriate place for these to be collected; the location may change.
Description
Currently, when the hourly content release fails within business hours (8am-5pm EST), it signals this by sending a message to the DSVA Slack channel #vfs-platform-builds with @here, alerting everyone in the channel. We would like to transition content release failures to a more formal PagerDuty policy, which looks something like the following upon content release failure:

1. PagerDuty notifies the primary on-call engineer.
2. If the incident is not acknowledged within 15 minutes, PagerDuty notifies the secondary on-call engineer.
3. If the incident is still unacknowledged, post @here in #vfs-platform-builds. It should be sufficiently broad that someone is guaranteed to respond to this 3rd notification.

Acceptance Criteria
Implementation
CMS Team
Please leave only the team that will do this work selected. If you're not sure, it's fine to leave both selected.
Platform CMS Team
Sitewide CMS Team