department-of-veterans-affairs / va.gov-team

Public resources for building on and in support of VA.gov. Visit complete Knowledge Hub:
https://depo-platform-documentation.scrollhelp.site/index.html

Platform SRE Team: Q3 - Update and improve Pager Duty notification and usage #43749

Open little-oddball opened 2 years ago

little-oddball commented 2 years ago

Through conversation and analysis, we have determined that PagerDuty needs attention to be more effective in our escalation and notification process. Currently, the service has no clear owner and no guidance or defined intent, and it is overwhelmed by notifications.

The purpose of this work is to define how Platform should use PagerDuty, align and update the escalation chain, and improve the quality and usefulness of the notifications. A Confluence page outlining many of the observations about the current setup is located here:

https://vfs.atlassian.net/wiki/spaces/~623dda1d761efb0069cef710/pages/2217246751/PagerDuty+Notes

Acceptance Criteria

mchelen-gov commented 2 years ago

Intent of Platform alerting (PD and others):

In general, critical issues are any that would affect the confidentiality, integrity, or availability of VA.gov and Platform services.

little-oddball commented 2 years ago

The above from Mike has been added to the PagerDuty notes and will allow the remaining items to move forward. With that information, new issues should be created for the next steps and brought through the process accordingly.

jhouse-solvd commented 2 years ago

A recent post-mortem found that the Incident Commander (IC) role should receive mission-critical alerts from PagerDuty as soon as they happen. This means that the IC role should be on the front line of notification for service-impacting events, along with any SMEs involved in responding to them.

zachclarity commented 2 years ago

@jhouse-solvd We need to update this Alert summary document and prepare for Kubernetes (K8s) changes to issue types. Who should own this document and update it?

little-oddball commented 2 years ago

> @jhouse-solvd We need to update this Alert summary document and prepare for Kubernetes (K8s) changes to issue types. Who should own this document and update it?

I would say this is going to be a larger group effort. We'll need to look at each type or group of alert and have the TLs from the teams work together to identify an owner. From there we can refine as needed.

I see a major goal here as reviewing the alerts to determine which are still needed, which are actually critical, and which are not. As we identify the ones that are not critical, we determine their level and route them away from PagerDuty to a support channel or similar, with the appropriate severity.
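
To make the triage above concrete, here is a minimal sketch of the routing rule as described: retire alerts that are no longer needed, keep critical ones in PagerDuty, and send the rest to a support channel with their severity attached. The severity levels, the `Alert` fields, the alert names, and the destination names are all hypothetical placeholders for illustration, not the actual Platform configuration.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    # Hypothetical levels; the real set would come out of the alert review.
    CRITICAL = "critical"
    WARNING = "warning"
    INFO = "info"


@dataclass(frozen=True)
class Alert:
    name: str
    severity: Severity
    still_needed: bool = True  # outcome of the "is this still needed?" review


def route(alert: Alert) -> str:
    """Return a hypothetical destination label for an alert."""
    if not alert.still_needed:
        return "retire"
    if alert.severity is Severity.CRITICAL:
        # Critical: affects confidentiality, integrity, or availability.
        return "pagerduty"
    # Everything else leaves PagerDuty but keeps its severity label.
    return f"support-channel:{alert.severity.value}"


if __name__ == "__main__":
    for alert in (
        Alert("site-unavailable", Severity.CRITICAL),
        Alert("disk-usage-high", Severity.WARNING),
        Alert("legacy-cron-noise", Severity.INFO, still_needed=False),
    ):
        print(alert.name, "->", route(alert))
```

The point of keeping the rule this small is that each alert gets exactly one destination, so the review only has to settle two questions per alert: is it still needed, and is it actually critical.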