Incident Response: Set up alerts for mission-critical infrastructure dependencies

Problem Statement

Several platform systems depend on external service providers, and outages in those services cause platform incidents. Without alerts for these services, it is hard to respond to mission-critical infrastructure incidents.

Background / Context

A number of platform incidents and outages have been caused by issues with upstream or downstream dependencies. These dependencies lack adequate monitoring and alerting. This can result in platform incidents that are first reported by platform stakeholders (i.e., VFS teams, developers, product managers, or veterans themselves).

How might we...

...be alerted to issues with platform infrastructure dependencies?
...provide a better response to incidents that occur on the platform because of issues with external services that the platform depends on?

Hypothesis or Bet

We believe that monitors and alerts on these dependencies will help platform support personnel identify issues with mission-critical platform infrastructure dependencies. We also believe they will aid root cause analysis for platform incidents related to those dependencies.

We will know we're done when... ("Definition of Done")

Monitors and alerts are set up for the following infrastructure dependencies:

GitHub Actions
GitHub API quota(s)
VA Network Gateway(s), i.e., TIC
VA Network Certs
Datadog
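
As an illustration of what one of these monitors might check, here is a minimal sketch for the GitHub API quota item. It reads the payload returned by GitHub's REST `GET /rate_limit` endpoint and lists the resources whose remaining quota has dropped below a threshold. The function names and the 20% threshold are assumptions made for this example, not agreed values, and a real monitor would more likely be configured in the team's alerting tool than in a standalone script.

```python
import json
import urllib.request

GITHUB_RATE_LIMIT_URL = "https://api.github.com/rate_limit"


def fetch_rate_limit(token):
    """Fetch the current rate-limit status for the given token (network call)."""
    request = urllib.request.Request(
        GITHUB_RATE_LIMIT_URL,
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def low_quota_resources(payload, min_fraction_remaining=0.2):
    """Return the names of resources whose remaining quota is below the threshold.

    `min_fraction_remaining` is a hypothetical alert threshold (20% by default).
    """
    low = []
    for name, quota in payload.get("resources", {}).items():
        limit = quota.get("limit", 0)
        if limit <= 0:
            # Skip resources with no quota defined to avoid dividing by zero.
            continue
        if quota.get("remaining", 0) / limit < min_fraction_remaining:
            low.append(name)
    return sorted(low)
```

A scheduled job could call `low_quota_resources(fetch_rate_limit(token))` and raise an alert whenever the returned list is non-empty.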

Known Blockers/Dependencies

TBD

Projected Launch Date

TBD

Launch Checklist

Is this service / tool / feature...

... tested?

[ ] Usability test (TODO: link) has been performed, to validate that new changes enable users to do what was intended and that these changes don't worsen quality elsewhere. If usability test isn't relevant for this change, document the reason for skipping it.
- [ ] ... and issues discovered in usability testing have been addressed.
- Note on skipping: metrics that show the impact of before/after can be a substitute for usability testing.
[ ] End-to-end manual QA or UAT is complete, to validate there are no high-severity issues before launching
[ ] (if applicable) New functionality has thorough, automated tests running in CI/CD

... documented?

[ ] New documentation is written pursuant to our documentation style guide
[ ] Product is included in the List of VSP Products
- List the existing product that this initiative fits within, or add a new product to this list.
[ ] Internal-facing: there's a Product Outline
[ ] External-facing: a User Guide on Platform Website exists for this product/feature/tool
[ ] (if applicable) Post to #vsp-service-design for external communication about this change (e.g. VSP Newsletter, customer-facing meetings)

... measurable?

[ ] (if applicable) This change has clearly-defined success metrics, with instrumentation of those analytics where possible, or a reason documented for skipping it.
- For help, see: Analytics team
[ ] This change has an accompanying VSP Initiative Release Plan.

When you're ready to launch...

[ ] Conduct a [go/no-go](https://vfs.atlassian.net/wiki/spaces/AP/pages/1670938648/Platform+Crew+Office+Hours#Go%2FNo-Go) when you're almost ready to launch.

Required Artifacts

Documentation

PRODUCT_NAME: directory name used for your product documentation
Product Outline: link to Product Outline
User Guide: link to User Guide

Testing

Usability test: link to GitHub issue, or provide reason for skipping
Manual QA: link to GitHub issue or documented results
Automated tests: link to tests, or "N/A"

Measurement

Mean time to acknowledge (MTTA) for mission-critical platform infrastructure incidents is 30 minutes or less
At least 75% of incidents are correctly identified as coming from platform infrastructure or not
Release plan: Incident Response: Set up alerts for mission-critical infrastructure dependencies: Release Plan
