Problem Statement
Platform systems depend on external service providers, and platform issues occur when those services go down. Without alerts for these services, it is hard to respond to mission-critical infrastructure incidents.
Background / Context
A number of platform incidents or outages have been caused by issues with upstream or downstream dependencies. There is little monitoring or alerting in place for these dependencies. As a result, platform incidents are often first reported by platform stakeholders (i.e. VFS teams, developers, product managers, or veterans themselves) rather than detected by the platform team.
How might we...
...be alerted to issues with platform infrastructure dependencies?
...provide a better response to incidents that occur on the platform because of issues with external services that the platform depends on?
Hypothesis or Bet
We believe this will help platform support personnel to identify issues with mission-critical platform infrastructure dependencies.
We believe this will aid in root cause analysis for platform incidents related to mission-critical platform infrastructure dependencies.
We will know we're done when... ("Definition of Done")
Monitors and alerts are set up for the following infrastructure dependencies:
GitHub Actions
GitHub API quota(s)
VA Network Gateway(s), i.e. TIC
VA Network Certs
Datadog
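As a rough illustration of what two of these checks might evaluate, the sketch below flags GitHub API resources that are running low on quota and computes the days left before a certificate expires. It assumes the quota data has the shape returned by GitHub's `GET /rate_limit` endpoint (`resources.<name>.limit` and `.remaining`) and that the certificate's `notAfter` value is in the format Python's `ssl` module reports. The function names, the 20% threshold, and the sample numbers are hypothetical. Fetching the data and delivering the alert are left out, since this brief does not specify the monitoring tooling.

```python
from __future__ import annotations

import ssl
from datetime import datetime, timezone


def low_quota_resources(rate_limit_payload: dict,
                        min_fraction_remaining: float = 0.2) -> list[str]:
    """Return the GitHub API resources whose remaining quota is below the threshold.

    `rate_limit_payload` is assumed to look like the JSON body of
    GitHub's `GET /rate_limit` endpoint. The 20% default is an assumption.
    """
    low = []
    for name, quota in rate_limit_payload.get("resources", {}).items():
        limit = quota.get("limit", 0)
        if limit <= 0:
            continue  # skip resources with no usable limit
        if quota.get("remaining", 0) / limit < min_fraction_remaining:
            low.append(name)
    return sorted(low)


def days_until_cert_expiry(not_after: str, now: datetime) -> float:
    """Return the days between `now` and a certificate's notAfter timestamp.

    `not_after` uses the format reported by ssl.SSLSocket.getpeercert(),
    for example "Jan  5 09:34:43 2018 GMT".
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    return (expires_at - now.timestamp()) / 86400


if __name__ == "__main__":
    # Hypothetical sample data, not real quota readings.
    sample = {
        "resources": {
            "core": {"limit": 5000, "remaining": 400},
            "search": {"limit": 30, "remaining": 30},
        }
    }
    print("Low quota:", low_quota_resources(sample))
    print("Days left:", days_until_cert_expiry(
        "Jan  5 09:34:43 2018 GMT", datetime.now(timezone.utc)))
```

A monitor built on checks like these would alert when the first function returns a non-empty list or the second drops below an agreed number of days.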
Known Blockers/Dependencies
TBD
Projected Launch Date
TBD
Launch Checklist
Is this service / tool / feature...
... tested?
[ ] Usability test (TODO: link) has been performed to validate that new changes enable users to do what was intended and that these changes don't worsen quality elsewhere. If a usability test isn't relevant for this change, document the reason for skipping it.
[ ] ... and issues discovered in usability testing have been addressed.
Note on skipping: metrics that show the impact of before/after can be a substitute for usability testing.
[ ] End-to-end manual QA or UAT is complete, to validate there are no high-severity issues before launching
[ ] (if applicable) New functionality has thorough, automated tests running in CI/CD
[ ] (if applicable) Post to #vsp-service-design for external communication about this change (e.g. VSP Newsletter, customer-facing meetings)
... measurable?
[ ] (if applicable) This change has clearly-defined success metrics, with instrumentation of those analytics where possible, or a reason documented for skipping it.
... documented?
When you're ready to launch...
Required Artifacts
Documentation
PRODUCT_NAME: directory name used for your product documentation
Testing
Measurement
Mean time to acknowledge (MTTA) for mission-critical platform infrastructure incidents is 30 minutes or less
At least 75% of incidents are correctly identified as either originating from platform infrastructure or not
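To make the two measurements concrete, here is a minimal sketch of how they could be computed from incident records. The `Incident` fields and function names are hypothetical, and where the timestamps and classification outcomes come from is an assumption; this brief does not name a data source. The 30-minute and 75% targets come from the measurements above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Sequence


@dataclass
class Incident:
    opened_at: datetime          # when the incident began or was first reported
    acknowledged_at: datetime    # when platform support acknowledged it
    classified_correctly: bool   # hypothetical: was the infra/non-infra call right?


def mtta_minutes(incidents: Sequence[Incident]) -> float:
    """Mean time to acknowledge, in minutes, across the given incidents."""
    return mean(
        (i.acknowledged_at - i.opened_at).total_seconds() / 60 for i in incidents
    )


def classification_rate(incidents: Sequence[Incident]) -> float:
    """Fraction of incidents whose origin was identified correctly."""
    return sum(i.classified_correctly for i in incidents) / len(incidents)


def meets_targets(incidents: Sequence[Incident],
                  max_mtta_minutes: float = 30.0,
                  min_classification_rate: float = 0.75) -> bool:
    """Check both measurements against the targets stated in this brief."""
    return (mtta_minutes(incidents) <= max_mtta_minutes
            and classification_rate(incidents) >= min_classification_rate)


if __name__ == "__main__":
    # Hypothetical sample incidents, for illustration only.
    start = datetime(2024, 1, 1, 12, 0)
    sample = [
        Incident(start, start + timedelta(minutes=10), True),
        Incident(start, start + timedelta(minutes=20), True),
        Incident(start, start + timedelta(minutes=30), True),
        Incident(start, start + timedelta(minutes=40), False),
    ]
    print(mtta_minutes(sample), classification_rate(sample), meets_targets(sample))
```

Both functions expect at least one incident. A real report would also need to decide which incidents count as mission-critical before averaging.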
Release plan: Incident Response: Set up alerts for mission-critical infrastructure dependencies: Release Plan