department-of-veterans-affairs / va.gov-team

Public resources for building on and in support of VA.gov. Visit complete Knowledge Hub:
https://depo-platform-documentation.scrollhelp.site/index.html

Incident Response: Set up alerts for mission-critical infrastructure dependencies #35043

Open jhouse-solvd opened 2 years ago

jhouse-solvd commented 2 years ago

Problem Statement

Some platform systems depend on external service providers, and platform issues occur when those services go down. Without alerts for these services, it is hard to respond to mission-critical infrastructure incidents.

Background / Context

A number of platform incidents and outages have been caused by issues with upstream or downstream dependencies. There isn't good monitoring or alerting in place for these dependencies. This can result in platform incidents that are first reported by platform stakeholders (i.e., VFS teams, developers, product managers, or Veterans themselves) rather than detected by the platform team.

How might we...

...be alerted to issues with platform infrastructure dependencies?

...provide a better response to incidents that occur on the platform because of issues with external services that the platform depends on?

Hypothesis or Bet

We believe this will help platform support personnel to identify issues with mission-critical platform infrastructure dependencies. We believe this will aid in root cause analysis for platform incidents related to mission-critical platform infrastructure dependencies.
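To make the idea concrete, here is a minimal sketch of what a dependency monitor could look like. It is not the platform's actual tooling: the `Dependency` and `DependencyMonitor` names, the consecutive-failure threshold, and the `notify` callback are all hypothetical, and a real setup would more likely live in whatever monitoring product the platform already uses.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Dependency:
    """Hypothetical description of one external dependency to watch."""
    name: str
    probe: Callable[[], bool]   # returns True when the dependency looks healthy
    failure_threshold: int = 3  # consecutive failures before alerting (assumed value)


class DependencyMonitor:
    """Tracks consecutive probe failures and alerts once per outage."""

    def __init__(self, dependencies: List[Dependency], notify: Callable[[str], None]):
        self.dependencies = dependencies
        self.notify = notify  # e.g. a function that posts to a chat channel
        self.failures: Dict[str, int] = {d.name: 0 for d in dependencies}

    def run_once(self) -> List[str]:
        """Probe every dependency once; return names that triggered an alert."""
        alerted = []
        for dep in self.dependencies:
            try:
                healthy = dep.probe()
            except Exception:
                # A probe that raises is treated the same as an unhealthy result.
                healthy = False
            if healthy:
                self.failures[dep.name] = 0
                continue
            self.failures[dep.name] += 1
            # Alert only when the threshold is first reached, to avoid repeat noise.
            if self.failures[dep.name] == dep.failure_threshold:
                self.notify(
                    f"{dep.name} failed {dep.failure_threshold} consecutive checks"
                )
                alerted.append(dep.name)
        return alerted
```

Requiring several consecutive failures before alerting is one way to avoid paging on a single transient error; the right threshold and check interval would depend on each dependency.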

We will know we're done when... ("Definition of Done")

Monitors and alerts are set up for the following infrastructure dependencies:

Known Blockers/Dependencies

TBD

Projected Launch Date

TBD

Launch Checklist

Is this service / tool / feature...

... tested?

... documented?

... measurable?

When you're ready to launch...

Required Artifacts

Documentation

Testing

Measurement

jhouse-solvd commented 1 year ago

This should be evaluated in relation to #47025. It's possible much of this will be covered by that initiative.

cc: @ph-One