department-of-veterans-affairs / va.gov-team

Public resources for building on and in support of VA.gov. Visit complete Knowledge Hub:
https://depo-platform-documentation.scrollhelp.site/index.html

Zero Silent Failures - Web Platform #95372

Open jwoodman5 opened 1 month ago

jwoodman5 commented 1 month ago

Status

Update each sprint until completed

| Date | Status | Launch Date | Notes |
| --- | --- | --- | --- |
| 11/22/2024 | In-progress | On-track | Sprint 15: The issue with deleted PagerDuty services that caused maintenance windows not to work has been addressed. |
| 11/8/2024 | In-progress | On-track (Sprint 14) | VA is addressing the orphaned jobs, and we are addressing an issue with deleted PagerDuty services causing maintenance windows not to work in Sprint 14. |
| 10/25/2024 | In-progress | On-track (Sprint 14) | List of owned/non-owned endpoints/jobs has been created. Communicating non-owned endpoints with VA to resolve. Orphaned jobs |

Problem Statement

Silent failures create problems for Veterans because nobody knows, at least within a reasonable time frame, when applications for benefits or other services on VA.gov fail to complete/submit successfully. This can impact submission deadlines and other important processes for Veterans trying to access benefits they've earned. This work is to identify and address any platform-owned jobs that could contribute to creating a silent failure.

How might we ensure that, when any job or related process (that we own) fails to complete successfully, we are made aware of it in such a way that we can take timely action to address it?

Hypothesis or Bet

If we ensure all jobs related to Veterans completing tasks on VA.gov have error handling that provides proper notification, then fewer submission failures will be missed.

If we reduce the number of missed submission failures for Veterans trying to complete tasks on VA.gov, then Veteran satisfaction with VA.gov will increase.
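
As a rough illustration of what "error handling that provides proper notification" could look like, here is a minimal Python sketch. It is not taken from the platform's codebase: `notify_on_failure`, `record_alert`, `submit_form`, and the `veteran_id` field are all hypothetical names, and the in-memory `alerts` list stands in for whatever alerting channel a real job would use. The idea is only that a failing job reports the failure before the exception propagates, so the failure is never silent.

```python
import functools
from typing import Callable, List


def notify_on_failure(notify: Callable[[str, BaseException], None]):
    """Wrap a job so any exception triggers a notification, then re-raises."""
    def decorator(job):
        @functools.wraps(job)
        def wrapper(*args, **kwargs):
            try:
                return job(*args, **kwargs)
            except Exception as exc:
                # Report the failure first, then let it propagate so that
                # normal retry or error handling still sees the exception.
                notify(job.__name__, exc)
                raise
        return wrapper
    return decorator


# Hypothetical stand-in for a real alerting channel.
alerts: List[str] = []


def record_alert(job_name: str, exc: BaseException) -> None:
    alerts.append(f"{job_name} failed: {exc}")


@notify_on_failure(record_alert)
def submit_form(payload: dict) -> str:
    # Hypothetical job: rejects payloads that lack an identifier.
    if "veteran_id" not in payload:
        raise ValueError("missing veteran_id")
    return "submitted"
```

Re-raising after the notification is a deliberate choice in this sketch: the wrapper adds visibility without swallowing the error, so it cannot itself become a source of silent failures.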

We will know we're done when... ("Definition of Done")

[ ] All platform-owned jobs with potential silent failures have been identified and documented
[ ] All platform-owned jobs with potential silent failures have been addressed to remove the risk of silent failure
[ ] Any orphaned or otherwise non-platform-owned jobs identified in this analysis have been documented
[ ] VA leadership has been provided a list of all orphaned or otherwise non-platform-owned jobs identified so they can address them

Known Blockers/Dependencies

List any blockers or dependencies for this work to be completed

Projected Launch Date

humancompanion-usds commented 3 weeks ago

@jwoodman5 - Do you have silent failures that you are remediating? "ZSF:Incident" is for epics tracking the work to respond to a specific incident of silent failures. I'm realizing that probably wasn't clear in my announcement today.

jwoodman5 commented 2 weeks ago

> @jwoodman5 - Do you have silent failures that you are remediating? "ZSF:Incident" is for epics tracking the work to respond to a specific incident of silent failures. I'm realizing that probably wasn't clear in my announcement today.

@humancompanion-usds Ahhh, I guess I misunderstood that part. I will remove. Thanks for the callout.