department-of-veterans-affairs / va.gov-team

Public resources for building on and in support of VA.gov. Visit complete Knowledge Hub:
https://depo-platform-documentation.scrollhelp.site/index.html

Decision Reviews - add custom metrics to track exhaustion failure items #94461

Closed va-albers closed 3 weeks ago

va-albers commented 1 month ago

Value Statement

As an OCTO principal with responsibilities to report metrics on application performance, I want to be aware of all cases where Benefits Portfolio systems result in a silent failure or require that a Veteran be notified of a system failure, so that we can focus resources on addressing these issues and ensure that teams are responding to silent failure cases. This is done by having a metric available in Datadog, built in a way that minimizes complexity for the implementing team.


New Feature

  1. In all cases where a Decision Reviews system operation results in a silent failure, write out a Datadog metric that captures the issue:

    StatsD.increment('silent_failure', tags: ['service:appeal-application', 'function:lighthouse upload'])
    • A silent failure is any case where a Veteran is not aware that a request they made was not completed.
    • The service tag value should be the application's name from the Datadog Service Catalog list of services.
    • The function tag value should be selected by the team and clearly represent the application function that had the silent failure. This should be the original function that failed (such as "lighthouse evidence upload" or "appeal submission"), not a failed notification step (such as "VANotify").
    • We anticipate two cases where this scenario would happen:
        • The system has a silent failure issue and hasn't been updated to notify the Veteran of the issue.
        • The system had a silent failure and the attempt to contact the Veteran to notify them of the issue failed.
  2. In all cases where a BMT system operation would have resulted in a silent failure, but that silent failure was avoided by notifying the Veteran of the issue, write out a Datadog metric that captures the avoided issue:

    StatsD.increment('silent_failure_avoided', tags: ['service:appeal-application', 'function:lighthouse upload'])
    • The service and function tags should follow the guidelines listed above.
  3. If there are cases where we cannot write these metrics, raise them with the ticket creator and product owner. An example would be a case where the team only receives notification of errors by email, or where an API used by the system does not provide a success/failure response that matches this goal.
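
For illustration only, here is a minimal Ruby sketch of how items 1 and 2 could fit together once an operation has already failed. It is not taken from vets-api. `RecordingStatsd`, `silent_failure_tags`, `record_failure_outcome`, and the `notifier` callable are hypothetical names introduced for this example. The in-memory recorder stands in for whatever StatsD client the service actually uses, and the sketch assumes the notification step returns a truthy value on success and may raise on failure.

```ruby
# Hypothetical in-memory stand-in for a StatsD client, so the sketch runs
# without a Datadog agent. A real service would use its configured client.
class RecordingStatsd
  attr_reader :calls

  def initialize
    @calls = []
  end

  def increment(metric, tags: [])
    @calls << [metric, tags]
  end
end

# Builds the two tags requested in this ticket, in "key:value" form.
def silent_failure_tags(service:, function:)
  ["service:#{service}", "function:#{function}"]
end

# Call this after the original operation has failed.
# `notifier` (hypothetical) is any callable that returns a truthy value when
# the Veteran was successfully told about the failure. Passing nil models a
# system that has not been updated to notify the Veteran at all.
def record_failure_outcome(statsd, service:, function:, notifier: nil)
  notified =
    begin
      notifier ? notifier.call : false
    rescue StandardError
      false # a failed notification attempt is still a silent failure
    end

  metric = notified ? 'silent_failure_avoided' : 'silent_failure'
  # The function tag names the original failed operation, not the notification step.
  statsd.increment(metric, tags: silent_failure_tags(service: service, function: function))
  metric
end

demo = RecordingStatsd.new
record_failure_outcome(demo, service: 'appeal-application', function: 'lighthouse upload')
record_failure_outcome(demo, service: 'appeal-application', function: 'lighthouse upload',
                             notifier: -> { true })
p demo.calls.map(&:first) # prints ["silent_failure", "silent_failure_avoided"]
```

Both sub-cases listed under item 1 (no notification exists, or the notification attempt failed) land on `silent_failure`. Only a confirmed notification produces `silent_failure_avoided`.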

Outcome, Success Measure, KPI(S), and Tracking Link

Design

Enablement team (if needed)

@va-albers

Product Owner

@amylai-va

Engineering

Out of scope

Open questions

Tasks

Definition of Done

Acceptance Criteria

pshahVA commented 1 month ago

Hello, can we please add a PO and all the appropriate labels as indicated in this mural?

comaurice commented 1 month ago

> Hello can we please add a PO and all the appropriate labels as indicated in this mural

Added

pshahVA commented 1 month ago

Thanks

shaunburdick commented 1 month ago

platform/practices/zero-silent-failures/logging-silent-failures.md

shaunburdick commented 1 month ago

Service Tag Values:

From: https://vagov.ddog-gov.com/services?query=team%3Abenefits&env=%2A&fromUser=false&lens=Ownership&sort=service&from_ts=1729106691509&start=1729189455627&end=1729193055627&to_ts=1729193091509&live=true

dfong-adh commented 1 month ago

Would something like this be good for 4142 failures?

    StatsD.increment('silent_failure', tags: { service: 'supplemental-claims', function: 'lighthouse form 4142 submission' })

dfong-adh commented 1 month ago

PR ready for review: https://github.com/department-of-veterans-affairs/vets-api/pull/19187