department-of-veterans-affairs / va.gov-team


Engineering Spike: Failure Messaging Points #93133

Closed by matt4su 1 month ago

matt4su commented 1 month ago

Issue Description

The purpose of this engineering research effort is to identify and document the points in the Pension and Burials application process where failures occur and, for each point, the messaging that is triggered (either a UX message or an email notification), or whether no messaging is triggered at all.

Context: Murals for the silent failure work that Wayne and Tai put together:
https://app.mural.co/t/departmentofveteransaffairs9999/m/departmentofveteransaffair[…]fd80bfd6d93da645b97e5b22984b2?sender=u70c752d5ef17a49c48592418
https://app.mural.co/t/departmentofveteransaffairs9999/m/departmentofveteransaffair[…]24876a531148a12fe71c865dc040f?sender=u70c752d5ef17a49c48592418

Veteran Facing Forms team design templates: https://github.com/department-of-veterans-affairs/VA.gov-team-forms/tree/main/Design/patterns/Application%20status


Tasks

Acceptance Criteria

Notes:

May be related to #86428 and #86426

TaiWilkin commented 1 month ago

Pensions

  1. The user submits their claim and there is a problem before or during the Sidekiq job creation (due to a schema validation error, for example).
    1. User: An error is displayed in the UI.
    2. Team: A monitor sends an alert to Slack.
  2. There is an error submitting the claim to Lighthouse, and we retry the submission. If the submission fails repeatedly, the retries are eventually exhausted.
    1. User: NO messaging is sent to the user in this case.
    2. Team: A monitor sends an alert to Slack.
  3. A claim successfully reaches Lighthouse but fails within Lighthouse.
    1. User: NO messaging is sent to the user by us in this case. Open question: does Lighthouse/VBMS message users when claims fail in their system?
    2. Team: A monitor sends an alert to Slack.
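
The retry-until-exhausted behavior described in point 2 can be sketched roughly as follows. This is a minimal simulation in plain Python, not the actual Sidekiq job or vets-api code: `submit_with_retries`, `on_exhausted`, and `max_attempts` are hypothetical names, and the real retry count and backoff are not stated in this thread. It only illustrates where an exhaustion hook would fire, which is where the team alert is sent today and where user messaging is currently missing.

```python
from typing import Callable, Optional


class SubmissionError(Exception):
    """Stand-in for any error raised while submitting a claim."""


def submit_with_retries(
    submit: Callable[[], None],
    on_exhausted: Callable[[Optional[Exception]], None],
    max_attempts: int = 3,  # hypothetical; the real retry count is not given here
) -> bool:
    """Call `submit` up to `max_attempts` times.

    Returns True on the first success. If every attempt fails, calls
    `on_exhausted` once with the last error and returns False.
    """
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        try:
            submit()
            return True
        except SubmissionError as error:
            last_error = error  # remember the failure and try again
    on_exhausted(last_error)
    return False


# Usage: a submission that always fails ends up in the exhaustion hook.
alerts = []


def always_fails() -> None:
    raise SubmissionError("upstream unavailable")


ok = submit_with_retries(always_fails, on_exhausted=lambda e: alerts.append(str(e)))
```

In this sketch the exhaustion hook is the single place where both a team alert and any future user notification would be triggered.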

Burials

  1. The user submits their claim and there is a problem before or during the Sidekiq job creation (due to a schema validation error, for example).
    1. User: An error is displayed in the UI.
    2. Team: A monitor sends an alert to Slack.
  2. There is an error in the first Sidekiq job, ProcessDataJob. There are no retries enabled, so this immediately exhausts.
    1. User: NO messaging is sent to the user.
    2. Team: I see NO monitor for this in Datadog.
  3. There is an error in the second Sidekiq job, SubmitBenefitsIntakeClaim, and we retry the submission. If the submission fails repeatedly, the retries are eventually exhausted.
    1. User: NO messaging is sent to the user.
    2. Team: A monitor sends an alert to Slack.
  4. A claim successfully reaches Lighthouse but fails within Lighthouse.
    1. User: NO messaging is sent to the user by us in this case. Open question: does Lighthouse/VBMS message users when claims fail in their system?
    2. Team: A monitor sends an alert to Slack.
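
The two lists above can also be captured as data so the gaps are easy to query. The following is a rough sketch under stated assumptions: the `FailurePoint` class and the helper functions are hypothetical names introduced for illustration, the descriptions are shortened paraphrases of the items above, and the "no monitor" entry for ProcessDataJob reflects only what was observed in this comment and is pending confirmation.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FailurePoint:
    form: str              # "Pensions" or "Burials"
    description: str       # short paraphrase of the failure point
    user_messaging: bool   # does the user see or receive anything?
    team_monitor: bool     # does a monitor alert the team in Slack?


FAILURE_POINTS = [
    FailurePoint("Pensions", "Error before or during Sidekiq job creation", True, True),
    FailurePoint("Pensions", "Submission to Lighthouse retries exhausted", False, True),
    FailurePoint("Pensions", "Failure within Lighthouse", False, True),
    FailurePoint("Burials", "Error before or during Sidekiq job creation", True, True),
    # No monitor was found for this one; unconfirmed at the time of writing.
    FailurePoint("Burials", "ProcessDataJob fails (no retries)", False, False),
    FailurePoint("Burials", "SubmitBenefitsIntakeClaim retries exhausted", False, True),
    FailurePoint("Burials", "Failure within Lighthouse", False, True),
]


def silent_to_user(points: List[FailurePoint]) -> List[FailurePoint]:
    """Failure points where the user receives no messaging."""
    return [p for p in points if not p.user_messaging]


def unmonitored(points: List[FailurePoint]) -> List[FailurePoint]:
    """Failure points where neither the user nor the team is notified of a monitor alert."""
    return [p for p in points if not p.team_monitor]


for point in silent_to_user(FAILURE_POINTS):
    print(f"{point.form}: {point.description}")
```

Keeping the matrix as data makes it straightforward to re-run the same check after a monitor or notification is added.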

TaiWilkin commented 1 month ago

@wayne-weibel You recently worked with Burial monitors - can you confirm the team monitoring listed here is correct? Thank you!

aplatt-coforma commented 1 month ago

Per today's meeting with the team, here are the next steps:

  1. @matt4su will talk with Dene/Sanja about the user experience of failure emails. Since we've been able to catch and remediate almost all failures for Pension and Burials, we are concerned about sending users an email stating that their application failed to submit when the team may manually remediate it before they have a chance to log back in.
  2. @TaiWilkin will create a ticket to refactor the Burials failure point "There is an error in the first Sidekiq job, ProcessDataJob. There are no retries enabled, so this immediately exhausts" so that we don't need that monitor.