department-of-veterans-affairs / va.gov-cms

Editor-centered management for Veteran-centered content.
https://prod.cms.va.gov
GNU General Public License v2.0

[Zero Silent Failures SPIKE] Find a VA Form (Lighthouse API connection / migration) #19246

Open FranECross opened 1 week ago

FranECross commented 1 week ago

Description

The following feature needs to be evaluated to determine whether it meets the standards for 'zero silent failures', which apply to user-facing transactions that are submitted to a back-end system. If we identify any missing monitoring, etc. while evaluating the checklist, we will file tickets to update the implementation.

OCTODE guidance states:

Problem Statement:

Artifacts

User story

AS A I WANT SO THAT

Engineering notes / background

If you need to set up monitoring in Datadog:

Set up monitoring in Datadog

Follow this guidance on endpoint monitoring to get going. Then follow the guidance on monitoring performance to get up to speed with Datadog.

Examples

Additional examples

Analytics considerations

Quality / testing notes

Acceptance criteria

Checklist

Start

Monitoring

⚠️ Failure to have endpoint monitoring in place is a blocking QA standard at Staging review as of 9/10/24. If you answered no to any of the questions above, you will be blocked from shipping at the Staging review touchpoint in Collab Cycle.

Reporting errors

Documentation

User experience

Learn how to create a user data flow diagram

File silent errors issues in GitHub

We don't have any silent errors!

Great! Please let us know that you went through the checklist above as a team and did not find any silent failures in our Slack channel: #zero-silent-failures. You don't have to hang out in there once you have notified us. Just pop in, tell us who you are (which team and in which portfolio) and that no failures were found. Thanks!

jilladams commented 1 week ago

@humancompanion-usds question: Find Forms relies on the LH Forms API, and that API pulls from Drupal via a Sidekiq job. We are feeling muddy on whether / how that meets the criteria here. We do rely on an API that relies on Sidekiq (or another background job processor), but we do not submit to that API. They pull from us. With that in mind, I'm curious about your thoughts on who owns the onus for silent failures. Do you happen to know if LH has gotten the same mandate / if they'll be doing similar audits?

humancompanion-usds commented 1 week ago

The most common case of silent failure is when we take data from a user and submit it, and that submission fails async. I'm not quite following how you rely on an API that pulls from Find a Form. If you wouldn't mind pointing me to a data flow diagram, if you have one, that might help me catch up. There is a template in Mural if you want to create one to demonstrate the flow.

Lighthouse is, for the most part, middleware in that it takes a submission from us and passes it on to another back-end system. When there are failures in those back-end systems, Lighthouse returns those to us to manage. To their knowledge, they have no errors that they fail to notify us of.

jilladams commented 1 week ago

@humancompanion-usds Sure thing - this is the data flow: https://github.com/department-of-veterans-affairs/va.gov-cms/blob/main/READMES/migrations-forms.md

What happens:

So the proper user-facing silent failure would be limited to search in Find-a-Form, e.g., if a user submitted a search and got no results and no error, or maybe got partial results? We handle that and aren't concerned about that area having silent failures.

But: Michelle is concerned about our data integrations, and making sure that data is making it to its end destinations. We are not really sending data from Find a Form, and if Lighthouse is sure that they don't have errors where they fail to notify, then this ticket might be a no-op?

humancompanion-usds commented 1 week ago

I'd recommend ensuring that we have monitoring in place to ensure we know when the data does not make it to an end destination and that you all have a playbook for how to handle those errors when they pop up in monitoring. This aligns with the endpoint monitoring QA standard.
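One way to picture the kind of check such a monitor would make is a freshness test on the last successful sync. This is only a minimal sketch: the `is_stale` helper, the 24-hour threshold, and the idea of tracking a "last success" timestamp are assumptions for illustration, not something the thread or Datadog prescribes.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_stale(
    last_success: Optional[datetime],
    now: datetime,
    max_age: timedelta = timedelta(hours=24),  # assumed threshold, tune per data source
) -> bool:
    """Return True when the data has not reached its destination recently enough.

    A missing timestamp (no successful sync ever recorded) counts as stale,
    so that the absence of data is never silent.
    """
    if last_success is None:
        return True
    return now - last_success > max_age


if __name__ == "__main__":
    current = datetime.now(timezone.utc)
    # Hypothetical value: a sync that last succeeded 30 hours ago.
    print(is_stale(current - timedelta(hours=30), current))
```

A playbook entry would then describe who gets paged when this returns true and how to re-run the sync.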

jilladams commented 5 days ago

If we end up needing to talk to Lighthouse about the state of their monitoring, you can find Kristen Brown and Matt Kelly in the #va-forms channel.

chriskim2311 commented 2 days ago

Spoke to Lighthouse folks in va-forms slack: https://dsva.slack.com/archives/CUB5X5MGF/p1727281368978889.

Lighthouse is monitoring the Sidekiq jobs and gets alerts if the migration job fails. The job keeps retrying the migration for about an hour before its retries are exhausted. The LH team does have monitors in place and gets alerts for failures of this job through their non-VSP LH Datadog monitors.
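The retry-then-alert behavior described above can be sketched roughly as follows. This is not Lighthouse's code or Sidekiq's API: `run_with_retries`, the exponential backoff, the one-hour budget, and the `on_exhausted` callback are all hypothetical names and values chosen to illustrate the pattern.

```python
import time
from typing import Callable, Optional


def run_with_retries(
    job: Callable[[], None],
    max_elapsed_seconds: float = 3600.0,  # assumed budget of "about an hour"
    base_delay: float = 15.0,             # assumed first retry delay
    on_exhausted: Optional[Callable[[Exception, int], None]] = None,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Run job, retrying with exponential backoff until the time budget is spent.

    Returns True on success. When the next retry would exceed the budget,
    calls on_exhausted (the alerting hook) and returns False.
    """
    start = clock()
    attempt = 0
    while True:
        try:
            job()
            return True
        except Exception as exc:
            attempt += 1
            delay = base_delay * (2 ** (attempt - 1))
            if clock() - start + delay > max_elapsed_seconds:
                # Retries exhausted: surface the failure instead of dropping it.
                if on_exhausted is not None:
                    on_exhausted(exc, attempt)
                return False
            sleep(delay)
```

The point of the `on_exhausted` hook is that the final failure always reaches a monitor, which is what keeps the failure from being silent.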

In terms of the FE, failures are shown to the end user in various cases (Invalid PDF Link, Invalid PDF accessed, and onDownloadLinkClick error), and monitors will be moved to Datadog, with Sentry being deprecated, as part of this epic: https://github.com/department-of-veterans-affairs/va.gov-cms/issues/18766.

@FranECross @jilladams I don't think there are any other steps needed for this ticket, as monitoring is in place on the LH side of things. Let me know if there are any outstanding questions and I can look into them. Thanks!