Open FranECross opened 1 week ago
@humancompanion-usds question: Find Forms relies on the LH Forms API, and that API pulls from Drupal via a Sidekiq job. We are feeling muddy on whether / how that meets the criteria here. We do rely on an API that relies on Sidekiq (or another background job processor), but we do not submit to that API. They pull from us. With that in mind, I'm curious about your thoughts on who owns the onus for silent failures. Do you happen to know if LH has gotten the same mandate / if they'll be doing similar audits?
The most common case of silent failure is when we take data from a user and submit it, and that submission fails async. I'm not quite following you: you rely on an API that pulls from Find a Form? If you wouldn't mind pointing me to a data flow diagram, if you have one, that might help me catch up. There is a template in Mural if you want to create one to demonstrate the flow.
Lighthouse is, for the most part, middleware in that it takes a submission from us and passes it onto another back-end system. When there are failures in those back-end systems Lighthouse returns those to us to manage. To their knowledge they have no errors that they fail to notify us of.
@humancompanion-usds Sure thing - this is the data flow: https://github.com/department-of-veterans-affairs/va.gov-cms/blob/main/READMES/migrations-forms.md
What happens:
So the proper user-facing silent failure would be limited to search in Find a Form, e.g., if a user submitted a search and got no results and no error, or maybe got partial results. We handle that and aren't concerned about that area having silent failures.
But: Michelle is concerned about our data integrations, and making sure that data is making it to its end destinations. We are not really sending data from Find a Form, and if Lighthouse is sure that they don't have errors where they fail to notify, then this ticket might be a no-op?
I'd recommend ensuring that we have monitoring in place to ensure we know when the data does not make it to an end destination and that you all have a playbook for how to handle those errors when they pop up in monitoring. This aligns with the endpoint monitoring QA standard.
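To make the monitoring recommendation above more concrete, here is a minimal sketch of a delivery check. It is only an illustration under stated assumptions: the function name `find_undelivered`, the dictionary inputs, and the one-hour lag threshold are all hypothetical and do not come from the CMS, Lighthouse, or Datadog. The idea is to compare when each form last changed at the source with when the destination last received it, and to flag anything that has been waiting longer than an agreed lag.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def find_undelivered(source_updated: Dict[str, datetime],
                     destination_updated: Dict[str, datetime],
                     now: datetime,
                     max_lag: timedelta) -> List[str]:
    """Return IDs of forms whose latest source change has not reached
    the destination within max_lag. All names here are hypothetical."""
    undelivered = []
    for form_id, changed_at in source_updated.items():
        delivered_at = destination_updated.get(form_id)
        # The destination is caught up if it holds a copy at least as
        # new as the latest change at the source.
        caught_up = delivered_at is not None and delivered_at >= changed_at
        # Only flag items that have been waiting longer than max_lag,
        # so a change made moments ago does not raise an alert.
        if not caught_up and now - changed_at > max_lag:
            undelivered.append(form_id)
    return sorted(undelivered)
```

A scheduled check could call this and raise an alert whenever the returned list is non-empty; the playbook would then describe what to do for each flagged form.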
If we end up needing to talk to Lighthouse about the state of their monitoring, you can find Kristen Brown and Matt Kelly in the #va-forms channel.
Spoke to Lighthouse folks in va-forms slack: https://dsva.slack.com/archives/CUB5X5MGF/p1727281368978889.
Lighthouse monitors the Sidekiq jobs and gets alerts if the migration job fails. The job keeps retrying the migration for about an hour before its retries are exhausted. The LH team gets those failure alerts through their non-VSP LH Datadog monitors.
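For anyone unfamiliar with the retry-then-alert pattern described above, here is a rough sketch of the idea. This is not Lighthouse's code and does not use Sidekiq's API (Sidekiq has its own built-in retry handling); the function `run_with_retries`, its parameters, and the delay schedule are hypothetical and exist only to illustrate "retry for a while, then alert once retries are exhausted".

```python
from typing import Callable, List, Optional


def run_with_retries(job: Callable[[], None],
                     delays_seconds: List[float],
                     on_exhausted: Callable[[Exception], None],
                     sleep: Callable[[float], None]) -> bool:
    """Run job once, then retry after each delay in delays_seconds.

    Returns True as soon as the job succeeds. If every attempt fails,
    calls on_exhausted with the last error (the alerting hook) and
    returns False. All names here are hypothetical.
    """
    last_error: Optional[Exception] = None
    for attempt in range(len(delays_seconds) + 1):
        try:
            job()
            return True
        except Exception as error:
            last_error = error
            # Wait before the next attempt, if any attempts remain.
            if attempt < len(delays_seconds):
                sleep(delays_seconds[attempt])
    # Retries are exhausted: surface the failure instead of dropping it.
    assert last_error is not None
    on_exhausted(last_error)
    return False
```

In real use `sleep` would be `time.sleep` and `on_exhausted` would notify whatever monitoring is in place; passing them in as parameters keeps the sketch easy to exercise without waiting.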
On the FE side, errors are shown to the end user for several failure cases (Invalid PDF Link, Invalid PDF accessed, and onDownloadLinkClick error). The monitors will be moved to Datadog, with Sentry being deprecated, as part of this epic: https://github.com/department-of-veterans-affairs/va.gov-cms/issues/18766.
@FranECross @jilladams I don't think there are any other steps needed for this ticket, as monitoring is in place on the LH side of things. Let me know if there are any outstanding questions and I can look into them. Thanks!
Description
The following feature, which is a user-facing transaction that is submitted to the back-end system, needs to be evaluated to determine whether it meets the standards for 'zero silent failures'. If we identify any missing monitoring, etc. from evaluating the checklist, we will file tickets to update the implementation.
OCTODE guidance states:
Problem Statement:
Artifacts
User story
AS A I WANT SO THAT
Engineering notes / background
If you need to set up monitoring in Datadog:
Set up monitoring in Datadog
Follow this guidance on endpoint monitoring to get going. Then follow the guidance on monitoring performance to get up to speed with Datadog.
Examples
Additional examples
Analytics considerations
Quality / testing notes
Acceptance criteria
Checklist
Start
Monitoring
⚠️ Failure to have endpoint monitoring in place is a blocking QA standard at Staging review as of 9/10/24. If you answered no to any of the questions above, you will be blocked from shipping at the Staging review touchpoint in Collab Cycle.
Reporting errors
Documentation
User experience
[ ] Do you capture all of the potential points of failure and make those errors known to the user via email notification and/or through the application on VA.gov or the mobile application?
[ ] Create a user data flow diagram
Learn how to create a user data flow diagram
File silent error issues in GitHub
We don't have any silent errors!
Great! Please let us know that you went through the checklist above as a team and did not find any silent failures in our Slack channel: #zero-silent-failures. You don't have to hang out in there once you have notified us. Just pop in, tell us who you are (which team and in which portfolio) and that no failures were found. Thanks!