[Zero silent failures SPIKE] Facilities migration push to Lighthouse

jilladams commented 2 months ago

Description

The following features needs to be evaluated to determine if it meets the standards for 'zero silent failures', which is a user-facing transaction that is submitted to the back-end system.

OCTODE guidance states:

Does your application have a user-facing transaction that is submitted to a back-end system?
- NOTE: This is not limited to online forms! Other examples can include:
- Uploads of documents and/or attachments
- Performing an action (Such as refilling a prescription or ordering supplies)
Are you using any of the listed APIs?
- Lighthouse Appeals Status
- Lighthouse Benefits Documents API
- Lighthouse Benefits Intake API / Central Mail
- Lighthouse Decision Reviews
- EVSS Document Upload
Does your application submit to an API that relies on Sidekiq (or another background job processor)?

Problem Statement:

See Canvas in Slack

Artifacts

User story

AS A I WANT SO THAT

Engineering notes / background

If you need to set up monitoring in DataDog:

Set up monitoring in Datadog

Follow this guidance on endpoint monitoring to get going. Then following the guidance on monitoring performance to get up to speed with Datadog.

Examples

Benefits Intake API Form Submission Tracking
Banana Peels (Lighthouse Appeals/Forms/Benefits Intake APIs)
Benefits Sidekiq Death Queue view - shows recently exhausted Sidekiq jobs.

Additional examples

Analytics considerations

Quality / testing notes

Acceptance criteria

Checklist

Start

[x] Do you know when your application shipped to production?
- Facility status: October 22, 2021
- Facility services: May 30, 2023
[x] Did your application use the same APIs when it shipped as it does today?
- Yes, it uses the same API but it uses a new version of the API (v1, instead of v0). Issues that arose during the transition have been resolved.

Monitoring

[ ] Do you monitor the API that you submit to via Datadog?
- No. We submit data programmatically from the CMS to Lighthouse's Facilities API, and they also pull data from the CMS. Lighthouse monitors its own API, and the CMS owns uptime monitoring of the CMS. The data migration being reviewed in this ticket runs on cron and therefore isn't really something that can be monitored in Datadog.
[ ] Does your Datadog monitoring use the appropriate tagging?
- N/A (see above)
[ ] Do errors detected by Datadog go into a Slack notifications channel?
- N/A (see above)
[ ] Does more than one person look at the Slack notifications channel containing errors on a daily basis?
- N/A (see above)
[ ] Do the team members monitoring the Slack channel have a system for acknowledging and responding to the errors that appear there?
- N/A (see above) ⚠️ Failure to have endpoint monitoring in place is a blocking QA standard at Staging review as of 9/10/24. If you answered no to any of the questions above, you will be blocked from shipping at the Staging review touchpoint in Collab Cycle.

Reporting errors

[ ] Have you filed issues for errors that are appearing in Datadog / Slack?
- N/A
[ ] Do all fatal errors thrown in your application end up visible to the end user either in the user interface or via email?
- This data migration on cron is not being monitored in the standard way and doesn't have front-end users, so this doesn't apply.

Documentation

[x] Do you have a diagram of the submission path that user data your application accepts takes to reach a system of record?
- Yes - https://github.com/department-of-veterans-affairs/va.gov-cms/blob/main/READMES/migrations-facility.md#data-flow
[ ] Do you understand how the error is handled when each system in the submission path fails, is down for maintenance, or is completely down?
- If not, then create documentation that captures how errors in each system are handled. Detail which systems retry a submission and what happens when those retries exhaust. Show this in your diagram.

RESPONSE NEEDED - #19369 will notify if calls to the API fail entirely.

[x] Has the owner of the system of record receiving the user's data indicated in writing that their system notifies or resolves 100% of fatal errors once in their custody?
- Again, we aren't dealing with user-submitted data. But for the data we are submitting, we do receive API error responses when a call fails, and 200 responses for a partial success call. We will be building an audit system to ensure 1:1 send/receipt: #19366

User experience

[x] Do you capture all of the potential points of failure and make those errors known to the user via email notification and/or through the application on VA.gov or the mobile application?
- N/A
[x] Create a user data flow diagram
- https://github.com/department-of-veterans-affairs/va.gov-cms/blob/main/READMES/migrations-facility.md#diagram-of-current-process-for-pushing-facility-data

File silent errors issues in Github

We want to know about your silent errors so that we can help you to fix them. To do this, follow the process in the Managing Errors document.

We don't have any silent errors!

Great! Please let us know that you went through the checklist above as a team and did not find any silent failures in our Slack channel: #zero-silent-failures. You don't have to hang out in there once you have notified us. Just pop in, tell us who you are (which team and in which portfolio) and that no failures were found. Thanks!

jilladams commented 2 months ago

If we identify gaps here while reviewing the checklist, we need to review with Lighthouse re: the contract / what we can expect from the API.

omahane commented 1 month ago

I have diagrammed the way we push data to Lighthouse and have highlighted the steps during which we notify the team about issues (using bold font, thicker lines, and an orange fill).

When the diagram says "Successfully received by Lighthouse," the response from Lighthouse, at present, doesn't represent any comparison of the data sent and data received but is instead a determination that any payload was received (more on this in a subsequent comment).

Suggested additional notification

One area of opportunity is the "200 but response long" condition, which could trigger a message. The item goes back in the queue, so it's not lost, but should there be a persistent issue with the TIC, which is the only time we've seen this condition, we wouldn't know about it via the current process.

Diagram of current process for pushing facility data

flowchart TD
    A(Editor updates facility or adds or updates service in CMS) --> B(CMS adds node-related payload item to queue)
    B --> C(CMS processes Post API Queue queue on cron)
    C --> D{Is the queue empty or has it reached item processing limit?}
    D -- Yes -->E(QueueProcessingCompleteEvent triggered)
    D -- No --> F(Sends a queue item to Lighthouse)
    E --> G{Does count of items at end equal count at start?}
    F --> H{Successfully received by Lighthouse?}
    G -- Yes --> I(Send error to Slack--or dblogs if no Slack)
    style I stroke-width:4px,font-weight:bold,fill:#f96;
    G -- No --> J(((End queue processing)))
    H -- Yes -->K(Item removed from the queue)
    H -- No --> L{What is the issue?}
    I --> J
    K --> D
    L -- 200 but response long -->B
    L -- 201 or 202 --> M(Send error to Slack and dblogs)
    style M stroke-width:4px,font-weight:bold,fill:#f96;
    L --> D

Additional proposed changes will be described in a subsequent comment.

omahane commented 1 month ago

After meeting with Lighthouse on 27 Sept, we propose prioritizing a scheduled audit with the possible follow-up for a real-time verification.

Scheduled audit

Lighthouse

Creates a new endpoint that can be queried by facility id to allow us to compare CMS data received (with as little “Lighthouse-ification transforming the data” [Adam’s words] as possible) to the data in the CMS database.

Facilities

Creates a cron-based scheduled audit of facilities to determine that facilities data matches the data provided by the new Lighthouse endpoint.

Concerns

This audit could take a while to run and may be resource-intensive.

Real-time verification

Lighthouse

Adds to the data push response the data received

Facilities

Updates the Post API queue code to check the response from Lighthouse and compare the data in the response to the data sent.

Concerns

This real-time verification does not help if we have not sent the data in the first place (such as when we turned on Mental Health services but had not sent them en masse to Lighthouse).

mmiddaugh commented 1 month ago

@omahane to confirm my understanding:

implementing periodic service audits would allow us to confirm that the data available in the LH API matches that in Drupal but we'd have a delay in identifying any discrepancies - our confidence would be as of the last audit
We could increase our time to awareness of a discrepancy by adding real-time verification Correct?

Is there any way to avoid scenarios like the mental health service situation? Could the push be automated?

omahane commented 1 month ago

In a way the real-time verification is not super useful. It only tells us about individual items that have been sent. And we don't have any evidence that this has really ever been the cause of any issues.

The issue with mental health is that it was never sent en masse after changing the code. What we REALLY should have done for that effort was to make one of the ACs that all mental health data is pushed and confirmed that via spot check on prod.

We did confirm with Adam that they were being received, but didn't including the bulk send in our ACs.

The push could have been automated by writing a post-deploy hook save all the mental health content. That would have been one way to do it en masse if we didn't want to do a bulk edit. That we didn't capture this expectation in the ACs, though, was the root cause in this case, an oversight of the team.

omahane commented 1 month ago

In terms of a cadence for the audit, I would want to see what the resource load is to determine what it would take in terms of time. I would say that we could ideally do it weekly, though monthly might be often enough.

omahane commented 1 month ago

I have moved the diagram over to https://github.com/department-of-veterans-affairs/va.gov-cms/blob/main/READMES/migrations-facility.md

@mmiddaugh While I've made tickets for the audit and 200 issue when TIC is blocking us, I haven't made one for the real-time verification, as I am not sure that's it's something we need, given glaring blind-spot of it:

This real-time verification does not help if we have not sent the data in the first place (such as when we turned on Mental Health services but had not sent them en masse to Lighthouse).

Plus, the audit should be doing the heavy lifting of notifying us of discrepancies.

jilladams commented 1 month ago

@Agile6MSkinner @omahane I reread comments and added some notes to the body of the ticket. In order to close this, I think we need to:

[ ] Get @omahane sign off on my edits
[ ] Get the one open question resolved

Noting that #19366 and #19369 are follow up tickets queued for next refinement.

jilladams commented 1 month ago

Fromplanning: we are ok not pursuing realtime verification. CLosing!

department-of-veterans-affairs / va.gov-cms