department-of-veterans-affairs / va.gov-team


Custodianship of Veteran submitted data by preventing data loss in background sidekiq jobs #75854

Open AparnaNittalaUSDS opened 7 months ago

AparnaNittalaUSDS commented 7 months ago

Problem Statement

Veterans submit a variety of data every day through the va.gov website by interacting with services such as health benefits and claims. As the provider of these services, the va.gov platform and applications are accountable for ensuring that submitted data is not lost and reaches its intended destination with its integrity preserved. Technically, platform and application teams design, implement, and run a set of background jobs on the Sidekiq framework to handle this data throughout the form submission workflows.

However, many gaps in the robustness of these jobs came to light during the recent Code Yellow initiative, which aimed to improve monitoring and observability of the platform and applications on va.gov. Several jobs have been seen in the Death Queue in Datadog because they lack retries or other failure-handling mechanisms, potentially causing loss of data submitted by Veterans. Moreover, there is no process for application and platform teams to monitor or be notified of these "job deaths" and then execute remediation actions.

Three instances of data loss due to Sidekiq jobs have already been recorded:

Hence, there is a critical need to define and implement a framework and process to design these jobs better, monitor them, bring visibility to them, maintain a checklist of remediation actions for when a job ends up in the Death Queue, and establish governance and best practices for data handling. This enables va.gov as a whole to become a better custodian of Veteran-submitted data and improves the trust and confidence of Veterans.
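To make the gap concrete, here is a minimal plain-Ruby sketch of the two behaviours described above: bounded retries, and a hook that records and announces a job once its retries are exhausted. It is not Sidekiq code and not how any va.gov job is currently written. `RetryingRunner`, `DeadJob`, and the job name are hypothetical stand-ins. Sidekiq has its own retry settings and a `sidekiq_retries_exhausted` hook that a real job would use instead, and it also waits between retries, which this sketch omits.

```ruby
# Plain-Ruby stand-in for a background job runner; NOT Sidekiq itself.
# All class and method names here are hypothetical.
DeadJob = Struct.new(:job_name, :payload, :error_message, :attempts)

class RetryingRunner
  attr_reader :dead_jobs, :notifications

  def initialize(max_attempts: 3)
    @max_attempts = max_attempts
    @dead_jobs = []      # stand-in for the Death Queue
    @notifications = []  # stand-in for an alert sent to the owning team
  end

  # Runs the block with the payload, retrying on StandardError.
  # Returns true on success, false once retries are exhausted.
  def run(job_name, payload)
    attempts = 0
    begin
      attempts += 1
      yield payload
      true
    rescue StandardError => e
      retry if attempts < @max_attempts
      on_exhausted(DeadJob.new(job_name, payload, e.message, attempts))
      false
    end
  end

  private

  # Keeps the payload so the submission can be replayed, and records
  # a notification so the failure does not go unnoticed.
  def on_exhausted(dead_job)
    @dead_jobs << dead_job
    @notifications << "#{dead_job.job_name} died after " \
                      "#{dead_job.attempts} attempts: #{dead_job.error_message}"
  end
end

# Usage: a job that always fails ends up recorded, with its payload intact.
runner = RetryingRunner.new(max_attempts: 3)
runner.run("ExampleSubmitJob", { form_id: "example" }) do |_payload|
  raise "downstream unavailable"
end
puts runner.notifications
```

The point of the sketch is that the payload travels with the dead job record, so a remediation step has something to replay rather than only an error message.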

Hypothesis or Bet

This initiative will bring accountability and rigor to the way we design background jobs that handle Veteran-submitted data, so that no critical data is lost and any failures are immediately recognized and rectified through a clear, streamlined process.

User Impact

Loss of data potentially impacts all Veterans who have submitted data through va.gov via a form-submission workflow. Downstream systems that depend on va.gov to feed them data are also impacted: if data doesn't reach them, they are blocked from initiating workflows that depend on it.

Where was this problem reported?

This problem was one of the discovery outputs of the Code Yellow Initiative.

What do we not know about the problem space?

What (if any) research or discovery has been done?

Discussions and reviews with teams have been initiated by the Platform Engineering Lead (Jeff) and the VFS Application Engineering Lead (Steve Albers). Assessment notes to be included here:

What are the acceptance criteria?

- [ ] Establish a formal definition of data criticality
- [ ] A Governance Model / Best Practices document is available in Confluence for teams to adhere to
- [ ] Action items, strategy, and timeline commitment from VFS application teams on how they will fix issues and enhance their Sidekiq jobs
- [ ] Appropriate error handling and fail-over mechanisms are in place for the Sidekiq jobs, starting with critical and high priority
- [ ] Every background job has a data retention strategy for the data submitted by Veterans (verification that data is handed off to the right destination, and post-confirmation clean-up)
- [ ] Review of the Sidekiq jobs is integrated into the Collab cycle
- [ ] Adhere to tagging guidelines for the Sidekiq job monitors as established with ECC and Datadog in the canvas. Slack thread here
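As one possible reading of the data retention criterion above (verify the hand-off, then clean up), the following plain-Ruby sketch deletes the local copy of a submission only after the destination returns a confirmation. `SubmissionStore`, `hand_off`, and the confirmation value are hypothetical names chosen for illustration. They do not describe an existing va.gov component or a Sidekiq API.

```ruby
# Hypothetical in-memory store of submissions awaiting hand-off.
class SubmissionStore
  def initialize
    @records = {}
  end

  def save(id, data)
    @records[id] = data
  end

  def fetch(id)
    @records.fetch(id)
  end

  def delete(id)
    @records.delete(id)
  end

  def stored?(id)
    @records.key?(id)
  end
end

# Sends the submission and deletes the local copy only after the
# destination confirms receipt. `deliver` is any callable that returns
# a confirmation value on success, or returns nil / raises on failure.
def hand_off(store, id, deliver)
  data = store.fetch(id)
  confirmation = deliver.call(data) # an exception here leaves the data stored
  return nil if confirmation.nil?   # unconfirmed: keep the data for a retry

  store.delete(id)                  # post-confirmation clean-up
  confirmation
end

# Usage: an unconfirmed delivery leaves the submission in place.
store = SubmissionStore.new
store.save("submission-1", { form: "example" })
hand_off(store, "submission-1", ->(_data) { nil })
puts store.stored?("submission-1")
```

Ordering is the design choice that matters here: the delete happens strictly after confirmation, so a failed or ambiguous delivery leaves the data available for a retry or a manual replay.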

How should we measure success?

- [ ] No background job that deals with Veteran-submitted data (through forms) shows up in the Datadog Death Queue
- [ ] All jobs that end up in the Death Queue are fixed with appropriate fail-over and retry mechanisms and rectified within xx days

TODOs

EricaRobbins commented 2 weeks ago

@JeffKeeneVAGov @humancompanion-usds Discussion (Jeff / Matt) - determine who will own which part of this work.