department-of-veterans-affairs / va.gov-team

Public resources for building on and in support of VA.gov. Visit complete Knowledge Hub:
https://depo-platform-documentation.scrollhelp.site/index.html
283 stars 204 forks source link

BE + Team | Profile | Direct Deposit - Faraday Issue | Complete Post-Mortem #82154

Closed mtcA6 closed 6 months ago

mtcA6 commented 6 months ago

Background

This ticket is so we can complete our postmortem for the Faraday logging issue

use platform documentation to finalize this

Issue ticket where we fixed the logging

mtcA6 commented 6 months ago

seems this was finalized closing this ticket out

Subject: RE: Compensation and Pension Information Data in Data Dog

Hi Tonya,

My apologies for the delay. We have fully mitigated the incident and processes have been put in place to prevent future occurrences. Below my signature I’ve appended the post-mortem. Please let Chante and me know if you need anything else.

Regards,

Jonathan Kamens

OIT/OCTO Information Security Lead

2024-03-22 - C&P Direct Deposit Issue

This document is a work in progress...

Summary

On 03/22/2024, we discovered that C&P direct deposit banking information was being logged to Datadog. Logs containing banking information started appearing 12/05/2023.

Impact

In Datadog, the number of logs affected between Dec 5th, 2023 and Mar 22nd, 2023 is 21,915 logs (approximately 3,500 users).

Additional concerns:

If the Faraday update is causing sensitive data to logged from other areas of vets-api. Whether banking information was sent to Sentry. These concerns were addressed and are elaborated on in the subsequent sections.

Ownership

Team: Authenticated Experience Profile on VA.gov PO: Chante Lantos-Swett Stakeholders

Authenticated Experience Profile Team:

Tom Harrison: Backend Enginner Travis Cahill: Project Manager Alex Parker: Delivery Manager Chante Lantos-Swett: Product Owner Root Cause Analysis

Upon deeper inspection, we realized that banking data only appeared in logs related to C&P update failures. For some reason, the request information was being included in the error response log.

The vets-api uses a 3rd party library called Faraday to communicate with external APIs. When an error response was returned from an API call, a Faraday error response object was generated and the data within it was sent to Datadog. Between Dec 4-5th, two PR were merged that updated the Faraday library from version 0.17.6 to version 1.10.0:

14552 14755 The error response object in Faraday 0.17.6 contained the following: status, headers, and body. Each of these fields contain data that is related to the response.

The error response object in Faraday 1.10.0 contained the following: status, headers, body, and request. The request contains the data sent in the request (the banking information to be updated). Since the whole response was being logged, the request information was being logged as well.

The commit history of the Faraday library confirms that this change was made to the raise_error class on Sept 12, 2023. Here is the relevant comment...

Include the request data by default.

Resolution

PR #16048 was opened to temporarily remove the specific line that was logging the full error response. An additional PR was opened to update the logging to exclude request data.

On Mar 27th, Ryan McNeil was notified of the issue. The reliability team opened PR #16105, which essentially removes the request data from the Faraday error response by setting an option within the Faraday middleware.

What went well

We didn't find any other instances where PII/PCI data was being logged.

Upon inspecting the Sentry logs, it was revealed that banking information fields are being filtered out.

Example Log...

{

"accountNumber": "[Filtered]",

"accountType": "[Filtered]",

"routingNumber": "[Filtered]"

}

Upon discovering the issue, our team collaborated effectively to promptly pinpoint the cause, escalate the matter, and implement corrective measures. Several team members contributed by providing assistance, suggestions, and taking decisive action. Special acknowledgment goes to Joe Niquette and Steve Albers for their valuable assistance and guidance during this process.

Timeline

03-21-2024 @ 8:32 PM ET: While troubleshooting another issue, discovered banking information was included in a log. 03-21-2024 @ 8:55 PM ET: Confirmed that it was not an isolated issue. Determined that there were XXX logs containing banking information starting on Dec 05, 2023 at 15:40:27.037 ET. 03-21-2024 @ 10:11 PM ET: Opened PR #16048 to temporarily remove the specific line that was logging the full error response. 03-21-2024 @ 11:18 PM ET: Sent an email to Datadog contact support notifying them of the issue to request that these logs be purged. 03-21-2024 @ 11:32 PM ET: Contacted Chante Lantos-Swett (PO), Travis Cahill (PM), and Alexandria Parker (DM). 03-22-2024 @ 12:52 AM ET: Datadog support replied by email. 03-22-2024 @ 8:48 AM ET: Met with Travis and Alex to review situation and discuss plan of action. At this point, we suspected the Faraday update caused the issue, but weren't sure. 03-22-2024 @ 9:00-10:00 AM ET: Alex opened a DOTS ticket to find someone with Datadog Admin rights. 03-22-2024 @ 10:45 AM ET: Discussed root cause with Steven Cummings. He locateed and confirmed the specific change to the Faraday library that was causing the request data to be logged. 03-22-2024 @ 10:52 AM ET: Met with team to confirm that the issue was related to the Faraday update. Also started checking Sentry logs for any banking data. None found. 03-22-2024 @ ~ 12:00 PM ET: PR #16048 merged. 03-22-2024 @ 1:00-2:00 PM ET: Asked J Shoemaker and A Hampton (engineers on benefits teams) to review PR #16048. 03-27-2024 @ 9AM - 12PM ET: Notified Ryan McNeil of this issue. The reliability team opened PR #16105, which removes request data from the Faraday error response. 03-27-2024 @ 2:15 PM ET: Steve Albers opened a Platform Support ticket requesting that the logs be restricted in Datadog. 03-29-2024 @ 8:13 AM ET: Logs were restricted in Datadog. Verified with queries. 04-08-2024 @ 8:18 PM ET: Datadog help support ticket closed. Lessons Learned

Whenever a major version upgrade is made to a core dependency, potential issues may arise. In this instance, the problem was somewhat elusive and challenging to detect.

Here are some things we can do to minimize the chances of something like this happening again:

Refine error logging strategy to be more precise. It's unexpected that Faraday's default approach includes the request in an error response, but it occurred. Focusing on logging specific response fields, rather than the entire response, should prevent this issue from happening again. Configure Datadog to mask specific fields when they are logged, ensuring that sensitive values are protected. We'll need to reach out to the Datadog support team to inquire about the feasibility of this. Compose a query that verifies the presence of sensitive fields and incorporate it into our dashboard. We'll need to reach out to the Datadog support team to inquire about the feasibility of this.