department-of-veterans-affairs / va.gov-team

Public resources for building on and in support of VA.gov. Visit complete Knowledge Hub:
https://depo-platform-documentation.scrollhelp.site/index.html

Tracking View Payments / Ch33 Performance Issue #34248

Closed RudyOnRails closed 2 years ago

RudyOnRails commented 2 years ago

Background

The purpose of this epic is to document what happened, list open questions and unknowns, and store any research or evidence that supports decisions going forward.

TL;DR

The Ch33 direct deposit endpoint and the payment information endpoint have both been down/mostly down for a similar period of time (since Wednesday, Dec 8) and are both supported by the same backend system -- BGS.

Ch33

December 1, Samara noticed an issue on prod and asked if someone could check the Profile in prod. The issue was very sporadic and could not be reproduced in lower environments or locally. Occasionally the Profile would fail to load all profile information and this alert would be displayed:

![image](https://user-images.githubusercontent.com/73354907/146441319-c4000807-b279-4eb6-bc41-54e89f26cd37.jpeg)

December 8, Lihan contacted BGS via email to inquire about DDEFT latency. We received no answers that day, but they did loop in other teams (Team 1, WebLogic Admin).

December 9, Cory Easley looped in the Tuxedo team around 7:30 AM ET. We got a response from Vehkaiah Kolla at around 10:15 AM ET saying the Tuxedo and Bull admins did not notice any slowdowns. A joint call was set up at 12:30 PM ET, and Lihan Li, Joe Niquette, Lance Sanchez, and Boris Ning spent most of the afternoon comparing logs and request/response times with BGS. During some of the calls, the forward proxy would occasionally fail to connect. We looped in the Ops team (Bill Ryan) for additional support.

Dec 10 - 13, Lance Sanchez, Boris Ning, and Mike Chelen continued to troubleshoot. Demian Ginther concluded that VA Network may need to be looped in, since Ops has no visibility into traffic once it leaves the forward proxy.

Dec 14

Dec 15

Dec 16

View Payments

On Wed, Dec 15, Jason was informed that view-payments had been lagging or down for the past few days.

Context

Affected Features (in order of appearance)

Findings

Hypotheses

| ID | Hypothesis | Support | Determination |
| --- | --- | --- | --- |
| 1 | The issue only occurs during weekdays and not on weekends | Oct 30th was a Sat (screenshot: Screen Shot 2021-10-31 at 11 06 00 PM) | False |
| 2 | The issue became significantly worse after 12/9 | Here is the issue days earlier (screenshot: Screen Shot 2021-12-17 at 12 09 36 PM) | False |
| 3 | Logging | SemanticLogger is showing signs of (at times hundreds of thousands) logs being queued up (screenshot: Screen Shot 2021-12-20 at 11 29 28 AM) | Checking results of this PR |
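
Hypothesis 1 can be checked mechanically by classifying the dates on which failures were observed. The following is a minimal sketch in Python, not the method used in this investigation. The helper names `is_weekend` and `weekend_incidents` are hypothetical, and the only date taken from this issue is Oct 30, 2021; any further dates would have to come from the monitoring data.

```python
from datetime import date


def is_weekend(d: date) -> bool:
    # date.weekday(): Monday is 0, Saturday is 5, Sunday is 6
    return d.weekday() >= 5


def weekend_incidents(dates):
    """Return the incident dates that fall on a Saturday or Sunday."""
    return [d for d in dates if is_weekend(d)]


# Oct 30, 2021 is the date cited as support for hypothesis 1.
# This list is illustrative; real dates would come from monitoring data.
observed = [date(2021, 10, 30)]
print(weekend_incidents(observed))
```

A single weekend date with failures is enough to reject the "weekdays only" hypothesis, which matches the "False" determination recorded in the table.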

Next Steps / Tasks

Results

-

Acceptance Criteria

drorva commented 2 years ago

Office hours with ops meeting 12/17

mchelen-gov commented 2 years ago

@drorva Thanks for the update! Is there a GH issue for the DD alert creation?

RudyOnRails commented 2 years ago

> @drorva Thanks for the update! Is there a GH issue for the DD alert creation?

@mchelen-gov actually there is not, good question. I am working on perfecting the monitor now, but will create an issue now.