department-of-veterans-affairs / va.gov-team

Public resources for building on and in support of VA.gov. Visit complete Knowledge Hub:
https://depo-platform-documentation.scrollhelp.site/index.html

Sentry: Platform Product team work in partnership with Mobile team #79609

Open jennb33 opened 3 months ago

jennb33 commented 3 months ago

2024 Q1 has seen an increase in activity in Sentry. Whether it's by design, a bug, or a combination of the two, it's overburdening the Sentry instance. PPT1 reviewed and discovered that some of the space issues were being caused by the Mobile team, who will be mitigating the issue.

This ticket is for PPT1 to do further discovery work, apply an ignore rule, and continue to monitor the free space until the Mobile team has a new solution in place.

Analysis of the data before January 18th showed a consistent level of Sentry RDS free space (refer to this 3-month window chart). After expanding our storage capacity to 1TB and subsequently to 2TB on March 20, the free space has been diminishing by approximately 1.5% daily.

A specific event was pinpointed as the primary contributor to this issue: "Access token JWT is malformed", "source:mobile". Although there appeared to be an initial attempt to address this around March 1, event volume rose notably again on March 12, and nearly 5 million occurrences of these events are now logged in Sentry.

At the current trajectory, Sentry's capacity is projected to be overwhelmed approximately every 30 days. For further context, here are the associated logs: Event Logs.
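As a rough sanity check on that trajectory, here is a minimal sketch of a linear projection. It assumes the ~1.5% daily figure means percentage points of total capacity and that the decline stays constant. The function name and the 45% starting value are hypothetical placeholders, not numbers from this ticket.

```typescript
// Hypothetical helper: linear projection of days until free space reaches zero.
// Assumption: the daily decline is in percentage points of total capacity.
function daysUntilFull(freePercent: number, dailyDeclinePoints: number): number {
  // With no decline (or growth in free space) the disk never fills.
  if (dailyDeclinePoints <= 0) return Infinity;
  return Math.max(0, freePercent) / dailyDeclinePoints;
}

// Placeholder input: 45% free is NOT a figure reported in this ticket.
const projectedDays = daysUntilFull(45, 1.5);
console.log(`Projected days until full: ${projectedDays}`);
```

Swapping in the current free-space percentage from the RDS chart would give a working estimate for when the next storage bump or cleanup is needed.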

According to the Mobile team, they have run into this issue a few times in the past, primarily on the Sync screen. It was usually caused by queries running before the "loggedIn" status was true. The fix has usually been to disable the offending query whenever the user isn't logged in. The last time this issue was reviewed, the engineer remembered looking at preventing all queries from running if the loggedIn status wasn't true. It wasn't changed back then since only one query was causing the problem, but because this is a recurring issue, it may be the path we need to take. Updating the stale time definitely wouldn't hurt: right now it's 5 seconds, and Mobile is recommending changing it to 90-120 seconds.
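A minimal sketch of what the two mitigations above could look like together, assuming "stale time" refers to the React Query `staleTime` option and that queries are gated with its `enabled` option. The `QueryOptions` interface, `authGatedOptions` helper, and constant are hypothetical names for illustration, not code from the mobile app.

```typescript
// Stand-in for the subset of query options used here. `enabled` and
// `staleTime` mirror the React Query option names; the rest is hypothetical.
interface QueryOptions {
  enabled: boolean; // when false, the query does not run
  staleTime: number; // milliseconds before cached data counts as stale
}

// Lower bound of the 90-120 second range recommended by Mobile.
const RECOMMENDED_STALE_TIME_MS = 90 * 1000;

// Hypothetical helper: options that keep a query from firing before login.
function authGatedOptions(
  loggedIn: boolean,
  staleTimeMs: number = RECOMMENDED_STALE_TIME_MS,
): QueryOptions {
  return { enabled: loggedIn, staleTime: staleTimeMs };
}

console.log(authGatedOptions(false)); // query disabled while logged out
console.log(authGatedOptions(true)); // query allowed once logged in
```

In a component, these options would be spread into the query call, for example `useQuery({ queryKey, queryFn, ...authGatedOptions(loggedIn) })`. Applying the helper to every authenticated query is one way to get the "prevent all queries unless loggedIn" behavior instead of patching one query at a time.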

Mobile also recommends merging the Auth React Query Migration and the HSP feature branch to mitigate the issue. If the auth RQ migration is merged first, it should cap the instances of this at 9 per user per logout (due to the sync screen API calls) until HSP is merged (which allows us to move the API calls off the sync screen to the home screen). If HSP is merged first, the auth RQ migration should then fix all the issues.

(all of this was taken from this Slack thread)

jennb33 commented 2 months ago

Aparna and Jeff need to talk to the Mobile PO to get this done, per Clint on 4/10/2024

LindseySaari commented 1 month ago

It appears that the Identity team deployed a fix that reduced the JWT errors under consideration here!

Screenshot 2024-05-29 at 12.03.44 PM.png