department-of-veterans-affairs / va.gov-team

Public resources for building on and in support of VA.gov. Visit complete Knowledge Hub:
https://depo-platform-documentation.scrollhelp.site/index.html

Hello team, we are currently facing two issues (unsure yet i... #57084

Closed platform-support-slack-integration[bot] closed 1 year ago

platform-support-slack-integration[bot] commented 1 year ago

Submitted By: edison.velez Topic: Something Else Team: DevOps Support

Hello team, we are currently facing two issues (unsure yet if related) affecting Lighthouse APIs: increased 504 responses and DNS resolution failures.

For the 504s, we started seeing a significant increase, along with 502s, after the EKS migration. Some tweaking was done to increase capacity for the EKS replicas, and since then the number of 504s has decreased significantly, particularly for openid_auth and fhir. However, we still see a good number of 504s for Lighthouse APIs on the vets-api side, like claims-v1. These sporadically trigger our healthcheck probes for these APIs.

We started seeing the DNS issues around 04/13, initially only in Sandbox, but as of yesterday we have started to see some in Prod. We have a ticket open with NEO to investigate these; however, they say they don't see any traffic drops or failures on their end. This area is a bit out of my reach in terms of understanding. I know some DNS resolution is handled by Unbound on the VSP side, but I don't have the full picture of how it all works. Would you be able to help us troubleshoot to see if we can determine what is causing these issues?
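
To put a number on the DNS failures from the client side, something like the following could be run from an affected host or pod. This is only a minimal sketch: `probe_dns` and its parameters are hypothetical names, the hostname to probe is up to whoever runs it, and it uses the system resolver via `socket.getaddrinfo`, so it does not tell us which hop (Unbound or anything upstream) is failing.

```python
import socket
import time
from typing import Callable, Dict, List


def probe_dns(hostname: str, attempts: int = 10, delay: float = 0.0,
              resolver: Callable = socket.getaddrinfo) -> Dict[str, object]:
    """Resolve `hostname` repeatedly and tally successes and failures.

    `resolver` defaults to the system resolver; it is injectable only so the
    function can be exercised without network access.
    """
    errors: List[str] = []
    latencies_ms: List[float] = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            resolver(hostname, 443)
            # Record how long each successful lookup took.
            latencies_ms.append((time.monotonic() - start) * 1000.0)
        except socket.gaierror as exc:
            # gaierror is what getaddrinfo raises when resolution fails.
            errors.append(str(exc))
        if delay:
            time.sleep(delay)
    return {
        "attempts": attempts,
        "ok": len(latencies_ms),
        "failed": len(errors),
        "errors": errors,
        "latencies_ms": latencies_ms,
    }
```

Calling `probe_dns(host, attempts=100, delay=1.0)` in both Sandbox and Prod at the same time would give a failure count per environment that can be lined up against the 504 timestamps.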

platform-support-slack-integration[bot] commented 1 year ago

Slack Thread Link: https://dsva.slack.com/archives/CBU0KDSB1/p1681915368038599

alyssagallion commented 1 year ago

Seemingly resolved, closing.