Submitted By: edison.velez
Topic: Something Else
Team: DevOps Support
Hello team, we are currently facing two issues (unsure yet if related) affecting Lighthouse APIs: increased 504 responses and DNS resolution failures.
For the 504s, we started seeing a significant increase, along with 502s, after the EKS migration. Capacity for the EKS replicas was increased, and since then the number of 504s has decreased significantly, particularly for openid_auth and fhir. However, we still see a good number of 504s for Lighthouse APIs on the vets-api side, such as claims-v1. These sporadically trigger our healthcheck probes for these APIs.
We started seeing the DNS issues around 04/13, primarily in Sandbox, but as of yesterday we started to see some in Prod. We have a ticket open with NEO to investigate these; however, they say they don't see any traffic drops or failures on their end. This area is a bit out of my reach in terms of understanding. I know there is some DNS resolution handled by Unbound on the VSP side, but I don't have the full picture of how it all works. Would you guys be able to help us troubleshoot this so we can determine what is causing these issues?
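In case it helps with reproducing the DNS failures, here is a minimal sketch of a probe that could be run from an affected host to count resolution failures over repeated lookups. It is only an illustration: the function name `probe_dns`, its parameters, and the hostname in the usage comment are hypothetical, and it uses whatever resolver the host is configured with, so it does not by itself tell us whether Unbound or something upstream is at fault.

```python
import socket
import time


def probe_dns(hostnames, attempts=5, delay_s=0.0, resolver=socket.getaddrinfo):
    """Resolve each hostname `attempts` times; count failures and track the slowest lookup.

    `resolver` defaults to the system resolver via socket.getaddrinfo and can be
    swapped for a stub when testing. All names here are illustrative.
    """
    results = {}
    for host in hostnames:
        failures = 0
        slowest_s = 0.0
        for _ in range(attempts):
            start = time.monotonic()
            try:
                resolver(host, 443)
            except socket.gaierror:
                # Raised by getaddrinfo when the name cannot be resolved.
                failures += 1
            slowest_s = max(slowest_s, time.monotonic() - start)
            if delay_s:
                time.sleep(delay_s)
        results[host] = {
            "attempts": attempts,
            "failures": failures,
            "slowest_s": slowest_s,
        }
    return results


# Example usage (the hostname is a placeholder, not one of our endpoints):
# print(probe_dns(["example.com"], attempts=20, delay_s=1.0))
```

Running this from both Sandbox and Prod over the same window would give a rough failure rate per hostname that we could pass along to NEO.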