Closed: ElijahLynn closed this issue 2 years ago.
We suspected that this might have been resolved by #5491 and #5550, and that Memcache running out of memory was the issue; however, I just tested another batch of PRs and a bunch of them still have the Bad Gateway issue. Working on seeing whether there are correlating ALB logs and Apache logs on Tugboat that show the TCP connection is being made.
Update: @ndouglas and I now suspect that this is only caused by a large number of concurrent requests, as it is only reproducible when I trigger 5+ PRs on our test project on the same CI server AND on the vets-website-branch-builds. In both cases, it happens during the parallel/concurrent request steps of BOTH the /graphql request and the parallel file-asset download stage.
This happens both from Jenkins and from within Tugboat itself as it goes out of Tugboat to the real DNS that maps to the balancer and then back in.
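A quick way to exercise the concurrency theory is to fire a batch of parallel requests and tally the status codes. This is only a sketch: `BASE_URL` is a placeholder for the affected CI environment, and the localhost default just demonstrates the tally (every connection is refused there, so curl reports 000).

```shell
# Fire N concurrent requests at a URL and print one status code per line.
fire_concurrent() {
  local url="$1" n="$2"
  seq 1 "$n" | xargs -P "$n" -I{} \
    curl -s -o /dev/null --connect-timeout 2 --max-time 5 \
         -w '%{http_code}\n' --insecure "$url" || true
}

# Tally the codes; 502s appearing only under concurrency (but not in a
# serial loop) would support the load theory.
tally() { sort | uniq -c | sort -rn; }

# BASE_URL is a placeholder -- point it at the affected CI environment.
BASE_URL="${BASE_URL:-http://localhost:9/}"
fire_concurrent "$BASE_URL" 10 | tally
```

Comparing this against a serial loop of the same size (like the 500-iteration loop later in this thread) is what distinguishes a load-correlated failure from a random one.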
This is still in progress. We did see the issue happen again with load spiking above 100% CPU utilization on Tugboat: the host has 72 cores and load spiked to 140, which means things were getting queued by the Linux CPU scheduler.
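For reference, load average alone doesn't indicate scheduler queueing; it has to be compared against the core count. A minimal sketch of the arithmetic (140 load on 72 cores works out to roughly 1.9 runnable tasks per core):

```shell
# Compare the 1-minute load average to the core count; a ratio above
# 1.0 means runnable tasks are queueing for CPU time.
load_per_core() {
  local load cores
  load=$(cut -d' ' -f1 /proc/loadavg)
  cores=$(nproc)
  awk -v l="$load" -v c="$cores" \
      'BEGIN { printf "load=%.2f cores=%d per_core=%.2f\n", l, c, l/c }'
}
load_per_core
```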
This may be the root cause, and is where having Tugboat report to Prometheus > Grafana would be really useful, especially combined with Loki. But we don't have that just yet, so we will dig into this issue a bit more and look at CPU-restricting projects.
It does seem like the avg. load for a given build has increased by a bit.
Downgrading to a regular defect as it only happens intermittently, doesn't affect users, and has a workaround (re-run the failed job).
This was just reported by @timwright12 and @k80bowman in https://dsva.slack.com/archives/C0MQ281DJ/p1626895567475700.
This is occurring on Prod at an increasing rate: https://dsva.slack.com/archives/C0MQ281DJ/p1635344477055100
We discussed on scrum and are pulling into current sprint.
Recommend splitting out the monitoring AC into separate ticket, and continue triaging/investigation with current ticket
We have a CloudWatch metric (us-gov-west-1, namespace AWS/ApplicationELB by LoadBalancer, filtered to "cms prod", over the trailing 42 days) for 502s experienced by the prod CMS ALB:
It appears that these are more frequent during the work week and less frequent on weekends, and seem to happen almost always during American work hours. There are some interesting correlations with other load balancer metrics:
But of course we can't say whether any of this is causal.
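The same 502 counts can be pulled programmatically for further analysis. A sketch using `aws cloudwatch get-metric-data`; the LoadBalancer dimension value is a placeholder for the prod CMS ALB's `app/.../...` ARN suffix, and the aws call itself is left commented since it needs credentials:

```shell
# Build a get-metric-data query for hourly ALB 502 counts.
# The LoadBalancer dimension value below is a placeholder.
cat > q502.json <<'EOF'
[
  {
    "Id": "e502",
    "MetricStat": {
      "Metric": {
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "HTTPCode_ELB_502_Count",
        "Dimensions": [
          { "Name": "LoadBalancer", "Value": "app/cms-prod/PLACEHOLDER" }
        ]
      },
      "Period": 3600,
      "Stat": "Sum"
    }
  }
]
EOF

# Sanity-check the JSON, then (with credentials) fetch the last 42 days:
python3 -m json.tool q502.json > /dev/null && echo "query ok"
# aws cloudwatch get-metric-data \
#   --metric-data-queries file://q502.json \
#   --start-time "$(date -u -d '42 days ago' +%FT%TZ)" \
#   --end-time   "$(date -u +%FT%TZ)"
```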
Backing out a bit and looking at the RDS metrics, I saw comparatively few correlations. WriteIOPS etc. seemed completely unrelated. RDS connections often correlated, but in some places remained fairly flat:
I don't think that this is an issue with DBMS load. Or, after looking at some other options, EC2 load or network, Memcache load, etc. There are a lot of things that correlate with the incoming connections, and the occurrence of 502s is one of them... but the correlation is not perfect. This graph is a good example of the general trend:
Any explanation for the occurrence of these 502s has to account for these:
I am inclined to suspect that these are somewhat random and not a result of load on any system, or even particular to the CMS itself. Of course, since we are but a mere frosted flake drifting upon the milk of the Platform, that would imply that other services are experiencing 502s as well, but aren't reporting them or they're being transparently retried or something.
Hmmm...
Most or all services seem to encounter 502s; some more than others, but this might be due more to differential traffic than to any quality of the service itself.
To examine that idea a little deeper, we can look at other services and their frequency of 502s vs. their active connection count.
First, CMS Prod:
And revproxy:
And eFolder:
And Jenkins:
And Sentry:
But it would be more useful to compare the fractions of each service's requests that are 502s. So creating metrics for 502s/active connections, and plotting them on the same graph, we see:
No one service has such a high error rate that it skews the graph. RevProxy's error rate is low enough, though, that it forms merely an orange line at the bottom of the graph:
Why is its error rate so low? It has an active connection count that is a couple of orders of magnitude larger than any other service, but its error count is no larger. Its error count does not correlate with its connection count to the same degree that the other services do.
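The per-service error-rate metric described above can also be computed server-side with CloudWatch metric math, which avoids exporting and dividing the series by hand. A sketch (dimension values are placeholders):

```shell
# Metric-math query: fraction of active connections that end in a 502
# for one service. Dimension values are placeholders.
cat > ratio.json <<'EOF'
[
  { "Id": "errs", "ReturnData": false,
    "MetricStat": { "Metric": {
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "HTTPCode_ELB_502_Count",
        "Dimensions": [{ "Name": "LoadBalancer", "Value": "app/revproxy/PLACEHOLDER" }] },
      "Period": 3600, "Stat": "Sum" } },
  { "Id": "conns", "ReturnData": false,
    "MetricStat": { "Metric": {
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "ActiveConnectionCount",
        "Dimensions": [{ "Name": "LoadBalancer", "Value": "app/revproxy/PLACEHOLDER" }] },
      "Period": 3600, "Stat": "Sum" } },
  { "Id": "rate", "Expression": "errs / conns",
    "Label": "502s per active connection" }
]
EOF
python3 -m json.tool ratio.json > /dev/null && echo "ratio query ok"
```

Repeating this per service and graphing the `rate` series gives the normalized comparison without a separate custom metric per load balancer.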
@ndouglas this is some great information.
Do you feel this is a larger issue or an issue with our ELBs? Should we engage AWS?
@indytechcook The monitoring/logging is insufficient for anything deeper from me, and TBH I lack the knowledge to go much further. I feel like we need a network engineer. IMHO, definitely engage AWS, and probably rope in DevOps.
I think this is probably affecting everyone but the frequency is so low that it just doesn't cause a memorable issue.
The nice thing about CMS is that if this interrupts a content build or deploy, it aggravates a whole lot of people 😂
The Sentry thing alone is aggravating -- that's a decent amount of important debugging info that's just getting lost.
Looking at the AWS documentation for 502s on ELB, it says a 502 is returned when the server sends a response that the load balancer can't parse: https://aws.amazon.com/premiumsupport/knowledge-center/load-balancer-http-502-errors/
Yeah. I was going to dig in further but had to prepare for demo.
I poked through the logs and didn't find any cases where the CMS was returning a 502, so we can rule out one possibility. But that's only mildly useful, since I don't think we have infrastructure on BRD that would merit the CMS actually returning a 502. But, hey, at least it's eliminated.
> The web server or associated backend application servers running on EC2 instances return a message that can't be parsed by your Classic Load Balancer.
I read this as: the status code itself doesn't matter as much.
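For what it's worth, the ALB access logs themselves can separate 502s where the target never sent a response (`target_status_code` of `-`) from ones where the target responded with something the balancer couldn't use. A sketch; the sample log lines are fabricated to show the format:

```shell
# Classify 502s in ALB access logs: field 9 is elb_status_code and
# field 10 is target_status_code. A target_status_code of "-" means
# no response was ever received from the target.
classify_502s() {
  awk '$9 == 502 {
         if ($10 == "-") no_resp++; else bad_resp++
       }
       END { printf "no_target_response=%d target_responded=%d\n",
             no_resp, bad_resp }'
}

# Fabricated sample lines in the ALB access-log field order.
cat <<'EOF' | classify_502s
https 2021-10-27T14:00:01Z app/cms-prod/x 10.0.0.1:4242 10.0.1.5:443 0.001 0.002 0.003 200 200 120 512 "GET https://example/ HTTP/1.1"
https 2021-10-27T14:00:02Z app/cms-prod/x 10.0.0.1:4243 10.0.1.5:443 0.001 -1 -1 502 - 120 272 "POST https://example/graphql HTTP/1.1"
https 2021-10-27T14:00:03Z app/cms-prod/x 10.0.0.1:4244 10.0.1.5:443 0.001 0.002 -1 502 502 120 272 "GET https://example/img.jpeg HTTP/1.1"
EOF
```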
> Do you feel this is a larger issue or an issue with our ELBs? Should we engage AWS?
I do like the idea of opening a chat support ticket with AWS.
Great digging Nate, love that we can see that pattern affecting other services too!!!
Opened a support thread here with Ops: https://dsva.slack.com/archives/CBU0KDSB1/p1637194171309300
Opened a support case with AWS here: https://console.amazonaws-us-gov.com/support/home#/case/?displayId=9221808181&language=en .
@jkalexander7 @ElijahLynn I updated the AC's on the original issue to reflect the actual work that we've done. If this sounds reasonable, I'll open up a followup issue to track the interactions with AWS and other teams and consider this one closed.
Let's get the follow up issue logged and close this as part of last sprint.
Issue opened to track progress with AWS support: https://github.com/department-of-veterans-affairs/va.gov-cms/issues/7023
Our CI environments are seeing intermittent 502 bad gateway responses in the ALB logs which are causing multiple PR failures on vets-website branch build PRs and CMS PRs.
This is happening to multiple environments, multiple request types (GET and POST), and multiple MIME types (text/html and image/jpeg). The ALB docs suggest that, in the example below, the connection was NOT established to the target and/or the target did not send a response.
- `response_processing_time` = `-1`: This value is set to -1 if the load balancer can't send the request to a target. This can happen if the target closes the connection before the idle timeout or if the client sends a malformed request.
- `elb_status_code` = `502`: The status code of the response from the load balancer.
- `target_status_code` = `-`: The status code of the response from the target. This value is recorded only if a connection was established to the target and the target sent a response. Otherwise, it is set to `-`.

I did try a 500-request loop on the CI server itself with
```shell
for i in $(seq 1 500); do printf "$i: " ; curl -s -o /dev/null -w "%{http_code}\n" --insecure https://cms-vets-website-branch-builds-lo9uhqj18nwixunsjadvjsynuni7kk1u.ci.cms.va.gov/; done
```
and it was all good, same 200 response code each time.

AC: