department-of-veterans-affairs / va.gov-cms

Editor-centered management for Veteran-centered content.
https://prod.cms.va.gov
GNU General Public License v2.0

Investigate Intermittent 502 Bad Gateway responses on CI #5552

Closed ElijahLynn closed 2 years ago

ElijahLynn commented 3 years ago

Our CI environments are seeing intermittent 502 Bad Gateway responses in the ALB logs, which are causing multiple PR failures on vets-website branch build PRs and CMS PRs.

This is happening across multiple environments, multiple request methods (GET and POST), and multiple MIME types (text/html and image/jpeg). The ALB docs suggest that, in the example below, the connection was NOT established to the target and/or the target did not send a response.

- response_processing_time = -1: This value is set to -1 if the load balancer can't send the request to a target. This can happen if the target closes the connection before the idle timeout or if the client sends a malformed request.
- elb_status_code = 502: The status code of the response from the load balancer.
- target_status_code = -: The status code of the response from the target. This value is recorded only if a connection was established to the target and the target sent a response. Otherwise, it is set to -.

40602:https 2021-06-08T22:34:21.038282Z app/dsva-vagov-utility-tools/30a7782e04179703 10.247.142.71:36382 172.31.1.159:443 0.000 0.149 -1 502 - 1470 272 "POST https://cms-vets-website-branch-builds-lo9uhqj18nwixunsjadvjsynuni7kk1u.ci.cms.va.gov:443/graphql HTTP/1.1" "node-fetch/1.0 (+https://github.com/bitinn/node-fetch)" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws-us-gov:elasticloadbalancing:us-gov-west-1:008577686731:targetgroup/dsva-vagov-utility-tugboat-https/3041ecf209fb8c7e "Root=1-60bff06c-0a2c04986b2e40940f670ba3" "cms-vets-website-branch-builds-lo9uhqj18nwixunsjadvjsynuni7kk1u.ci.cms.va.gov" "session-reused" 3 2021-06-08T22:34:20.888000Z "forward" "-" "-" "172.31.1.159:443" "-" "-" "-"

I did try a 500-request loop on the CI server itself with `for i in $(seq 1 500); do printf "$i: "; curl -s -o /dev/null -w "%{http_code}\n" --insecure https://cms-vets-website-branch-builds-lo9uhqj18nwixunsjadvjsynuni7kk1u.ci.cms.va.gov/; done` and every request came back with the same 200 response code.
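For what it's worth, LB-generated 502s like the ALB log entry above (elb_status_code of 502 with no target status recorded) can be tallied in a downloaded access log with something like the sketch below; the field positions assume the standard ALB access log format, and the local file name is hypothetical.

```bash
# Count entries where the ALB returned a 502 but recorded no target response
# (elb_status_code is field 9, target_status_code is field 10).
awk '$9 == "502" && $10 == "-" { n++ } END { print n+0, "LB-generated 502s" }' alb-access.log
```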

AC:

ElijahLynn commented 3 years ago

We suspected that this might have been resolved by #5491 and #5550 and that Memcache running out of memory was the issue; however, I just tested another batch of PRs and a bunch of them still have the Bad Gateway issue. Working on seeing whether there are correlating ALB logs and Apache logs on Tugboat that show the TCP connection is being made.

ElijahLynn commented 3 years ago

Update: @ndouglas and I now suspect that this is only caused by a large number of concurrent requests, as it is only reproducible when I trigger 5+ PRs on our test project on the same CI server AND the vets-website branch builds. In both cases, the failures happen during the parallel/concurrent request steps: both the /graphql requests and the parallel file-asset download stage.

This happens both from Jenkins and from within Tugboat itself, since requests go out of Tugboat to the real DNS name that maps to the load balancer and then back in.
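A rough way to reproduce this outside of full PR builds might be a concurrent version of the earlier curl loop. This is only a sketch: the request count and parallelism are arbitrary, and the hostname is the one from the sequential test above.

```bash
# Fire 100 requests with 20 in flight at a time and tally the status codes.
seq 1 100 \
  | xargs -P 20 -I{} curl -s -o /dev/null -w "%{http_code}\n" --insecure \
      https://cms-vets-website-branch-builds-lo9uhqj18nwixunsjadvjsynuni7kk1u.ci.cms.va.gov/ \
  | sort | uniq -c
```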

ElijahLynn commented 3 years ago

This is still in progress. We did see the issue happen again while load on Tugboat spiked above 100% CPU: the host has 72 cores and the load average spiked to 140, which means processes were being queued by the Linux CPU scheduler.
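As a quick gut check on the scheduler-queueing theory, the load average can be compared against the core count directly on the Tugboat host; this sketch only uses standard Linux interfaces, nothing Tugboat-specific.

```bash
# 1-minute load average divided by core count; a ratio above 1.0 means
# runnable processes are queueing behind the CPU scheduler
# (e.g. 140 / 72 comes out to roughly 1.9 during the spike described above).
cores=$(nproc)
load=$(cut -d ' ' -f1 /proc/loadavg)
awk -v l="$load" -v c="$cores" 'BEGIN { printf "load=%s cores=%s ratio=%.2f\n", l, c, l / c }'
```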

This may be the root cause, and it's where having Tugboat report to Prometheus and Grafana would be really useful, especially combined with Loki. We don't have that just yet, so we'll dig into this issue a bit more and look into restricting CPU per project.

It does seem like the average load for a given build has increased a bit.

cmaeng commented 3 years ago

Downgrading to a regular defect, as it only happens intermittently, doesn't affect users, and has a workaround (re-run the failed job).

ElijahLynn commented 3 years ago

This was just reported by @timwright12 and @k80bowman in https://dsva.slack.com/archives/C0MQ281DJ/p1626895567475700.

indytechcook commented 2 years ago

This is occurring on Prod at an increasing rate: https://dsva.slack.com/archives/C0MQ281DJ/p1635344477055100

ElijahLynn commented 2 years ago

> This is occurring on Prod at an increasing rate: https://dsva.slack.com/archives/C0MQ281DJ/p1635344477055100

We discussed on scrum and are pulling into current sprint.

jkalexander7 commented 2 years ago

Recommend splitting the monitoring AC out into a separate ticket and continuing triage/investigation in this ticket.

ndouglas commented 2 years ago

We have a CloudWatch metric (console search in us-gov-west-1 over the last 42 days for {AWS/ApplicationELB, LoadBalancer} cms prod) for 502s experienced by the prod CMS ALB:

Screen Shot 2021-11-03 at 10 06 33 AM
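Roughly the same series can be pulled without console access via the AWS CLI. This is a sketch rather than how the chart above was produced, and the LoadBalancer dimension value is a placeholder for the real prod CMS ALB.

```bash
# Daily sums of ALB-generated 502s over roughly the same 42-day window.
aws cloudwatch get-metric-statistics \
  --namespace AWS/ApplicationELB \
  --metric-name HTTPCode_ELB_502_Count \
  --dimensions Name=LoadBalancer,Value=app/EXAMPLE-cms-prod/0123456789abcdef \
  --start-time 2021-09-22T00:00:00Z \
  --end-time 2021-11-03T00:00:00Z \
  --period 86400 \
  --statistics Sum
```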

It appears that these are more frequent during the work week and less frequent on weekends, and seem to happen almost always during American work hours. There are some interesting correlations with other load balancer metrics:

Screen Shot 2021-11-03 at 10 26 12 AM Screen Shot 2021-11-03 at 10 25 49 AM Screen Shot 2021-11-03 at 10 25 27 AM Screen Shot 2021-11-03 at 10 24 55 AM

But of course we can't say whether any of this is causal.

Backing out a bit and looking at the RDS metrics, I saw comparatively few correlations. WriteIOPS etc. seemed completely unrelated. RDS connections often correlated, but in some places remained fairly flat:

Screen Shot 2021-11-03 at 10 32 18 AM

I don't think this is an issue with DBMS load, or, after looking at some other options, with EC2 load, network, Memcache load, etc. There are a lot of things that correlate with the incoming connections, and the occurrence of 502s is one of them... but the correlation is not perfect. This graph is a good example of the general trend:

Screen Shot 2021-11-03 at 10 54 33 AM

Any explanation for the occurrence of these 502s has to account for these:

Screen Shot 2021-11-03 at 10 57 11 AM Screen Shot 2021-11-03 at 10 57 25 AM Screen Shot 2021-11-03 at 10 57 33 AM

I am inclined to suspect that these are somewhat random and not a result of load on any system, or even particular to the CMS itself. Of course, since we are but a mere frosted flake drifting upon the milk of the Platform, that would imply that other services are experiencing 502s as well, but aren't reporting them or they're being transparently retried or something.

Screen Shot 2021-11-03 at 11 13 39 AM

Hmmm...

Screen Shot 2021-11-04 at 7 17 21 AM

Hmmm...

Screen Shot 2021-11-04 at 7 26 16 AM

Most or all services seem to encounter 502s; some more than others, but this might be due more to differences in traffic than to a quality of the service itself.

To examine that idea a little deeper, we can look at other services and their frequency of 502s vs. their active connection count.

First, CMS Prod:

Screen Shot 2021-11-04 at 7 34 15 AM

And revproxy:

Screen Shot 2021-11-04 at 7 35 55 AM

And eFolder:

Screen Shot 2021-11-04 at 7 40 51 AM

And Jenkins:

Screen Shot 2021-11-04 at 7 41 46 AM

And Sentry:

Screen Shot 2021-11-04 at 7 42 46 AM

But it would be more useful to compare the fraction of each service's requests that are 502s. So, creating metrics for 502s / active connections and plotting them on the same graph, we see:

Screen Shot 2021-11-04 at 7 50 50 AM
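For reference, the per-service ratio described above can also be expressed as CloudWatch metric math. The sketch below uses the CLI with a placeholder LoadBalancer dimension; the same trio of queries would be repeated with different Ids for each service being compared.

```bash
# Hourly 502 count divided by hourly active connection count for one ALB.
aws cloudwatch get-metric-data \
  --start-time 2021-10-01T00:00:00Z \
  --end-time 2021-11-04T00:00:00Z \
  --metric-data-queries '[
    {"Id": "e502", "ReturnData": false,
     "MetricStat": {"Period": 3600, "Stat": "Sum",
       "Metric": {"Namespace": "AWS/ApplicationELB", "MetricName": "HTTPCode_ELB_502_Count",
                  "Dimensions": [{"Name": "LoadBalancer", "Value": "app/EXAMPLE-cms-prod/0123456789abcdef"}]}}},
    {"Id": "conns", "ReturnData": false,
     "MetricStat": {"Period": 3600, "Stat": "Sum",
       "Metric": {"Namespace": "AWS/ApplicationELB", "MetricName": "ActiveConnectionCount",
                  "Dimensions": [{"Name": "LoadBalancer", "Value": "app/EXAMPLE-cms-prod/0123456789abcdef"}]}}},
    {"Id": "rate", "Expression": "e502 / conns", "Label": "502s per active connection"}
  ]'
```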

No one service has such a high error rate that it skews the graph. RevProxy's error rate is low enough, though, that it forms merely an orange line at the bottom of the graph:

Screen Shot 2021-11-04 at 7 55 07 AM

Why is its error rate so low? It has an active connection count that is a couple of orders of magnitude larger than any other service, but its error count is no larger. Its error count does not correlate with its connection count to the same degree that the other services do.

indytechcook commented 2 years ago

@ndouglas this is some great information.

Do you feel this is a larger issue or an issue with our ELBs? Should we engage AWS?

ndouglas commented 2 years ago

@indytechcook The monitoring/logging is insufficient for anything deeper from me, and I TBH lack the knowledge to go much further. I feel like we need a network engineer. IMHO def. engage AWS, probably rope in DevOps.

I think this is probably affecting everyone but the frequency is so low that it just doesn't cause a memorable issue.

The nice thing about CMS is that if this interrupts a content build or deploy, it aggravates a whole lot of people 😂

The Sentry thing alone is aggravating -- that's a decent amount of important debugging info that's just getting lost.

indytechcook commented 2 years ago

Looking at the AWS documentation for 502s on the ELB, it says it's an error response from the server that the load balancer can't parse, right? https://aws.amazon.com/premiumsupport/knowledge-center/load-balancer-http-502-errors/

ndouglas commented 2 years ago

Yeah. I was going to dig in further but had to prepare for demo.

I poked through the logs and didn't find any cases where the CMS was returning a 502, so we can rule out one possibility. But that's only mildly useful, since I don't think we have infrastructure on BRD that would merit the CMS actually returning a 502. But, hey, at least it's eliminated.
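That kind of check looks roughly like the following; this is a sketch rather than exactly what was run, and the Apache log path on the CMS hosts is an assumption.

```bash
# In the Apache combined log format the status code is field 9; any 502s
# emitted by Apache/Drupal itself (rather than by the ALB) would show up here.
awk '$9 == "502"' /var/log/httpd/access_log* | head
```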

indytechcook commented 2 years ago

> The web server or associated backend application servers running on EC2 instances return a message that can't be parsed by your Classic Load Balancer.

I read this as the status doesn't matter as much.

ElijahLynn commented 2 years ago

> Do you feel this is a larger issue or an issue with our ELBs? Should we engage AWS?

I do like the idea of opening a chat support ticket with AWS.

ElijahLynn commented 2 years ago

> First, CMS Prod:
>
> Screen Shot 2021-11-04 at 7 34 15 AM
>
> And revproxy:
>
> Screen Shot 2021-11-04 at 7 35 55 AM
>
> And eFolder:
>
> Screen Shot 2021-11-04 at 7 40 51 AM
>
> And Jenkins:
>
> Screen Shot 2021-11-04 at 7 41 46 AM
>
> And Sentry:
>
> Screen Shot 2021-11-04 at 7 42 46 AM

Great digging Nate, love that we can see that pattern affecting other services too!!!

ElijahLynn commented 2 years ago

Opened a support thread here with Ops: https://dsva.slack.com/archives/CBU0KDSB1/p1637194171309300

ndouglas commented 2 years ago

Opened a support case with AWS here: https://console.amazonaws-us-gov.com/support/home#/case/?displayId=9221808181&language=en .

ndouglas commented 2 years ago

@jkalexander7 @ElijahLynn I updated the ACs on the original issue to reflect the actual work that we've done. If this sounds reasonable, I'll open a follow-up issue to track the interactions with AWS and other teams and consider this one closed.

indytechcook commented 2 years ago

Let's get the follow-up issue logged and close this one as part of last sprint.

indytechcook commented 2 years ago

Issue opened to track progress with AWS support: https://github.com/department-of-veterans-affairs/va.gov-cms/issues/7023