akka / akka-http

The Streaming-first HTTP server/module of Akka
https://doc.akka.io/docs/akka-http

TCP connection resets when CPU is limited #4164

Open istreeter opened 2 years ago

istreeter commented 2 years ago

I am finding that when an akka-http server is run with very limited CPU, we get some TCP RSTs in the period soon after clients first start sending requests. These are the conditions that cause the errors:

Most requests are successful, but for some requests I see this error message in the client logs: "connection reset by peer". The errors all come within the first minute of the clients sending requests; after that there are no more errors. There are no errors in the server logs, even with debug logging enabled.

Demo

I created this extremely simple server app to demonstrate the problem. It uses akka-http version 10.2.10 and akka version 2.6.20. I use the default akka configuration settings. The server has a single route which extracts the request and then responds with "OK".
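For readers who do not have the demo app to hand, a server matching that description might look roughly like the sketch below. It is a reconstruction from the description only, not the actual demo code: the object name `SimplestServer`, the actor system name, the bind address, and the assumption that the single route is `/ping` on port 8080 (taken from the fortio URL further down) are all illustrative.

```scala
import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.server.Directives._
import akka.http.scaladsl.server.Route

// Hypothetical reconstruction of the demo; all names here are illustrative.
object SimplestServer {
  def main(args: Array[String]): Unit = {
    // Classic actor system using the default akka configuration.
    implicit val system: ActorSystem = ActorSystem("simplest")

    // Single route: extract the request, ignore it, and reply "OK".
    // This handler never reads the request entity.
    val route: Route =
      path("ping") {
        extractRequest { _ =>
          complete("OK")
        }
      }

    // Port 8080 matches the fortio URL; the bind address is an assumption.
    Http().newServerAt("0.0.0.0", 8080).bind(route)
  }
}
```

Building this needs akka-http 10.2.10 and akka-stream 2.6.20 on the classpath, matching the versions stated above.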

First, I run it locally like this, pinning it to a single CPU:

taskset --cpu-list 0 \
  java \
  -XX:ActiveProcessorCount=1 \
  -Xmx1500m -Xms1500m \
  -Dakka.http.server.max-connections=2048 \
  -jar akka-http-simplest-assembly-0.1.0-SNAPSHOT.jar

Next, I limit the available CPU even further:

cpulimit --pid <PID> --limit 5

Then I run fortio to simulate load using 600 parallel connections, and posting a 7kb payload file:

fortio load -t 0 -qps 800 -c 600 -n 0 -timeout 60s -allow-initial-errors -payload-file ./7kb_payload http://127.0.0.1:8080/ping

Is this a bug?

I know, on the one hand I am slightly abusing this server by making it handle so many large requests with limited resources. But on the other hand, that doesn't seem like an excuse to see a RST. I would understand seeing a connection refused or a timeout error due to high load, but a RST seems more like a bug somewhere.

The reason I care about this: we see some errors in production, which I think are due to the same problem. The production setup is like this:

We find that when a new pod gets added to the service (and once it reports healthy), the load balancer routes a surge of requests very quickly to the new pod in one big hit. Some end clients then receive 502 error responses, and I believe this is because the akka server sends a RST to the load balancer.

By the way, to check if my demo is "fair" I ran the same test using a simple server written with http4s, not akka. The alternative app is over here. With the alternative app, and with identical test conditions, all HTTP responses were successful.

istreeter commented 2 years ago

Edit: Fixed the link to the demo app.

johanandren commented 2 years ago

I expect in the real thing you actually consume those request bodies? Seems like a problem with the sample.

Do you see all those 600 connections coming through and actually hitting the service, or does the GCP load balancer terminate and do fewer connections? Any particular reason you are tuning up the max connections from the default (rather than down) when the resources are limited?

istreeter commented 2 years ago

Thanks for looking at this @johanandren! I appreciate this is a tricky one to investigate.

> I expect in the real thing you actually consume those request bodies?

The akka-http framework might start to consume the request bodies, but our application code (the route handler) does not receive the requests. I know this because in testing I can match up the number of requests sent with the number of requests logged by our http service.

> Do you see all those 600 connections coming through and actually hitting the service

I believe that all 600 connections reach the service. There are a few reasons I think this:

> or does the GCP load balancer terminate and do fewer connections?

The load balancer does not do fewer connections. It just keeps sending a split of the traffic as long as the backend service is healthy.

> Any particular reason you are tuning up the max connections from the default (rather than down) when the resources are limited?

During normal operation (i.e. after the problematic first minute), the akka-http backend server can handle many hundreds of requests per second, even with the limited CPU. We find we need to increase the number of connections in order to make full use of the CPU. If we do not increase the number of connections, then much of the CPU is unused, which is wasteful. We would have to run more servers to handle the traffic, whereas we want to run as few servers as possible and maximise CPU usage on each server.

But that all refers to the second minute and onwards. It's only the first minute where we get the 502 responses.

johanandren commented 2 years ago

Thanks for the extra details. The fact that the RSTs are also reproducible locally, without a balancer in front, should at least simplify the investigation a little.