TCP connection resets when CPU is limited

istreeter commented 2 years ago

I am finding that when an akka http server is run with very limited CPU, we get some TCP RSTs in the period soon after clients first start sending requests. These are the conditions that cause the errors:

Akka http server runs with limited CPU.
Many clients open connections simultaneously to the server in an initial burst
Clients send fairly large http requests (I tested with 7kb)

Most requests are successful, but for some requests I see this error message in the client logs: "connection reset by peer". The errors all come within the first minute of the clients sending requests; subsequently there are no more errors. There are no errors in the server logs, even with debug enabled.
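
Since the server logs show nothing, the kernel's TCP counters are one place to check whether resets are actually being sent from a host. The following is only a sketch, assuming Linux and the standard /proc/net/snmp layout; the function name out_rsts and its optional file argument are made up for illustration. Note that in a purely local test the client and server share one host, so the counter covers both sides.

```shell
# Print the kernel count of TCP segments sent with the RST flag set (Linux only).
# Assumption: the optional path argument exists only to make the function easy
# to try against a saved copy; by default it reads /proc/net/snmp.
out_rsts() {
  awk '
    /^Tcp:/ && !header_seen {            # first Tcp: line holds the column names
      for (i = 1; i <= NF; i++) if ($i == "OutRsts") col = i
      header_seen = 1
      next
    }
    /^Tcp:/ && col { print $col; exit }  # second Tcp: line holds the values
  ' "${1:-/proc/net/snmp}"
}

# Usage: sample the counter before and after a load test, then subtract.
#   before=$(out_rsts)
#   ... run the load test ...
#   after=$(out_rsts)
#   echo "RSTs sent during the test: $((after - before))"
```

A counter that rises during the first minute and then stays flat would be consistent with the client-side "connection reset by peer" errors described above, though it does not by itself say which process triggered the resets.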

Demo

I created this extremely simple server app to demonstrate the problem. It uses akka-http version 10.2.10 and akka version 2.6.20. I use the default akka configuration settings. The server has a single route which extracts the request and then responds with "OK".

First, I run it locally like this, pinning it to a single CPU:

taskset --cpu-list 0 \
  java \
  -XX:ActiveProcessorCount=1 \
  -Xmx1500m -Xms1500m \
  -Dakka.http.server.max-connections=2048 \
  -jar akka-http-simplest-assembly-0.1.0-SNAPSHOT.jar

Next, I limit the available CPU even further:

cpulimit --pid <PID> --limit 5

Then I run fortio to simulate load using 600 parallel connections, and posting a 7kb payload file:

fortio load -t 0 -qps 800 -c 600 -n 0 -timeout 60s -allow-initial-errors -payload-file ./7kb_payload http://127.0.0.1:8080/ping
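
The fortio command above reads the request body from ./7kb_payload, whose contents are not shown in this thread. A minimal sketch for creating a file of about that size, assuming "7kb" means 7 * 1024 = 7168 bytes and that the body content does not matter to the server (both are assumptions):

```shell
# Assumption: "7kb" is taken as 7168 bytes; the original payload content is unknown.
# Fill the file with the letter a so the body is plain ASCII.
head -c 7168 /dev/zero | tr '\0' 'a' > ./7kb_payload

# Print the resulting size in bytes as a sanity check.
wc -c < ./7kb_payload
```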

Is this a bug?

I know, on the one hand I am slightly abusing this server by making it handle so many large requests with limited resources. But on the other hand, that doesn't seem like an excuse to see a RST. I would understand seeing a connection refused or a timeout error due to high load, but a RST seems more like a bug somewhere.

The reason I care about this... we see some errors in production, which I think is due to the same problem. The production setup is like this:

Akka server runs as a pod on GCP Kubernetes engine.
The cluster node machine type is n1-standard-1 (1 vCPU).
The pod requests 400 cpu shares.
GCP external load balancer in front of the kubernetes service

We find that when a new pod gets added to the service (and once it responds healthy), the load balancer routes a surge of requests very quickly to the new pod in one big hit. Some end clients then receive 502 error responses, and I believe this is because the akka server sends a RST to the load balancer.

By the way, to check if my demo is "fair" I ran the same test using a simple server written with http4s, not akka. The alternative app is over here. With the alternative app, and with identical test conditions, all http responses were successful.

istreeter commented 2 years ago

Edit: Fixed the link to the demo app.

johanandren commented 2 years ago

I expect in the real thing you actually consume those request bodies? Seems like a problem with the sample.

Do you see all those 600 connections coming through and actually hitting the service, or does the GCP load balancer terminate and do fewer connections? Any particular reason you are tuning up the max connections from the default (rather than down) when the resources are limited?

istreeter commented 2 years ago

Thanks for looking at this @johanandren! I appreciate this is a tricky one to investigate.

I expect in the real thing you actually consume those request bodies?

The akka-http framework might start to consume the request bodies, but our application code (the route handler) does not receive the requests. I know this because in testing I can match up the number of requests sent with the number of requests logged by our http service.

Do you see all those 600 connections coming through and actually hitting the service

I believe that all 600 connections reach the service. There are a few reasons I think this:

If we replace the akka server with nginx, then there are no 502s. This tells me there is not any networking problem between the load balancer and the backend service.
I see the tcp RSTs in the test I described above, which is run completely locally. In the local setup I can be sure that all connections are reaching the local server.
By definition a 502 means the load balancer received an invalid response from the upstream server. So I think that must mean a connection was at least partially established.

or does the GCP load balancer terminate and do fewer connections?

The load balancer does not do fewer connections. It just keeps sending a split of the traffic as long as the backend service is healthy.

Any particular reason you are tuning up the max connections from the default (rather than down) when the resources are limited?

During normal operation (i.e. after the problematic first minute), the akka-http backend server can handle many hundreds of requests per second, even with the limited cpu. We find we need to increase the number of connections in order to make full use of the cpu. If we do not increase the number of connections, then much of the cpu is unused, which is wasteful. We would have to run more servers to handle the traffic, whereas we want to run as few servers as possible and maximise cpu usage on each server.

But that all refers to the second minute and onwards. It's only the first minute where we get the 502 responses.

johanandren commented 2 years ago

Thanks for the extra details. That the RSTs are also reproducible locally, without a balancer in front, should at least simplify investigations a little bit.

akka / akka-http

TCP connection resets when CPU is limited #4164
