cloudfoundry / gorouter

CF Router
Apache License 2.0

Gorouter should retry within limited period of time #181

Closed: giner closed this issue 5 years ago

giner commented 7 years ago

Thanks for submitting an issue to gorouter. We are always trying to improve! To help us, please fill out the following template.

Issue

A single misbehaving or overloaded application can exhaust gorouter

Context

cf-release 250

Steps to Reproduce

  1. Deploy a single-instance app (a "go hello world" app will do)
  2. Run a load test (I used Vegeta) against the app at a high enough RPS that "502" errors start to appear in the application logs
  3. Watch what happens to gorouter (number of open connections, load average)
  4. Stop the load test and keep watching the application logs

Expected result

Current result

After the load test stopped, the application continued to receive requests from gorouter for quite a long time (a few minutes).

Possible Fix

Right now gorouter retries 3 times if a connection is not accepted by an application instance. The problem is that the retries are not limited in time and, depending on the behavior of the application, they can take tens of seconds or even minutes. One possible fix would be to implement MaxRetriesTimeout in addition to MaxRetries. Please read the comments.
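
A minimal sketch of what such a MaxRetriesTimeout could look like, assuming a hypothetical dial helper; none of these names come from gorouter's code, and this is not its actual retry logic:

```go
package sketch

import (
	"fmt"
	"net"
	"time"
)

// dialWithRetryBudget retries failed dials up to maxRetries times, but never
// past an overall wall-clock budget (the proposed MaxRetriesTimeout). Both
// parameter names are hypothetical; gorouter only has a retry count today.
func dialWithRetryBudget(addr string, maxRetries int, maxRetriesTimeout time.Duration) (net.Conn, error) {
	deadline := time.Now().Add(maxRetriesTimeout)
	var lastErr error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		remaining := time.Until(deadline)
		if remaining <= 0 {
			return nil, fmt.Errorf("retry budget exhausted after %d attempts: %v", attempt, lastErr)
		}
		// Each dial gets at most the time left in the overall budget.
		conn, err := net.DialTimeout("tcp", addr, remaining)
		if err == nil {
			return conn, nil
		}
		lastErr = err
	}
	return nil, fmt.Errorf("all %d attempts failed: %v", maxRetries+1, lastErr)
}
```

With, say, MaxRetries=3 and MaxRetriesTimeout=15s, the worst case is bounded regardless of how slowly a misbehaving backend fails.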

cf-gitbot commented 7 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/148335845

The labels on this github issue will be updated when the story is started.

CAFxX commented 7 years ago

After the load test stopped, the application continued to receive requests from gorouter for quite a long time (a few minutes).

Just to clarify: this means that gorouter seems to keep retrying even if the client has gone away. This is fairly surprising behavior: if the client has gone away, retries (and more generally, any upstream connection made for that client connection) should be aborted immediately.

shalako commented 7 years ago

The problem is that the retries are not limited in time and, depending on the behavior of the application, they can take tens of seconds or even minutes.

Gorouter will retry only if it cannot initiate a TCP connection with a selected backend. Gorouter has a dial timeout of 5 seconds. This means that if Gorouter cannot establish a TCP connection to any registered backend, it will return a 502 within at most 15 seconds (3 attempts × 5-second dial timeout). On its own, do you find this retry logic to be a problem?

If Gorouter can establish a TCP connection then it will wait for a response with a timeout defined by request_timeout_in_seconds.
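
To make the distinction concrete, here is roughly how those two timeouts map onto Go's standard net/http types; this is only an illustration of the concepts, with values echoing the numbers in this thread, not gorouter's actual proxy code:

```go
package sketch

import (
	"net"
	"net/http"
	"time"
)

// Illustration only: the dial timeout bounds TCP connection establishment
// (the only failure that triggers a retry against another backend), while
// the response-header timeout bounds how long we wait once a connection is
// established (the territory covered by request_timeout_in_seconds).
var backendTransport = &http.Transport{
	DialContext: (&net.Dialer{
		Timeout: 5 * time.Second, // connect quickly or fail fast and retry another backend
	}).DialContext,
	ResponseHeaderTimeout: 900 * time.Second, // wait for the backend to start responding
}
```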

When you are running your load test, are you canceling connections? I don't understand how your test could finish while Gorouter is still retrying requests. This would indicate that the component downstream of Gorouter still has many connections open and is waiting for responses.

We are aware that Gorouter has an issue where file descriptors are not cleaned up immediately when a client cancels a connection (e.g. killall curl). We are investigating this.

We have added a timeout that closes idle client connections after 5 seconds, but this doesn't address connections with an active request (Gorouter waiting for a response, or retrying backends).

shalako commented 7 years ago

In https://www.pivotaltracker.com/story/show/147977289 we are adding support for a max-connections-per-backend property that should prevent an unresponsive app, or a DoS attack against a single app, from exhausting the file descriptors in Gorouter.
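
For reference, the general shape of such a guard is a per-backend in-flight counter checked before proxying; a rough sketch under that assumption (all names invented, not gorouter's implementation):

```go
package sketch

import (
	"errors"
	"sync"
)

var errBackendBusy = errors.New("backend connection limit reached")

// backendLimiter caps the number of in-flight connections per backend so a
// single unresponsive or attacked app cannot eat all of the proxy's file
// descriptors. Rough sketch only.
type backendLimiter struct {
	mu       sync.Mutex
	inFlight map[string]int // backend address -> open connections
	limit    int
}

func (l *backendLimiter) acquire(backend string) error {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.inFlight == nil {
		l.inFlight = make(map[string]int)
	}
	if l.inFlight[backend] >= l.limit {
		return errBackendBusy // caller would reject the request or pick another backend
	}
	l.inFlight[backend]++
	return nil
}

func (l *backendLimiter) release(backend string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.inFlight[backend] > 0 {
		l.inFlight[backend]--
	}
}
```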

giner commented 7 years ago

Hi @shalako,

I have realized that I'm not always able to reproduce the issue. I'll spend some more time trying to figure out the conditions and isolate it.

giner commented 7 years ago

I think I have a bigger picture now.

Let's consider the request flow within a simplified scheme: client -> gorouter -> backend, and the response path: backend -> gorouter -> client

and some hypothetical situations:

  1. client has gone away silently (actually gone, or considered gone because no FIN or RST was seen)
  2. client is very slow to send a request
  3. client is very slow to read a response
  4. backend has gone away silently (actually gone, or considered gone because no FIN or RST was seen)
  5. backend is very slow to read
  6. backend is very slow to reply
  7. gorouter is slow (due to load average being > number of cores)

All of the above can happen when:

The situations above leave a lot of connections open (including half-open connections) and cause other undesired behaviors, which in turn put load on gorouter (and we see these problems from time to time).

The ideal story would be:

Here is what we can do:

  1. Configure a reasonably short read/write timeout to the backend. Note: I found "endpoint_timeout" in gorouter, which defaults to 900s (too long), but it is not currently possible to change it without also affecting "drain_timeout". Here is a PR to expose drain_timeout: https://github.com/cloudfoundry-incubator/routing-release/pull/82
  2. Implement a read/write timeout toward the client (I didn't manage to find one in the gorouter configuration).
  3. Implement TCP keepalive to the backend (in order not to wait for endpoint_timeout if the backend is actually gone) #185 (see the dialer sketch after this list)
  4. Add an option to limit the number of connections per backend (as already mentioned above) https://www.pivotaltracker.com/story/show/147977289
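
Regarding item 3: TCP keepalive probes let the proxy notice a backend that vanished without sending a FIN or RST, instead of waiting out endpoint_timeout. A rough illustration using Go's standard dialer (example values, not gorouter configuration keys):

```go
package sketch

import (
	"net"
	"time"
)

// Illustration of item 3: with keepalive probes enabled on the backend
// connection, the OS detects a peer that disappeared without FIN/RST and
// the read fails long before a 900s endpoint_timeout would fire.
// The 30s interval is an example value, not a gorouter default.
var backendDialer = &net.Dialer{
	Timeout:   5 * time.Second,  // bound connection establishment
	KeepAlive: 30 * time.Second, // send TCP keepalive probes on the established connection
}
```
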
giner commented 7 years ago

Here are some pictures.

  1. Backend is slow to reply (screenshot: backend_is_slow_to_reply)

  2. Data is still coming from the backend but the client has gone (note that endpoint_timeout on the gorouter of router_dev_z3 is lowered from 900 to 75) (screenshot: gorouter_sends_client_gone)

shashwathi commented 7 years ago

@giner :

Going through the list of suggested things we can do:

Configure a reasonably short read/write timeout to the backend. Note: I found "endpoint_timeout" in gorouter, which defaults to 900s (too long), but it is not currently possible to change it without also affecting "drain_timeout". Here is a PR to expose drain_timeout: cloudfoundry-incubator/routing-release#82.

Waiting for @shalako to prioritize that issue.

Implement a read/write timeout toward the client (I didn't manage to find one in the gorouter configuration).

Currently the gorouter config does not expose read and write timeouts for the client. Looking through the source code, we do not have any hard-coded limits either, which means there is no limit on this I/O operation.
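
For context, Go's standard http.Server already has the kind of knobs being discussed; a sketch of what exposing them might look like (placeholder values, not proposed defaults):

```go
package main

import (
	"net/http"
	"time"
)

func main() {
	// Sketch only: the standard library limits that a client-side read/write
	// timeout in gorouter would presumably map onto.
	srv := &http.Server{
		Addr:         ":8080",
		ReadTimeout:  30 * time.Second, // guards against a client that is slow to send the request
		WriteTimeout: 30 * time.Second, // guards against a client that is slow to read the response
		IdleTimeout:  5 * time.Second,  // matches the idle-connection timeout mentioned earlier
	}
	_ = srv.ListenAndServe()
}
```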

Implement TCP keepalive to the backend (in order not to wait for endpoint_timeout if the backend is actually gone) #185

Waiting for @shalako to prioritize that issue.

Add an option to limit the number of connections per backend (as already mentioned above) https://www.pivotaltracker.com/story/show/147977289

Already finished and included in the latest routing release.

So the only thing left in this issue would be to consider connection read and write limits. Is it reasonable to wait until the other recommendations are addressed and then check whether the issue still persists?

Regards, Shash

shalako commented 5 years ago

Max Connections per backend is supported. cloudfoundry/routing-release#82 was closed by @giner. We've closed #185 in favor of support for the same feature in Envoy.

As that addresses the suggested solutions, we'll close this one, too.