cloudfoundry / routing-release

This is the BOSH release for cloud foundry routers
Apache License 2.0
42 stars 106 forks source link

Coordinate backend HTTP keep-alive timeouts with buildpack teams #240

Open peterellisjones opened 3 years ago

peterellisjones commented 3 years ago

Is this a security vulnerability?

No

Issue

Hi folks, when using HTTP keep-alive it is important that clients always close connections before servers. This is to avoid the race condition where a client sends a new request on an existing connection just as the server closes the connection. When this happens for a request from the gorouter to an application, the user will get a 502 for non-retriable / non-idempotent requests.

When backend keep-alive connections are enabled via max_idle_connections > 0 , the gorouter uses a 90 second duration for keep alives. Unfortunately this default value is incompatible with the 60 second default value in the Java buildpack as client keep-alive durations must be longer than server keep-alive durations to avoid the race condition.

This means that the hard-coded value for keep-alive duration in the gorouter and the default value for keep-alive duration in the Java buildpack are not compatible. So that Cloud Foundry works better "out of the box", it would be great if buildpack teams could coordinate with gorouter team so that mutually-compatible values are chosen. The important thing is that whatever values are chosen, the gorouter should have a shorter keep-alive duration than the app.

Affected Versions

All gorouters with backend keep alive enabled

Context

Java buildpack issue 881 CF docs PR 199 and PR 201

Traffic Diagram

``` +----+---+ +----------+ +-------+ \o/ | | | | | | + +--->+ AWS LB +--->+ Gorouter +---->+ App | / \ | | | | | | client +--------+ +----------+ +-------+ ^^^^ issue occurs here when app closes connection just as gorouter sends next request ``` ## Steps to Reproduce You can reproduce this by creating a test app with a (for example) 100ms keep-alive timeout and writing a script that sends a [non-idempotent request](https://github.com/golang/go/blob/5c489514bc5e61ad9b5b07bd7d8ec65d66a0512a/src/net/http/request.go#L1413-L1427) (eg a POST request) via the gorouter every 100ms. You will eventually get a 502 error from the gorouter when the race condition triggers due to the gorouter sending a request to the app just as the app closes the connection. I've successfully reproduced this with a gorouter and test app running locally -- I expect you may need to tweak the durations for a gorouter running in Cloud Foundry to account for network latency. ## Possible Fix A simple fix for the java-buildpack case would just be to make the gorouter backend keep-alive duration < 60 seconds. This may still be incompatible with other buildpacks though.
plowin commented 3 years ago

Also cross-linking a similar blog about a nodejs express server behind an AWS ALB: https://adamcrowder.net/posts/node-express-api-and-aws-alb-502. One customer just recently ran into this issue with their node-server on our CF env.

plowin commented 1 year ago

@ameowlia , we are regularly receiving 502 reports due to respective misconfigurations. There is not much progress on buildpack-side. Could you support making this a topic if it would also support other community-members? See e.g. https://github.com/cloudfoundry/java-buildpack/issues/881

maxmoehl commented 1 month ago

Good news: cloudfoundry/java-buildpack#881 has been addressed with cloudfoundry/java-buildpack#1078!