caddyserver / caddy

Fast and extensible multi-platform HTTP/1-2-3 web server with automatic HTTPS
https://caddyserver.com
Apache License 2.0

Reverse proxy upstream keepalive #6452

Open joanrieu opened 2 weeks ago

joanrieu commented 2 weeks ago

Hello, thanks for Caddy, it's a pleasure to use it for its easy config and its "it just works" approach.

Issue

I've spent hours trying to understand why Caddy was randomly returning 502s for some proxied calls to my Node.js backend. It was logging the following message which made me pointlessly look for issues in my network stack:

read tcp #.#.#.#:#->#.#.#.#:#: read: connection reset by peer

Resolution

I have finally identified the root cause: the default HTTP transport keepalive duration in Caddy (2 minutes) far exceeds the default Node.js HTTP server keepAliveTimeout (5 seconds).
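For illustration, this is roughly the workaround I ended up with (a sketch; example.com and port 3000 stand in for my real site and backend): lowering the reverse_proxy transport keepalive below Node's 5-second default.

```
example.com {
	reverse_proxy localhost:3000 {
		transport http {
			# keep idle upstream connections for less time than Node's
			# default keepAliveTimeout of 5 seconds
			keepalive 4s
		}
	}
}
```

The opposite adjustment also works: raising Node's server.keepAliveTimeout above Caddy's keepalive duration.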

Improvements

I believe it would be worth doing the following:

  1. making Caddy reopen a connection instead of returning a 502 to the client (turning the problem into a warning instead of an error)
  2. pointing out in the documentation that the keepalive duration must be below the upstream server's own keepalive timeout
  3. pointing out more prominently in the documentation that Caddy enables a keepalive by default (should that default be revisited?)
  4. improving the log in case the connection that was kept alive happens to be closed

What do you think?

Rationale

This issue is not specific to Caddy; you can find many articles online about people spending lots of time on similar issues with every proxy out there that enables a long keepalive by default. So it would fit Caddy's philosophy to provide something that works better out of the box.

This issue tends to appear for small apps that suddenly get more users. Backends with very little activity hit Caddy's timeout (which hides the problem), while the ones with a lot of activity never hit Node's timeout (which may hide the problem as well). But for a small server that sees a few bursts of activity, a lot of unexplained 502s suddenly pop up as many calls fall in the 5sec < x < 2min gap.

mholt commented 2 weeks ago

We don't really deal with keepalives ourselves, we just pass the value onto the Go standard library's HTTP transport: https://pkg.go.dev/net/http#Transport
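Roughly speaking, the settings we accept just end up on fields like these (a standalone sketch using the stdlib reverse proxy, not Caddy's actual wiring; the values are arbitrary):

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"time"
)

func main() {
	upstream, _ := url.Parse("http://127.0.0.1:8081")
	proxy := httputil.NewSingleHostReverseProxy(upstream)
	proxy.Transport = &http.Transport{
		DisableKeepAlives:   false,            // HTTP keepalive (connection reuse), not TCP keepalive
		IdleConnTimeout:     90 * time.Second, // how long an idle upstream connection stays open
		MaxIdleConnsPerHost: 32,
	}
	log.Fatal(http.ListenAndServe(":8080", proxy))
}
```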

To be clear, you're talking about HTTP keepalives, right? Not TCP keepalive?

Before making any changes, I'd like to reproduce the issue as minimally as possible to understand it. Do you have a minimal reproducer? Ideally, if the backend can also be Caddy, that will help me eliminate Node as a possible source of confusion/bugs/problems.
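For example, something like two Caddy instances, where the backend's idle timeout is much shorter than the proxy's upstream keepalive, might be a starting point (untested sketch; ports and timeouts are arbitrary):

```
# backend.Caddyfile (first instance)
{
	admin off
	servers {
		timeouts {
			idle 2s
		}
	}
}
:8081 {
	respond "hello from upstream"
}
```

```
# proxy.Caddyfile (second instance, default 2-minute upstream keepalive)
:8080 {
	reverse_proxy 127.0.0.1:8081
}
```

Then send POST requests to :8080 spaced just over 2 seconds apart, so they race the backend closing its idle connections.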

joanrieu commented 2 weeks ago

Thanks for the feedback. I've tried reproducing locally and managed to get a handful of 502s, but it's hard because it's a timing issue, even more so than I thought.

After further reading on the topic, the cause is slightly different from what I assumed above and is apparently well known to HTTP client implementors: it's a race condition between the moment the server closes the idle socket and the moment the client detects it (so in our example above, the affected requests are not those in the 5sec < x < 2min range, but more like 5sec < x < 5.1sec).

It seems to be a known issue in all HTTP clients, one that doesn't have a clean solution. Go's library does not have a heuristic to detect that situation and does what the HTTP spec mandates in the general case: only requests with idempotent verbs are retried, and in my case POST requests without idempotency headers are excluded and therefore fail. Other clients like browsers tend to retry in all cases instead, going against the spec to some extent.

The issue has been reported/closed for Go's HTTP lib in https://github.com/golang/go/issues/22158, which brought support for idempotency headers.
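To illustrate, my understanding is that Go's transport treats an outgoing request as replayable when its body can be re-read and it carries an Idempotency-Key (or X-Idempotency-Key) header, so a POST built like this should be retried transparently if the reused connection turns out to be dead (sketch; the URL and key are made up):

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// NewRequest sets GetBody for *strings.Reader, so the body can be replayed.
	body := strings.NewReader(`{"hello":"world"}`)
	req, err := http.NewRequest("POST", "http://127.0.0.1:8080/orders", body)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/json")
	// Non-standard header that net/http recognizes as marking the request idempotent.
	req.Header.Set("Idempotency-Key", "order-123")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status)
}
```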

I think it's still worth thinking about a way to state that in the docs, as it does cause issues and the root cause is hard to figure out. Something along the lines of:

HTTP keep-alive is enabled by default with a long timeout. If your upstream server is configured to close idle connections with a shorter delay, HTTP 502s might be returned on some non-idempotent requests.

Further reading:

mholt commented 2 weeks ago

I think it's still worth thinking about a way to state that in the docs, as it does cause issues and the root cause is hard to figure out. Something along the lines of

That's a great idea. Thanks for looking deeply into this, it would have driven me bonkers too.

So you're saying that a POST request (for example) with an Idempotency-Key header (for example) would be retried?

We can add that to the docs. I agree it's worth it.