plowin closed this issue 11 months ago.
So we went through three separate stages:

| HAProxy version | Status code | Termination state | Explanation |
| --- | --- | --- | --- |
| 2.7.X | 200 | `----` | Known-good behaviour. |
| 2.8.3 | 502 | `SH--` | HAProxy started reporting errors due to backend issues although the backend sent a valid response code. |
| 2.8.4 | 200 | `SD--` | Fix applied, error reporting still broken. |

With 2.7 we didn't observe any particular issues.
With 2.8 some of the scenarios broke. We suspect that this broken behaviour was introduced with haproxy/haproxy@f2b02cfd940eec2810590ef15b37631defd11b6b, but we don't know for sure. The observed behaviour was that for 0.1% to 1% of requests with a body, the backend responded with a 200 but HAProxy detected a broken backend state and responded with a 502.
My current understanding is that this was previously masked by the way the states were checked; with 2.8 there was an overhaul of error checking. The specific case is that HAProxy still has data to send (the request body) while the server has already responded with a HEADERS frame that has the END_STREAM flag set. HAProxy considered this an error, but the spec does allow for it: the server simply does not expect a body, so it immediately sends a response, and the client does not have to send all of the body.
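For illustration, here is a minimal Go sketch of a backend that triggers exactly this pattern (port and handler are made up for the example, not the app from the report): it answers before consuming the request body, so the HTTP/2 layer can emit a header-only response with END_STREAM set while the client is still sending DATA frames.

```go
package main

import (
	"log"
	"net/http"

	"golang.org/x/net/http2"
	"golang.org/x/net/http2/h2c"
)

func main() {
	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Answer immediately without reading r.Body and without a response
		// body: the header-only response lets the HTTP/2 layer set END_STREAM
		// on the HEADERS frame while the client may still be sending the
		// request body.
		w.WriteHeader(http.StatusOK)
	})
	// h2c serves cleartext HTTP/2, which makes the frames easy to observe
	// locally, e.g. with nghttp.
	log.Fatal(http.ListenAndServe(":8080", h2c.NewHandler(handler, &http2.Server{})))
}
```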
With haproxy/haproxy@d3e379b3ce8b0bf6bb167bb1f06268edb5406284, HAProxy still processes the response even if the server has already closed the stream, as the closing of the stream and the response can be a single frame.
There are probably a lot more details to this, but for me this explanation is sound enough to accept that it's fixed with 2.8.4. The error reporting is still broken, but that seems to depend on a pending refactor.
I'd say for the purpose of this issue it should be sufficient to bump to 2.8.4 and close it, as the traffic is no longer hampered. We may or may not decide to roll out the change or wait for a "better" fix. However, upstream clients may still want to use 2.8 and should not be forced to wait because of a seemingly log-only error.
Fixed in https://github.com/cloudfoundry/haproxy-boshrelease/releases/tag/v13.3.0%2B2.8.4.
As mentioned above, the new version might contain some misleading 200s with termination state `SD--`, which will hopefully be fixed soon.
Edit: Adding additional feedback from the HAProxy community
> So the fix to delay the server abort detection to be able to send the response to the client was pushed to 2.8.x. However, the "SD--" termination state is a side effect of the current design and will require an important refactoring to be fixed. Not this specific case of course. It is only a small piece of a bigger problem. We will work on it for sure. But I don't really know when it will be available. Hopefully for the 3.0, but not sure. And unfortunately I doubt it will ever be backported to 2.8.x. Maybe some fixes will be backportable. Hard to know for now.
Just a little update about this issue. A recent fix was pushed to haproxy-3.0-DEV. It may help to limit false `SD--` reports in logs: https://github.com/haproxy/haproxy/commit/d9eb6d6680c4baccaced3668f8b0c7fc19246fd1
Description
This was observed in Cloud Foundry landscapes with the following ingress path: Client -> LB -> HAProxy -> gorouter -> app. On the frontend, HTTP/2 is enabled and apps are free to choose between HTTP/1.1 and HTTP/2. If the app speaks H/2, H/2 is used towards gorouter, which might do a downgrade and open several HTTP/1.1 connections towards the respective app for a parallel H/2 request at the frontend. With the HAProxy upgrade from 2.7.10 to 2.8.3, we started observing a failure rate of ~1% for POST and PUT requests.
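To make the downgrade step concrete, here is a rough Go sketch of a reverse proxy behaving like that hop (addresses and certificate files are placeholders; this is not gorouter's actual code): TLS clients can negotiate h2 via ALPN on the inbound side, while the proxy's default transport opens plain HTTP/1.1 connections towards the app.

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	// Placeholder app address; gorouter would route dynamically instead.
	app, err := url.Parse("http://127.0.0.1:8080")
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(app)
	// ListenAndServeTLS enables HTTP/2 via ALPN by default, while the
	// proxy's default transport speaks HTTP/1.1 to the http:// backend,
	// so parallel h2 streams fan out into separate HTTP/1.1 connections.
	log.Fatal(http.ListenAndServeTLS(":8443", "cert.pem", "key.pem", proxy))
}
```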
Initial Analysis
In the platform logs, we observed differing status codes for HAProxy and gorouter respectively: where gorouter logs a 200 response, HAProxy might log a 502 for the same request. HAProxy also logs termination state `SH--`; from the docs:

> S : the TCP session was unexpectedly aborted by the server, or the server explicitly refused it.
>
> H : the proxy was waiting for complete, valid response HEADERS from the server (HTTP only).

The issue could be reproduced with 100 req/s, e.g. via a command-line load generator (a Go sketch with an equivalent load profile follows below).
Explanation: execute HTTP/2 POST requests with a given string as body for 1 minute, at 2 requests per worker per second, using the default worker pool of 50. This results in 100 requests per second over 50 connections for 60 s = 6000 requests (if everything goes well; it might be a bit less, as the program hard-stops after one minute).
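As a stand-in for the original command, a minimal Go sketch with the same load profile (target URL and request body are placeholders): 50 workers, each sending 2 HTTP/2 POSTs per second for one minute.

```go
package main

import (
	"crypto/tls"
	"net/http"
	"strings"
	"sync"
	"time"

	"golang.org/x/net/http2"
)

func main() {
	// 50 workers x 2 req/s x 60 s = 6000 HTTP/2 POSTs, as described above.
	const workers = 50
	client := &http.Client{Transport: &http2.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true}, // test setup only
	}}
	deadline := time.Now().Add(time.Minute)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			ticker := time.NewTicker(500 * time.Millisecond) // 2 req/s per worker
			defer ticker.Stop()
			for time.Now().Before(deadline) {
				<-ticker.C
				resp, err := client.Post("https://haproxy.example.com/upload",
					"text/plain", strings.NewReader("some request body"))
				if err != nil {
					continue // count/log errors here when comparing 200 vs. 502
				}
				resp.Body.Close()
			}
		}()
	}
	wg.Wait()
}
```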
Countermeasures
Do not move to 2.8.x if you run a similar setup. We keep maintaining the 2.7.y line as well.