Open davidmytton opened 2 months ago
This is a message the load balancer sends back to the client. The client can handle it by killing the entire session and reopening it from scratch, but we don't have control over that with ConnectRPC. We can revisit this when we write our own session manager, but in the meantime @davidmytton can look into the load balancer configuration.
The load balancer isn't under our control - it's from Fly.io. I've opened a support case with them.
Fly have now confirmed their proxy may return `NGHTTP2_ENHANCE_YOUR_CALM` in some circumstances. I'm still working with them to figure out why. I expect it's to do with the network jitter and tail latency we're seeing in `yul` and `yuz`, whereby enough timeouts cause it to hit an error threshold.
We have had the high tail latency (network jitter) support case open with Fly.io for over a week. The tail latency causes internal slowdowns in our API, which result in a timeout on the client that, I think, goes back to the Fly proxy as a reset. Given enough timeouts / resets (1024 by default), this causes the Fly proxy to trigger `ENHANCE_YOUR_CALM` (here), which Node reports as `NGHTTP2_ENHANCE_YOUR_CALM`.
This is what hyperium/h2 describes at:
https://github.com/hyperium/h2/blob/90359ba6d38843b106967a6ac9419a500ea26873/src/server.rs#L892
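To make the threshold behaviour concrete, here is a minimal sketch of a per-connection reset counter. It is a simplified model, not the actual hyperium/h2 or Fly proxy code: the `ResetTracker` class, its method name, and the exact counting rule (fail once the limit is exceeded) are assumptions for illustration. Only the default of 1024 comes from the discussion above, and `0x0b` is the `ENHANCE_YOUR_CALM` error code defined by the HTTP/2 specification.

```typescript
// Simplified model of a per-connection reset threshold.
// Assumption: the class, method names, and counting rule are hypothetical;
// only the default limit (1024) comes from the discussion above.
const ENHANCE_YOUR_CALM = 0x0b; // HTTP/2 error code (RFC 9113, section 7)

class ResetTracker {
  private resets = 0;
  private readonly maxResets: number;

  constructor(maxResets: number = 1024) {
    this.maxResets = maxResets;
  }

  // Record one reset stream. Returns the error code to close the connection
  // with once the limit is exceeded, or null while still under the limit.
  recordReset(): number | null {
    this.resets += 1;
    return this.resets > this.maxResets ? ENHANCE_YOUR_CALM : null;
  }
}
```

In this model a connection that keeps accumulating timeouts and resets eventually crosses the limit, at which point every later stream on it would fail until the connection is replaced.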
However, that error isn't being handled properly by the client: it continues to send requests on the same connection. What we see is that the Arcjet SDK client connects to our API endpoint on Fly, the Fly proxy forwards the request to our app, the API processes the request normally and returns the response, and the client gets `NGHTTP2_ENHANCE_YOUR_CALM`. It never sees the response we send.
Some additional notes:
[internal] Stream closed with error code NGHTTP2_INTERNAL_ERROR
`NGHTTP2_ENHANCE_YOUR_CALM` error should reset the connection. However, it looks like https://github.com/nodejs/undici/issues/2675 may prevent that.

Stream tracking
Aug 31, 2024 @ 19:59:04.147141000  received Reset { stream_id: StreamId(1306619), error_code: ENHANCE_YOUR_CALM }
Aug 31, 2024 @ 19:59:04.146676000  send Data { stream_id: StreamId(1306619), flags: (0x1: END_STREAM) }
Aug 31, 2024 @ 19:59:04.146645000  send Headers { stream_id: StreamId(1306619), flags: (0x4: END_HEADERS) }
Aug 31, 2024 @ 19:59:04.142444000  received Data { stream_id: StreamId(1306619), flags: (0x1: END_STREAM) }
Aug 31, 2024 @ 19:59:04.141560000  received Data { stream_id: StreamId(1306619) }
Aug 31, 2024 @ 19:59:04.141500000  received Headers { stream_id: StreamId(1306619), flags: (0x4: END_HEADERS) }
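The trace is listed newest first, so the sequence is easier to follow once sorted. The sketch below copies the six events for stream 1306619 from the trace and checks whether the reset arrived after the response was fully sent. The `FrameEvent` type and both helper functions are hypothetical names introduced only for this illustration.

```typescript
// Events copied from the trace above (same date, listed newest first).
// Assumption: FrameEvent, chronological, and resetAfterResponse are
// illustrative helpers, not part of any library.
interface FrameEvent {
  time: string; // time of day with nanosecond precision
  direction: "received" | "send";
  frame: "Headers" | "Data" | "Reset";
  endStream: boolean;
}

const trace: FrameEvent[] = [
  { time: "19:59:04.147141000", direction: "received", frame: "Reset", endStream: false },
  { time: "19:59:04.146676000", direction: "send", frame: "Data", endStream: true },
  { time: "19:59:04.146645000", direction: "send", frame: "Headers", endStream: false },
  { time: "19:59:04.142444000", direction: "received", frame: "Data", endStream: true },
  { time: "19:59:04.141560000", direction: "received", frame: "Data", endStream: false },
  { time: "19:59:04.141500000", direction: "received", frame: "Headers", endStream: false },
];

// Timestamps are fixed width, so a plain string comparison orders them by time.
function chronological(events: FrameEvent[]): FrameEvent[] {
  return [...events].sort((a, b) => (a.time < b.time ? -1 : a.time > b.time ? 1 : 0));
}

// True when a Reset was received after a frame carrying END_STREAM was sent.
function resetAfterResponse(events: FrameEvent[]): boolean {
  const ordered = chronological(events);
  const sentEnd = ordered.findIndex((e) => e.direction === "send" && e.endStream);
  const reset = ordered.findIndex((e) => e.direction === "received" && e.frame === "Reset");
  return sentEnd !== -1 && reset > sentEnd;
}
```

For this trace `resetAfterResponse(trace)` is true: the request is received, the response headers and final data frame are sent, and only then does the reset with `ENHANCE_YOUR_CALM` arrive.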
A user is receiving various HTTP/2 errors, e.g.
{"error":{"message":"[internal] Stream closed with error code NGHTTP2_ENHANCE_YOUR_CALM"}}
and
{"error":{"message":"[internal] Stream closed with error code NGHTTP2_INTERNAL_ERROR"}}
for every call to `decide`, which is then being logged via `report`.

I'm unsure about the cause of this, but it may be a timeout or error on a previous request. This is referenced in various places, but all of them suggest that fixes have already gone out:
The only way to resolve this is to restart the Node process, which isn't ideal. When we get these errors, can we re-establish the connection?
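As a rough illustration of what re-establishing the connection could look like, here is a minimal sketch of a wrapper that discards the current session and opens a fresh one when a request fails with one of the two error messages quoted above. Everything in it is hypothetical: `ReconnectingSession`, `isFatalH2Error`, and the `open`/`close` callbacks are illustrative names and are not part of the Arcjet SDK, ConnectRPC, or Node. Matching on the error message text is also an assumption based on the messages shown in this issue.

```typescript
// Hypothetical session wrapper: replace the session after a fatal HTTP/2 error.
// Assumption: errors are identified by the message text quoted in this issue.
const FATAL_H2_CODES = /NGHTTP2_(ENHANCE_YOUR_CALM|INTERNAL_ERROR)/;

function isFatalH2Error(err: unknown): boolean {
  return err instanceof Error && FATAL_H2_CODES.test(err.message);
}

class ReconnectingSession<S> {
  private session: S | null = null;
  private readonly open: () => S;
  private readonly close: (session: S) => void;

  constructor(open: () => S, close: (session: S) => void) {
    this.open = open;
    this.close = close;
  }

  // Run one request. On a fatal HTTP/2 error, drop the session, open a new
  // one, and retry the request once; any other error is rethrown unchanged.
  async call<T>(request: (session: S) => Promise<T>): Promise<T> {
    try {
      return await request(this.current());
    } catch (err) {
      if (!isFatalH2Error(err)) throw err;
      this.reset();
      return request(this.current());
    }
  }

  // Lazily open a session the first time one is needed.
  private current(): S {
    if (this.session === null) this.session = this.open();
    return this.session;
  }

  // Close and forget the current session so the next call opens a new one.
  private reset(): void {
    if (this.session !== null) {
      this.close(this.session);
      this.session = null;
    }
  }
}
```

With Node's built-in `node:http2` module, `open` could be `() => http2.connect(url)` and `close` could be `(session) => session.destroy()`. Retrying is only safe for requests that can be repeated, since in the scenario described in this thread the server may already have processed the original request.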