arcjet / arcjet-js

Arcjet JS SDKs. Bot detection, rate limiting, email validation, attack protection, data redaction for Node.js, Next.js, Deno, Bun, SvelteKit, NestJS.
https://arcjet.com
Apache License 2.0
267 stars 7 forks source link

HTTP2 connection not being reset when error received #1401

Open davidmytton opened 2 months ago

davidmytton commented 2 months ago

A user is receiving various HTTP2 errors e.g. {"error":{"message":"[internal] Stream closed with error code NGHTTP2_ENHANCE_YOUR_CALM"}} and {"error":{"message":"[internal] Stream closed with error code NGHTTP2_INTERNAL_ERROR"}} for every call to decide, which is then being logged via report.

I'm unsure about the cause of this, but it may be a timeout or error on a previous request. This is referenced in various places, but all suggest fixes have already gone out:

The only way to resolve this is to restart the Node process, which isn't ideal. When we get these errors, can we re-establish the connection?

blaine-arcjet commented 2 months ago

This is a message the load balancer sends back to the client. It can be handled in the client via killing the entire session and reopening from scratch, but we don't have control over that with ConnectRPC. This is something we can revisit when we write our own session manager, but in the meantime @davidmytton can look into the load balancer configuration.

davidmytton commented 2 months ago

The load balancer isn't under our control - it's from Fly.io. I've opened a support case with them.

davidmytton commented 2 months ago

Fly have now confirmed their proxy may return NGHTTP2_ENHANCE_YOUR_CALM in some circumstances. I'm still working with them to figure out why. I expect it's to do with the network jitter and tail latency we're seeing in yul and yuz whereby enough timeouts cause it to hit an error threshold.

davidmytton commented 2 months ago

We have had the high tail latency (network jitter) support case open with Fly.io for over a week. The tail latency causes internal slowdowns in our API which results in a timeout on the client, and I think goes back to the Fly proxy as a reset. Given enough timeouts / resets (1024 by default), this causes the Fly proxy to trigger ENHANCE_YOUR_CALM (here) which node reports as NGHTTP2_ENHANCE_YOUR_CALM.

This is what hyperium/h2 describes at:

https://github.com/hyperium/h2/blob/90359ba6d38843b106967a6ac9419a500ea26873/src/server.rs#L892

However, that error isn't being handled properly by the client - it continues to send requests on the same connection. What we see is the Arcjet SDK client connects to our API endpoint on Fly, Fly proxy proxies it our app, the API processes the request normally and returns the response, the client gets NGHTTP2_ENHANCE_YOUR_CALM. It never sees the response we send.

Some additional notes:

https://github.com/connectrpc/connect-es/blob/dc7d75a1eaaac12a06b4dcffd0cfd370c5432648/packages/connect-node/src/node-universal-handler.spec.ts#L122

https://github.com/nodejs/node/blob/2f0b3713efd1f8f06a38115393cee8531803ce43/lib/internal/errors.js#L1307

https://github.com/connectrpc/connect-es/blob/dc7d75a1eaaac12a06b4dcffd0cfd370c5432648/packages/connect-node/src/node-error.ts#L22

Stream tracking

Aug 31, 2024 @ 19:59:04.147141000
received Reset { stream_id: StreamId(1306619), error_code: ENHANCE_YOUR_CALM }

Aug 31, 2024 @ 19:59:04.146676000
send Data { stream_id: StreamId(1306619), flags: (0x1: END_STREAM) }

Aug 31, 2024 @ 19:59:04.146645000
send Headers { stream_id: StreamId(1306619), flags: (0x4: END_HEADERS) }

Aug 31, 2024 @ 19:59:04.142444000
received Data { stream_id: StreamId(1306619), flags: (0x1: END_STREAM) }

Aug 31, 2024 @ 19:59:04.141560000
received Data { stream_id: StreamId(1306619) }

Aug 31, 2024 @ 19:59:04.141500000
received Headers { stream_id: StreamId(1306619), flags: (0x4: END_HEADERS) }