apollographql / federation

🌐  Build and scale a single data graph across multiple services with Apollo's federation gateway.
https://apollographql.com/docs/federation/
Other
659 stars 248 forks source link

Apollo Server socket connection issue when used with Apollo Federation (and make-fetch-happen) #2313

Open pcoulso1 opened 1 year ago

pcoulso1 commented 1 year ago

Issue Description

I am seeing a Socket Hung up issues occurring between the Gateway and the Microservices, when using the ApolloServer in both Gateway and Microservices components, within Apollo Federation 2.

It appears to be a race condition but is happening frequently enough as to de-stabilize the connection between underlying microservice and the gateway.

The issue is reported in the gateway as;

{"level":50,"time":1671446399521,"pid":16,"hostname":"xxxxx",
"name":"service-gateway@2.0.1",
"message":"request to https://microservice.apps.domain.net/graphql failed, reason: socket hang up",
"extensions":{"code":"INTERNAL_SERVER_ERROR",
"stacktrace":["FetchError: request to https://microservice.apps.domain.net/graphql failed, reason: socket hang up",
"    at ClientRequest.<anonymous> (/app/node_modules/minipass-fetch/lib/index.js:130:14)",
"    at ClientRequest.emit (node:events:513:28)",
"    at ClientRequest.emit (node:domain:489:12)",
"    at TLSSocket.socketOnEnd (node:_http_client:518:9)",
"    at TLSSocket.emit (node:events:525:35)",
"    at TLSSocket.emit (node:domain:489:12)",
"    at endReadableNT (node:internal/streams/readable:1358:12)",
"    at processTicksAndRejections (node:internal/process/task_queues:83:21)"]}}

On investigation the issues appears to be similar to the one reported in https://medium.com/ssense-tech/reduce-networking-errors-in-nodejs-23b4eb9f2d83, where there is a race condition between the client and the server closing the connection.

The issues seems to be related to the "freeSocketTimeout" implemented within the 'agentkeepalive' agent (which is used by 'make-fetch-happen'). In the 'make-fetch-happen', freeSocketTimeout is hard coded to 15000, (see https://github.com/npm/make-fetch-happen/blob/main/lib/agent.js#L80) but by default the server side socket keep alive timeout is set to just 5000 (see https://nodejs.org/docs/latest-v16.x/api/http.html#serverkeepalivetimeout). So there is a window of opportunity for the server to close the socket but the client still consider it as a usable socket.

This results in the “socket hang up” or ECONNRESET errors being reported at about 5000ms of ideal time on the socket.

The version of NPM and NODEJS which I am using is;

The following is an output of the attached reproducer app, which simply uses ApolloServer as the server and then calls make-fetch-happen to handle the calls into the server. As you can see these fail about 50% of the time (when using a 4999ms delay)

{
  successes: 26,
  failures: {
    'request to https://localhost:4001/graphql failed, reason: read ECONNRESET': 24
  }
}

I’ve also included two possible workarounds;

I suspect the real fix should be to get 'make-fetch-happen' to make freeSocketTimeout configurable and then set it to an appropriate value

Please advise what the long term fix should be for this issue.

Link to Reproduction

https://github.com/pcoulso1/connection-test

Reproduction Steps

Clone the repo and run the following command lines

npm install
npm run start:dev
glasser commented 1 year ago

Migrating this to federation as this is primarily about Gateway, though it is possible that one response might include improving documentation to encourage Apollo Server subgraph developers to increase their keepAliveTimeout. More thoughts to come later once I've checked with some folks who might have good ideas. Really great reproduction — thanks! One note is that IIUC make-fetch-happen's retry feature (which we do disable) does retry ECONNRESET by default... but not for POSTs. I suspect that we should just retry ECONNRESET for non-mutations in Gateway. It's also quite possible that Apollo Router handles this better than Node/make-fetch-happen does!

pcoulso1 commented 1 year ago

I think the root of this issue is with the make-fetch-happen library which has hardcoded freeSocketTimeout, therefore I've raised the following bug report at https://github.com/npm/make-fetch-happen/issues/199

glasser commented 1 year ago

Yeah, though I could understand if their response was "the proper way to configure HttpAgent parameters is to override the agent, not to copy all of HttpAgent's options into make-fetch-happen". Though... the mechanisms they give to override the agent only let you set one fixed agent rather than use their per-host agent logic, right?