Better handling of arangojs' connection pool to avoid "socket hang up?"

radicaled commented 1 year ago

We've had a long-running issue of seeing "socket hang up" errors when using arangojs. They didn't happen all the time; it was pretty sporadic. It was eventually tracked down to a combination of factors:

Enabling HTTP keep-alive via agentOptions.keepAlive
The default scheduling strategy for agentOptions.scheduling being lifo
ArangoDB's --http.keep-alive-timeout being set to 300 seconds (see https://www.arangodb.com/docs/stable/programs-arangod-options.html#--httpkeep-alive-timeout)
Having agentOptions.maxSockets set to 32 (we're using arangojs in the context of a GraphQL server)

This combination meant that if we didn't use a socket for 300 seconds (a realistic possibility with a 32-socket pool), ArangoDB would close that socket, and arangojs wouldn't know until it tried to make a request on it.

The quick fix was to disable HTTP keep-alive via agentOptions.keepAlive = false; however, that increased response latency significantly (sometimes doubling it). The second thing we investigated was using HTTP/2, but arangojs basically expects the http module's Agent, so that would have been too much surgery and would have had questionable forward compatibility.

What we've done for now is set agentOptions.scheduling to fifo. According to the documentation (https://nodejs.org/api/http.html#http_new_agent_options), this defaults to lifo, which means that some sockets may go unused. Thus, during a period of low activity, these connections can be dropped by ArangoDB. Then, during a period of higher activity, the agent tries to use one of these dropped sockets, and bang: socket hang up! With fifo, however, our use case makes enough requests to ArangoDB even during periods of low activity that no socket is ever idle long enough to be terminated by ArangoDB's keep-alive timeout.
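To make the lifo/fifo difference concrete, here is a minimal simulation sketch. It is not how Node's Agent is implemented; it is a simplified model that assumes all 32 sockets were opened by an earlier burst at time 0, that requests then arrive one at a time, and that each request returns its socket to the pool before the next one starts. The function name and parameters are made up for illustration.

```javascript
// Simplified model of a keep-alive pool: each socket records when it was last used.
// Returns the longest time (in seconds) any socket sat idle between uses.
function longestIdleGap({ poolSize, scheduling, requestIntervalSec, durationSec }) {
  // Free list ordered from least recently used (front) to most recently used (back).
  const free = Array.from({ length: poolSize }, (_, id) => ({ id, lastUsed: 0 }));
  let worstGap = 0;

  for (let now = requestIntervalSec; now <= durationSec; now += requestIntervalSec) {
    // "lifo" picks the most recently used socket, "fifo" the least recently used one.
    const socket = scheduling === "lifo" ? free.pop() : free.shift();
    worstGap = Math.max(worstGap, now - socket.lastUsed);
    socket.lastUsed = now;
    free.push(socket); // the request finishes and the socket returns to the pool
  }

  // Sockets that were never picked again have been idle since their last use.
  for (const socket of free) {
    worstGap = Math.max(worstGap, durationSec - socket.lastUsed);
  }
  return worstGap;
}

// One request every 5 seconds for 10 minutes against a 32-socket pool.
const base = { poolSize: 32, requestIntervalSec: 5, durationSec: 600 };
console.log("lifo:", longestIdleGap({ ...base, scheduling: "lifo" }));
console.log("fifo:", longestIdleGap({ ...base, scheduling: "fifo" }));
```

Under this model, lifo keeps reusing a single socket and leaves the other 31 idle for the full 600 seconds, well past a 300-second server timeout, while fifo never lets any socket sit idle for more than 160 seconds.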

So, setting the scheduling option to fifo seems to work, but it kind of kicks the can down the road. I.e., if we had a background process that slept or only did work sporadically, it could go more than 5 minutes without making a database request, bringing us back to ArangoDB hanging up on us.

I'm not too familiar with arangojs or the internals of the http module's Agent class, but is there a better way of handling this type of error? Can we manually terminate idle sockets after a period of time (shorter than ArangoDB's --http.keep-alive-timeout), or send a no-op request to the server over each idle socket to make sure there are no disconnections?
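One possible direction for the "terminate idle sockets" idea, sketched with Node's built-in http.Agent only: the Agent accepts a `timeout` option (socket timeout in milliseconds), and as far as I can tell from current Node.js versions, a pooled socket that hits this timeout while sitting in the free list is destroyed by the agent. Whether arangojs passes this option through agentOptions unchanged, and how it reacts to the same timeout firing during a slow in-flight request, are assumptions that would need checking. The 60-second margin is an arbitrary illustrative value.

```javascript
const http = require("http");

// Assumption: ArangoDB closes idle connections after 300 s (its default keep-alive timeout).
const SERVER_KEEP_ALIVE_MS = 300 * 1000;
// Hypothetical safety margin so the client gives up on an idle socket before the server does.
const CLIENT_IDLE_TIMEOUT_MS = SERVER_KEEP_ALIVE_MS - 60 * 1000;

const agentOptions = {
  keepAlive: true,
  maxSockets: 32,
  scheduling: "fifo",              // rotate through the pool instead of reusing one socket
  timeout: CLIENT_IDLE_TIMEOUT_MS, // socket inactivity timeout, in milliseconds
};

const agent = new http.Agent(agentOptions);
console.log(agent.scheduling, agent.options.timeout);
agent.destroy(); // release the agent in this standalone sketch
```

The idea is that the client discards an idle connection first, so the next request opens a fresh socket instead of writing to one the server has already closed.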

radicaled commented 1 year ago

So, setting the scheduling option to fifo seems to work, but it kind of kicks the can down the road. I.e., if we had a background process that slept or only did work sporadically, it could go more than 5 minutes without making a database request, bringing us back to ArangoDB hanging up on us.

For now we're using the following hack to keep each connection alive while our background services are running:

      // Prevent every keep-alive connection from going idle.
      // Setting ConnectionOptions.agentOptions.scheduling to `fifo` means each check uses the
      // least recently used socket, so the checks rotate through every connection in the pool.
      // 32 sockets * 5s interval = 160s. ArangoDB's default keep-alive timeout is 300s.
      // `setIntervalAsync` is presumably from the set-interval-async package;
      // `db` and `logger` are defined elsewhere.
      const checkDatabaseTimer = setIntervalAsync(async () => {
        await db.exists().catch(e => {
          logger.warn(`ArangoDB connection check failed: ${e.message}`);
        });
      }, 5 * 1000);
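The arithmetic in the comment above can be written out as a small helper, which may be useful when the pool size or the server timeout changes. This is only a sketch: both function names and the 0.8 safety factor are made up for illustration, and it assumes the checks are the only traffic and that each check occupies exactly one socket.

```javascript
// Worst-case time (ms) for a fifo pool to cycle through every socket when
// periodic checks are the only traffic and each check uses exactly one socket.
function fullRotationMs(poolSize, pingIntervalMs) {
  return poolSize * pingIntervalMs;
}

// Largest check interval (ms) that still touches every socket before the
// server-side keep-alive timeout, scaled by a safety factor below 1.
function maxPingIntervalMs(poolSize, keepAliveTimeoutMs, safetyFactor = 0.8) {
  return Math.floor((keepAliveTimeoutMs * safetyFactor) / poolSize);
}

console.log(fullRotationMs(32, 5 * 1000));      // 160000 ms, i.e. the 160s from the comment
console.log(maxPingIntervalMs(32, 300 * 1000)); // 7500 ms with the 0.8 safety factor
```

With 32 sockets and a 300-second timeout, the 5-second interval used above leaves a comfortable margin; a larger pool would need a proportionally shorter interval.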

pluma4345 commented 5 months ago

The upcoming 9.0.0 release replaces http.request/xhr with native fetch. This changes how network requests are issued, which may solve this issue. Can you please try the pre-release version by installing arangojs@next and see if that fixes your problem?

radicaled commented 2 months ago

Sorry for the late reply.

We rely on some settings native to the Node.js http.Agent that don't seem to have an analogue in the Node.js fetch implementation (undici), so I can't really test this.

We'll probably be staying on a pre-9.x version of arangojs for as long as we use ArangoDB.

pluma4345 commented 2 months ago

@radicaled There's a workaround to modify the agent used by Node.js fetch: https://github.com/arangodb/arangojs?tab=readme-ov-file#nodejs-with-self-signed-https-certificates

I'm closing this issue then. Feel free to reopen this if the problem occurs in v9.

arangodb / arangojs

Better handling of arangojs' connection pool to avoid "socket hang up?" #791