apollographql / federation

🌐  Build and scale a single data graph across multiple services with Apollo's federation gateway.
https://apollographql.com/docs/federation/

Undici removeEventListener issue during outage #3099

Open chrisulloa opened 4 months ago

chrisulloa commented 4 months ago

Issue Description

An Apollo outage on July 3rd caused issues communicating with Uplink, forcing our server to crash.

We use undici as the primary fetch library for Apollo Federation. We are in the process of integrating with Apollo Enterprise, and during a reported outage our gateway began to crash in our lower environment. Apollo's services recovered, but our gateway had crashed and impacted some testing in our lower environments. If we were using Apollo in production, I would anticipate this being a major outage for us.

I'm mainly asking whether this is a known issue that others have run into, or whether there's potentially an issue with using undici as the fetcher library.

Node version 18

/node_modules/undici/lib/web/fetch/request.js:57
      this.removeEventListener('abort', abort)
           ^
TypeError: this.removeEventListener is not a function
    at EventEmitter.abort (/node_modules/undici/lib/web/fetch/request.js:57:12)
    at EventEmitter.emit (node:events:519:28)
    at EventEmitter.emit (node:domain:488:12)
    at AbortSignal.dispatchEvent (/node_modules/node-abort-controller/index.js:28:23)
    at AbortController.abort (/node_modules/node-abort-controller/index.js:58:17)
    at Timeout._onTimeout (/node_modules/@apollo/gateway/src/supergraphManagers/UplinkSupergraphManager/loadSupergraphSdlFromStorage.ts:130:16)
    at listOnTimeout (node:internal/timers:573:17)
    at processTimers (node:internal/timers:514:7)

Looking at the error source, this appears to be related to the request timeout: on timeout the gateway aborts the request through the node-abort-controller polyfill, and when the abort handler that undici registered fires, its call to this.removeEventListener('abort', abort) throws because that method does not exist on the polyfill's internal EventEmitter (sketched below, after the links).

https://github.com/apollographql/federation/blob/38beaa2f753eaf31be74b272ea62f3c1948c854f/gateway-js/src/supergraphManagers/UplinkSupergraphManager/loadSupergraphSdlFromStorage.ts#L130

https://github.com/southpolesteve/node-abort-controller/blob/10e0cea66a069d9319f948d055621e1d37aea5db/index.js#L16
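
To make the mismatch concrete, here's a minimal sketch of what I believe is happening (based on my reading of the polyfill source, not taken from the gateway code): register an 'abort' listener the way undici does, then abort the polyfilled controller. When the listener fires, it is invoked through the polyfill's internal EventEmitter rather than a DOM AbortSignal, so undici's this.removeEventListener('abort', abort) has nothing to call.

// Sketch of the suspected this-binding mismatch; assumes node-abort-controller is installed.
import NodeAbortController from "node-abort-controller";

const controller = new NodeAbortController.AbortController();

// undici's fetch registers its abort handler roughly like this and later calls
// this.removeEventListener('abort', abort) from inside the handler.
controller.signal.addEventListener("abort", function abort() {
  // With the polyfill, listeners are re-emitted through an internal EventEmitter,
  // so `this` is that emitter and has no removeEventListener method.
  console.log(typeof this.removeEventListener); // expected: "undefined"
});

// Mimics the gateway's request timeout aborting the in-flight Uplink fetch.
controller.abort();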

Link to Reproduction

N/A

Reproduction Steps

Service Dependencies:

    "dependencies": {
        "@apollo/gateway": "^2.8.3",
        "@apollo/server": "^4.10.4",
        "@aws-sdk/client-ssm": "^3.577.0",
        "body-parser": "^1.20.2",
        "cors": "^2.8.5",
        "dd-trace": "^5.0.0",
        "express": "^4.19.2",
        "fs": "0.0.1-security",
        "global-agent": "^3.0.0",
        "http-status-codes": "^2.1.4",
        "jsonwebtoken": "^9.0.0",
        "lodash": "^4.17.21",
        "nodejs-base64": "^2.0.0",
        "superagent": "^8.0.0",
        "swagger-ui-express": "^5.0.0",
        "undici": "^6.16.1",
        "winston": "^3.8.1",
        "ws": ">=6.2.2",
        "yamljs": "^0.3.0"
    },

Gateway init:

import { fetch } from 'undici';
import { ApolloGateway } from '@apollo/gateway';

const gateway = new ApolloGateway({
  fetcher: fetch,
  logger: appLogger,
});

Node version: 18

chrisulloa commented 3 months ago

I was able to collect a bit more information here and recreate the issue locally.

I created a fresh Apollo Gateway using the minimal dependencies above, plus a test server that returns the supergraph SDL successfully on the first call and then times out on all subsequent calls.

import express from "express";
import fs from "fs/promises";

// Resolves after the given number of milliseconds.
const timeout = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

const app = express();

let called = false;
app.use("/uplink-test", async (req, res) => {
  if (!called) {
    called = true;
    console.log("Sending valid response");
    const data = await fs.readFile(
      "/Users/user/test-server/source/supergraph.graphql",
      {
        encoding: "utf-8",
      }
    );

    res.status(200);
    res.send({
      data: {
        __typename: "Query",
        routerConfig: {
          __typename: "RouterConfigResult",
          id: "123",
          supergraphSdl: data,
        },
      },
    });
  } else {
    console.log("Starting timeout");
    await timeout(60000);
    res.status(200);
    res.send({});
  }
});

// Port matches the uplinkEndpoints setting in the gateway config below.
app.listen(8111);

My Apollo Gateway starts up successfully on the first call, but it crashes when a subsequent call times out.

import { fetch } from "undici";
const gateway = new ApolloGateway({
  uplinkEndpoints: ["http://localhost:8111/uplink-test"],
  debug: true,
  fetcher: fetch
});
/node_modules/undici/lib/web/fetch/request.js:57
      this.removeEventListener('abort', abort)
           ^
TypeError: this.removeEventListener is not a function
    at EventEmitter.abort (/node_modules/undici/lib/web/fetch/request.js:57:12)
    at EventEmitter.emit (node:events:519:28)
    at EventEmitter.emit (node:domain:488:12)
    at AbortSignal.dispatchEvent (/node_modules/node-abort-controller/index.js:28:23)
    at AbortController.abort (/node_modules/node-abort-controller/index.js:58:17)
    at Timeout._onTimeout (/node_modules/@apollo/gateway/src/supergraphManagers/UplinkSupergraphManager/loadSupergraphSdlFromStorage.ts:130:16)
    at listOnTimeout (node:internal/timers:573:17)
    at processTimers (node:internal/timers:514:7)

When I remove undici and fall back to the default fetcher, the server starts up correctly, and when subsequent calls to the endpoint time out the gateway logs an error instead of crashing:

const gateway = new ApolloGateway({
  uplinkEndpoints: ["http://localhost:8111/uplink-test"],
  debug: true
});
2024-08-01 11:38:01.064 ERROR - UplinkSupergraphManager failed to update supergraph with the following error: An error occurred while fetching your schema from Apollo: The user aborted a request.

I think it's fair to say that undici is broken as a fetcher for Apollo Gateway because of the missing removeEventListener. If Uplink starts timing out for whatever reason, the gateway will crash whenever undici is used as the fetcher.
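
If anyone needs to keep undici in the meantime, one possible stopgap (just a sketch I haven't battle-tested; wrappedFetch is my own name) is to wrap undici's fetch so the gateway's polyfilled signal never reaches undici directly:

import { fetch as undiciFetch } from 'undici';
import { ApolloGateway } from '@apollo/gateway';

// Bridge whatever signal the gateway passes (possibly the node-abort-controller
// polyfill) onto a native AbortController, so undici only ever sees a real AbortSignal.
const wrappedFetch = (url, init = {}) => {
  const { signal, ...rest } = init;
  if (!signal) return undiciFetch(url, rest);

  const controller = new AbortController();
  if (signal.aborted) {
    controller.abort();
  } else {
    signal.addEventListener('abort', () => controller.abort());
  }

  return undiciFetch(url, { ...rest, signal: controller.signal });
};

const gateway = new ApolloGateway({
  fetcher: wrappedFetch,
});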

squirrelchew commented 3 months ago

Yep, I'm seeing a similar issue. Our setup is a bit more complex though -- node20, AWS lambdas, supergraph behind an AWS API gateway.

If the lambda is kept alive with a request every 3 minutes, then usually after about 15 minutes the AWS API Gateway starts returning a 502 because the supergraph crashes.

It was crashing on the usage reporting call up to Apollo, so I removed that. Now it crashes in the same function mentioned above, loadSupergraphSdlFromStorage.

I thought it had to do with the lambda filesystem being ephemeral while the function is kept hot for an extended period, given it's presumably reading from local storage, but I guess it might not be that.

The weird part about my version of this error is that the default timeout of 30 seconds is not being hit. AWS cloudwatch logs show the failure happening milliseconds after the request comes in.

This is happening in our production instance, and has been for quite some time -- a coworker suspects it has to do with whatever got sunset on or around June 27th, but our logs don't go back that far.

Since node-abort-controller is unnecessary after node14, would using the native AbortController() work?

squirrelchew commented 3 months ago

oop, I wrote this in response to a post that got deleted --> As it stands we're explicitly passing in node's native fetch. Looks like Apollo is using node-fetch 2.7.0; I can import that and pass it in explicitly instead, sure. I'll run some tests today.

In the meantime I altered the apollo code in the node_modules directory to just use the native AbortController() and ran my 3-minute poll for 5 hours without failures. So, there's that. Gotta do some more testing.

chrisulloa commented 3 months ago

@squirrelchew I think you might have some luck explicitly passing in node-fetch 2.7.0, but I ran into this issue with undici, not native node fetch.

squirrelchew commented 3 months ago

Hmm, I was under the impression node18+ uses undici under the hood.

Anyways, my lambda timeouts were a bit too short. Bumped those up and errors are being exposed a little better.

With node-fetch as the fetcher for the server + gateway, and Apollo's code untouched so it still uses node-abort-controller, the supergraph straight up fails to start:

ERROR   UplinkSupergraphManager failed to update supergraph with the following error: An error occurred while fetching your schema from Apollo: Expected signal to be an instanceof AbortSignal
ERROR   An error occurred during Apollo Server startup. All GraphQL requests will now fail. The startup error was: An error occurred while fetching your schema from Apollo: Expected signal to be an instanceof AbortSignal
ERROR   An error occurred during Apollo Server startup. All GraphQL requests will now fail. The startup error was: An error occurred while fetching your schema from Apollo: Expected signal to be an instanceof AbortSignal

Going back to passing in node's native fetch works at first.

This really points me at node-abort-controller. To my inexperienced eyes, the native AbortController and node-abort-controller's implementation of it are distinctly different at runtime:

[screenshot comparing a native AbortController instance with node-abort-controller's at runtime]

but drilling through what I can find here they do ultimately appear to be the same, so I'm at a loss there. This big scary comment block seems to have some level of relevance, too.

I'm unconvinced the fetcher itself is the problem in our case, beyond the fact that anything other than node's native fetch seems to trigger it. It really appears to be the usage of node-abort-controller, which, again, openly states you should not need it after node14 because node has AbortController as a built-in global.


testing so far:

Next up is to re-remove the node-abort-controller stuff from Apollo and test the various fetchers again.

chrisulloa commented 3 months ago

@squirrelchew I ran into this error too using node-fetch; the issue was esbuild minification. You could use --keep-names to avoid the breakage that minification causes in node-fetch. If you minify using another library, try to find a way to preserve function names, because node-fetch detects an AbortSignal by its constructor name.
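
For reference, with esbuild's JS API that looks roughly like this (a sketch; entry point and output paths are placeholders for whatever your build uses):

// build.mjs: sketch of an esbuild config that preserves function/class names,
// so node-fetch's AbortSignal detection (which relies on the constructor name)
// keeps working after bundling.
import { build } from 'esbuild';

await build({
  entryPoints: ['src/server.js'],
  bundle: true,
  platform: 'node',
  target: 'node18',
  minify: true,
  keepNames: true, // equivalent of the --keep-names CLI flag
  outfile: 'dist/server.js',
});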

squirrelchew commented 3 months ago

Our esbuild minify flag is set to false in our serverless yaml:

  esbuild:
    bundle: true
    minify: false
    sourcemap: true
    exclude: ['aws-sdk']
    target: 'node20'
    define: { 'require.resolve': undefined }
    platform: 'node'
    concurrency: 10

this is pretty standard procedure here; this portion of the yaml has been passed around since node12 deployments as far as I can see.

I've been running a test against the default fetcher (which uses node-fetch inside these plugins that otherwise fail) for the last 1.5 hours without issue. The difference is that I removed all references to the node-abort-controller package from the Apollo code I have deployed.

I'll try node-fetch directly in a little bit here and report back, but I have a suspicion it'll work just fine since the undefined/default fetcher is working great so far!

There are just so many moving parts in my case though, what with the lambdas and APIG. I need to simplify this into something I can reproduce locally, as you have above. I'll likely get to work on this while some of these other tests are running in the AWS deployment.

chrisulloa commented 3 months ago

@squirrelchew yeah, I think ultimately it comes down to Apollo not properly handling timeouts consistently across undici vs node-fetch vs native fetch, plus issues with the AbortSignal polyfill.

squirrelchew commented 3 months ago

It certainly feels that way. I don't feel good about opening a PR or whatever in public here to add a handler for it, so for the time being I'm going to fork apollo, add handling for using the native AbortController in the case of node16+, and see where that gets us.

If that fixes it and I can prove that it fixes it, it's a very simple change, assuming parsing process.version is a kosher thing to do!

squirrelchew commented 3 months ago

Alright, so changing the import of node-abort-controller to:

import NodeAbortController from 'node-abort-controller'

and then changing the instantiation of it to:

const nodeMajor = parseInt(process.versions.node.split('.')[0], 10)
const controller = nodeMajor >= 16 ? new AbortController() : new NodeAbortController.AbortController();

Seems to have fixed our problem after some 8 hours of testing. This exists in a few places in @apollo/server and one place in @apollo/gateway.

It seems fine with both the default apollo fetcher as well as explicitly passing in node's fetch.

There's a type import for the signal from node-abort-controller in the drainHttpServer plugin that I'm not sure how to handle, but it appears unused in our case so that's fine enough.
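
For anyone wanting to apply the same patch, the shape of it is roughly this (a sketch of what I changed in my fork; the makeAbortController helper name is mine, not Apollo's):

// Sketch of the patched controller creation used wherever @apollo/server or
// @apollo/gateway previously instantiated node-abort-controller's AbortController.
import NodeAbortController from 'node-abort-controller';

const nodeMajor = parseInt(process.versions.node.split('.')[0], 10);

// Prefer the built-in AbortController on Node 16+, fall back to the polyfill otherwise.
function makeAbortController() {
  return nodeMajor >= 16
    ? new AbortController()
    : new NodeAbortController.AbortController();
}

const controller = makeAbortController();
// ...pass controller.signal to the fetcher and call controller.abort() on timeout as before.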