Federated gateway - 500 error on whole graph if one underlying graph/service is down

saul-data commented 4 years ago

Hi there, I have been trying out federated gateway and I am worried about the entire service going down if one graph is not running.

If you have a number of micro graphql services underneath that can cause a lot of problems. I tried turning off / on services while starting up federated gateway. It should really pick up the services more gracefully.

For example, the error below caused the entire GraphQL endpoint to return a 500 internal error, instead of just reporting that one service is down and allowing all the other services to operate as normal. It should rather be an online/offline kind of thing.

Error checking for changes to service definitions: Couldn't load service definitions for "testgraphLocal" at http://test-graph:8099/query: request to http://test-graph:8099/query failed, reason: getaddrinfo ENOTFOUND test-graph
This data graph is missing a valid configuration. Couldn't load service definitions for "testgraphLocal" at http://test-graph:8099/query: request to http://test-graph:8099/query failed, reason: getaddrinfo ENOTFOUND test-graph

KJulien2 commented 4 years ago

+1

Maybe Managed federation is able to better manage offline services: https://www.apollographql.com/docs/studio/managed-federation/overview/

MrSaints commented 4 years ago

+1 it should also retry in background if it fails.

erdaltsksn commented 4 years ago

+1

Bhoomikapanwar commented 3 years ago

Adding +1 to this issue.

Also, adding some debugging notes here:

Scenario: when a single operation contains multiple queries that span two services, say A and B, the gateway creates a query plan like this: Parallel { Fetch A Fetch B }

If either service is down, the gateway fails the entire operation.

Findings:

In the case of parallel fetch execution, the gateway expects all the promises (fetches) to resolve; if any promise rejects, it fails the entire operation. So when one of the services is down, the fetch throws an error that propagates all the way up, which rejects the promise and hence fails the entire operation.

What we need is exception handling at this place --> https://github.com/apollographql/federation/blob/main/gateway-js/src/datasources/RemoteGraphQLDataSource.ts#L170, or somewhere higher up the chain, so that the error is handled gracefully and the promise still resolves.
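The idea in the findings above can be sketched outside the gateway. This is only an illustration of catching each fetch individually so the parallel step always resolves; `fetchAllTolerant`, `NamedFetch` and `PartialResult` are hypothetical names and are not part of the gateway's API.

```typescript
// Hypothetical stand-in for one subgraph fetch in a Parallel { ... } step.
interface NamedFetch {
  service: string;
  run: () => Promise<Record<string, unknown>>;
}

interface PartialResult {
  data: Record<string, unknown>;
  errors: { service: string; message: string }[];
}

// Run all fetches in parallel, but turn each rejection into an error entry
// so that one failing service cannot reject the whole operation.
async function fetchAllTolerant(fetches: NamedFetch[]): Promise<PartialResult> {
  const result: PartialResult = { data: {}, errors: [] };
  await Promise.all(
    fetches.map(async ({ service, run }) => {
      try {
        // Merge this service's fields into the combined response.
        Object.assign(result.data, await run());
      } catch (err) {
        const message = err instanceof Error ? err.message : String(err);
        result.errors.push({ service, message });
      }
    }),
  );
  return result;
}
```

With this shape, a caller gets the data from the services that answered plus an errors list naming the ones that did not, which is closer to the usual GraphQL partial-response convention than a blanket 500.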

Alfarlost commented 3 years ago

+1

ccmcbeck commented 3 years ago

This issue is exacerbated by Kubernetes. When we do our bi-weekly PROD cluster roll (which takes about an hour), we see this error on about 30% of our requests. Our graphql gateway service averages about 25 pods and federates 10 microservices which average ~10 pods each. When Kubernetes terminates a microservice pod in the midst of federation, the gateway pod returns this 500 error and remains in an "unhealthy" state.

This also happens when we scale down after a spike in traffic as Kubernetes terminates the excess pods.

We realize that polling microservices in a dynamic environment like Kubernetes will always be prone to this type of error. One mitigation strategy might be a single, reflexive retry of the failing federation request. An additional mitigation strategy is for the failing gateway pod to "fail its health check" and trigger Kubernetes to restart the pod. We know these strategies won't eliminate the issue, but they would reduce it to a manageable level.
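The "fail its health check" strategy could be sketched roughly as follows. This is an assumption-laden illustration, not something the gateway provides: `GatewayHealth`, its method names and the failure threshold are all hypothetical, and wiring it to the schema polling and to an HTTP liveness endpoint is left out.

```typescript
// Tracks consecutive schema-load failures (hypothetical helper).
class GatewayHealth {
  private consecutiveFailures = 0;
  private readonly maxFailures: number;

  constructor(maxFailures = 3) {
    this.maxFailures = maxFailures;
  }

  // Call after a successful service-definition load.
  recordSuccess(): void {
    this.consecutiveFailures = 0;
  }

  // Call whenever loading service definitions fails.
  recordFailure(): void {
    this.consecutiveFailures += 1;
  }

  // A liveness endpoint would answer 200 while this is true and 503 otherwise,
  // letting Kubernetes restart a pod that is stuck in the failed state.
  isHealthy(): boolean {
    return this.consecutiveFailures < this.maxFailures;
  }
}
```

Counting consecutive failures rather than any single failure keeps one transient error during a cluster roll from restarting every gateway pod at once.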

Of course, the real solution is to use a graphql registry. Unfortunately, the current Apollo enterprise pricing model puts this outside our budget. We understand the Apollo business model is to provide enough open-source functionality to get us hooked on the platform (we love you, Apollo, so mission accomplished) and for us to eventually upgrade to a paid platform when we reach scale (which we want to do). We also appreciate the fact that Apollo may not be ready to cannibalize their enterprise business by offering lower-tier or a la carte pricing for folks like us.

So, this leaves us with these alternatives:

1. Write our own federation service (which we did before Apollo came out with their federation)
2. Work with the Apollo open-source community to make the current Federation more resilient
3. Petition Apollo to provide a solution for small-to-mid-tier companies that cannot afford enterprise pricing
4. Look elsewhere for a more budget-friendly registry solution

And these observations:

We really don't want to do 1
Apollo shouldn't want us to do 4
What is needed is some combination of 2 and 3

Full disclosure -- we knew this day would come. Our bet was that Apollo would saturate the enterprise market and begin to go down-market before we scaled to a level where we needed a dedicated registry. Experience tells me there are quite a few companies like ours that are facing this challenge.

davidpickavance commented 3 years ago

It sounds like you would get what you need via managed federation, which is available on the free tier of Apollo GraphQL (as well as the paid tiers)

Is that what you are looking for?

michael-watson commented 3 years ago

You can also provide your own way to pull together a configuration for the gateway using updateServiceDefinitions (see here). This was designed to be used for environments that are locked down.

ccmcbeck commented 3 years ago

> It sounds like you would get what you need via managed federation, which is available on the free tier of Apollo GraphQL (as well as the paid tiers)
>
> Is that what you are looking for?

Thanks for the response. I'll make some enquiries about this with our Apollo rep.

ccmcbeck commented 3 years ago

@davidpickavance, managed federation is what we want, but at an entry level price of $50k per year, it's outside our budget. It's bundled with a bunch of services we don't need and Apollo is not quite ready for a la carte pricing.

ayushnawani commented 3 years ago

@ccmcbeck @saul-gush did you get any solution for this issue? were you able to use updateServiceDefinitions or any other workaround? we are also facing the same issue of keeping the gateway up if one service is down.

saul-data commented 3 years ago

@ayushnawani Nope, I never found a solution, and I'm looking into what alternatives there could be. We still find problems where the gateway unexpectedly hangs, so we need to manually restart our Kubernetes pods. I am also worried about the speed: we are seeing the underlying Golang graphql services respond in 9ms, but the gateway is adding a good 100 to 300ms in some cases.

ferwasy commented 3 years ago

At Galley Solutions we found a way to solve this by overriding the ApolloGateway.loadServiceDefinitions() method, loading each service definition individually and then composing a schema with the available underlying services only:

async loadServiceDefinitions(
    config: RemoteGatewayConfig | ManagedGatewayConfig
  ): ReturnType<Experimental_UpdateServiceDefinitions> {
    if (isRemoteConfig(config)) {
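      // Starts as true, so the `||` below can never turn it false:
      // every load is reported as a new schema.
      let isNewComposedSchema = true;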
      const loadedServiceDefinitions: ServiceDefinition[] = [];
      const serviceDefinitionsLoadErrors: ServiceDefinitionLoadError[] = [];
      // Iterate through each service definition
      for (const serviceDefinition of config.serviceList) {
        // Create an individual service config with the current service only
        const individualServiceConfig = {
          ...config,
          serviceList: [serviceDefinition]
        };
        try {
          // Try to load the individual service definition
          const loadedServiceDefinition = await super.loadServiceDefinitions(
            individualServiceConfig
          );
          if (!isNil(loadedServiceDefinition.serviceDefinitions)) {
            // Store the individual loaded service definition
            loadedServiceDefinitions.push(
              loadedServiceDefinition.serviceDefinitions[0]
            );
            isNewComposedSchema =
              isNewComposedSchema || loadedServiceDefinition.isNewSchema;
          }
        } catch (error) {
          // This is the key. Store the error and the associated service definition,
          // but DO NOT THROW. So we can build a schema with the available services.
          // If we throw, the whole graph will be unavailable.
          serviceDefinitionsLoadErrors.push({
            serviceDefinition,
            error: error.message
          });
        }
      }
      if (!isEmpty(serviceDefinitionsLoadErrors)) {
        // Log all the service definitions and potentially do something else with them
        for (let serviceDefinitionLoadError of serviceDefinitionsLoadErrors) {
          logger.error(serviceDefinitionLoadError.error);
        }
      }
      // Return the composed schema made of individual loaded service definitions.
      const composedSchema = {
        serviceDefinitions: loadedServiceDefinitions,
        isNewSchema: isNewComposedSchema
      };
      return composedSchema;
    } else {
      return super.loadServiceDefinitions(config);
    }
  }

kodeine commented 3 years ago

@ferwasy can we add a retry in this as well?

ferwasy commented 3 years ago

@kodeine Yes, sure. There are several retry strategies that can be implemented.
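One of the simpler strategies is retry with exponential backoff. The following is a minimal sketch, not code from the override above: `withRetry` and its default values are hypothetical, and in that override the call it would presumably wrap is `super.loadServiceDefinitions(individualServiceConfig)`.

```typescript
// Retry an async operation, waiting longer after each failed attempt.
// `attempts` and `baseDelayMs` are illustrative defaults.
async function withRetry<T>(
  operation: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 200,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await operation();
    } catch (err) {
      lastError = err;
      if (attempt < attempts - 1) {
        // The delay doubles each time: baseDelayMs, 2x, 4x, ...
        const delay = baseDelayMs * 2 ** attempt;
        await new Promise<void>((resolve) => setTimeout(resolve, delay));
      }
    }
  }
  // Every attempt failed: surface the last error to the caller.
  throw lastError;
}
```

Because the helper rethrows once the attempts are exhausted, the surrounding try/catch in the override would still record the service as unavailable instead of failing the whole graph.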

ccmcbeck commented 3 years ago

@ayushnawani, we reverted to a static schema on one of our GraphQL gateways. It's not as bad as it sounds, but it's not optimal. We actually have 2 Backend For Frontend (BFF) GraphQL gateways:

BFF for Mobile Apps that is high traffic but a thin slice of our GQL schema. This is now static so there is no runtime problem.
BFF for Internal Web Apps uses full GQL schema. This is still dynamic and we deal with the runtime problem by restarting.

But the solution from https://github.com/ferwasy is interesting.

StarpTech commented 3 years ago

@ccmcbeck we were also aware of that issue and created https://github.com/StarpTech/graphql-registry It'd be great to get feedback.

Apollo example: https://github.com/StarpTech/graphql-registry/tree/main/examples/apollo-federation

ghiyaom commented 3 years ago

> At Galley Solutions we found a way to solve this by overriding the ApolloGateway.loadServiceDefinitions() method, loading each service definition individually and then composing a schema with the available underlying services only:

How does this handle dependent types from different services? Say you have services A and B, some types in B depend on types from A, and A is not running. Would you get an error because service B cannot resolve all fields, or does it just ignore the dependent fields?

ferwasy commented 3 years ago

@ghiyaom this only works at the network level. In that case I'm afraid that the types that depend on service A won't be available.

ozanturhan commented 2 years ago

We ran into this issue recently. In our case we have independent applications and services. We don't want all of our services to become unreachable because of one of them. I mean, if there is an issue in one service, it shouldn't affect our other services.

So we decided to run health checks against our services before passing them to Apollo Gateway. I think it is a workaround, but at least it solves our problem. It has reduced this kind of error a lot.

But we still sometimes get this error. It is really interesting, because the service is working and its health check URL is reachable. However, the gateway cannot load the schema definitions.

Here is an example:

  import { ApolloGateway } from '@apollo/gateway';

  const services = [
    { name: 'service1', url: 'http://localhost:4001' },
    { name: 'service2', url: 'http://localhost:4002' },
    { name: 'service3', url: 'http://localhost:4003' },
    // more services
  ];

  // Resolve to the service if its health check passes, otherwise to false.
  // fetch is assumed to be available (global in Node 18+, or via node-fetch).
  // A 4xx/5xx response is treated as unhealthy.
  const healthCheckRequests = services.map(async service => {
    try {
      const res = await fetch(`${service.url}/.well-known/apollo/server-health`);
      if (!res.ok) throw new Error(`HTTP ${res.status}`);
      await res.json();
      return service;
    } catch {
      // Todo: send notification or something like that
      console.log(` bff-${service.name} unreachable`);
      return false;
    }
  });

  // Drop the services whose health check failed.
  const serviceList = (await Promise.all(healthCheckRequests)).filter(Boolean);

  const gateway = new ApolloGateway({
    serviceList,
  });

Maybe we could add some kind of retry mechanism in our gateway.

jyling commented 1 year ago

Hi, for anyone still encountering this issue: I decided to look into the source code and made some changes that allow the gateway to run even when one of the services is down.

I overrode IntrospectAndCompose: basically, I copied the code of IntrospectAndCompose 1 to 1 and added a new function called checkSubgraphs.

 public async checkSubgraphs(subgraphs: Service[]) {
    const results = await Promise.all(
      subgraphs.map(async (subgraph) => {
        const subgraphResult = await new Promise((res) =>
          axios
            .post(subgraph?.url, {
              query: `query { __typename }`,
              variables: {},
            })
            .then(() => res(true))
            .catch(() => res(false))
        );

        return { ...subgraph, success: subgraphResult };
      })
    );
    return results.filter((result) => result.success);
  }

Then I updated updateSupergraphSdl:

private async updateSupergraphSdl() {
    const activeServiceList = await this.checkSubgraphs(this.subgraphs!);

    const result = await loadServicesFromRemoteEndpoint({
      serviceList: activeServiceList,
      getServiceIntrospectionHeaders: async (service) => {
        return typeof this.config.introspectionHeaders === 'function'

Basically, this causes the system to behave as follows:

If the service is down at startup, its graph won't be included, but it will be included as soon as the service is up (assuming you have polling turned on)
If the service is up and then goes down, nothing happens to the gateway: you will get a 503 when you call the API, but the gateway stays running
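The checkSubgraphs probe above passes no timeout option, so a subgraph that accepts the connection but never answers could hold up the whole check. A possible variation is sketched below without any HTTP library; `probeWithTimeout`, `filterReachable`, the `Subgraph` shape and the 2000 ms default are all hypothetical, and the actual probe (for example the `__typename` query) is passed in by the caller.

```typescript
interface Subgraph {
  name: string;
  url: string;
}

// Resolve to false if `probe` rejects or does not settle within `timeoutMs`.
function probeWithTimeout(
  probe: () => Promise<boolean>,
  timeoutMs: number,
): Promise<boolean> {
  return new Promise<boolean>((resolve) => {
    const timer = setTimeout(() => resolve(false), timeoutMs);
    Promise.resolve()
      .then(probe)
      .then((ok) => resolve(ok), () => resolve(false))
      .then(() => clearTimeout(timer));
  });
}

// Keep only the subgraphs whose probe succeeds in time.
async function filterReachable(
  subgraphs: Subgraph[],
  probe: (subgraph: Subgraph) => Promise<boolean>,
  timeoutMs = 2000,
): Promise<Subgraph[]> {
  const flags = await Promise.all(
    subgraphs.map((s) => probeWithTimeout(() => probe(s), timeoutMs)),
  );
  return subgraphs.filter((_, i) => flags[i]);
}
```

Taking the probe as a parameter keeps the timeout logic independent of whichever HTTP client the gateway code already uses.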

dberardo-com commented 1 year ago

is this feature planned for release ?

trevor-scheer commented 1 year ago

@dberardo-com there's no feature proposed here and we're not actively working on this issue. If someone wanted to draft a PR to improve the behavior of IntrospectAndCompose I'd be happy to discuss what kind of changes we'd entertain. As mentioned above, this can be implemented by users today by implementing your own SupergraphManager (like extending IntrospectAndCompose).

dberardo-com commented 1 year ago

are you referring to this strategy? https://github.com/apollographql/federation/issues/355#issuecomment-1488122731

is this possible to achieve in the context of nestjs integration: https://docs.nestjs.com/graphql/federation#federated-example-gateway-1 or does this need to be a sort of fork of the original apollo/gateway library ?

thanks

dberardo-com commented 11 months ago

OK, I figured out how to implement the solution by @jyling: it is a simple "extension" of the current IntrospectAndCompose class, which is then used inside the gateway definition:

class CustomIntrospectAndCompose extends IntrospectAndCompose {
... code from:  https://github.com/apollographql/federation/issues/355#issuecomment-1488122731

then

supergraphSdl: new CustomIntrospectAndCompose({
.... usual options

it seems to work as intended, very nice

> there's no feature proposed here and we're not actively working on this issue

How come? Don't you see any value in this? I have seen this question popping up around the gql community @trevor-scheer

trevor-scheer commented 11 months ago

@dberardo-com sorry I didn't respond to your previous message, but I'm glad you arrived at a solution.

Generally speaking, our efforts are directed at improving router. We aren't actively developing and improving gateway (though it is still maintained) except when it also benefits router (i.e. query planner improvements affect both runtimes). As mentioned above, if someone wanted to open a PR to improve this I'd be happy to facilitate that process.

moinologics commented 7 months ago

@dberardo-com how is it possible to override a private method like updateSupergraphSdl?

dberardo-com commented 7 months ago

@moinologics try this one out and let me know if it worked: https://gist.github.com/dberardo-com/e8ecab26ab16c0753f9f441f820f3cb6

apollographql / federation

Federated gateway - 500 error on whole graph if one underlying graph/service is down #355