Open saul-data opened 4 years ago
+1
Maybe Managed federation is able to better manage offline services: https://www.apollographql.com/docs/studio/managed-federation/overview/
+1 it should also retry in background if it fails.
+1
Adding +1 to this issue.
Also, adding some debugging notes here:
Scenario: When a single operation contains multiple queries that resolve against two services, say A and B, the gateway creates a query plan something like this:
Parallel { Fetch A Fetch B }
If either service is down, the gateway fails the entire operation.
Findings:
For parallel fetch execution, the gateway expects all the promises (fetches) to resolve, and if any promise rejects it fails the entire operation. So when one of the services is down, the fetch throws an error, that exception propagates all the way up, the promise is rejected, and the entire operation fails with it.
What we need is exception handling at this place --> https://github.com/apollographql/federation/blob/main/gateway-js/src/datasources/RemoteGraphQLDataSource.ts#L170, or somewhere further up the chain, so that the error is handled gracefully and the promise still resolves.
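To illustrate the finding above, here is a minimal sketch of parallel fetches where a rejected fetch becomes an error entry instead of rejecting the whole operation. This is not the gateway's executor; `fetchAllTolerant` and `SubgraphResult` are hypothetical names used only for this example.

```typescript
// Hypothetical stand-ins: these names are illustrative and are not part of
// @apollo/gateway.
type SubgraphResult<T> =
  | { service: string; ok: true; data: T }
  | { service: string; ok: false; error: string };

// Run one fetch per service in parallel. A rejected (or throwing) fetch is
// converted into an error entry instead of rejecting the combined promise.
async function fetchAllTolerant<T>(
  fetchers: Record<string, () => Promise<T>>
): Promise<SubgraphResult<T>[]> {
  const pending = Object.keys(fetchers).map((service) =>
    // Promise.resolve().then(...) also captures synchronous throws.
    Promise.resolve()
      .then(() => fetchers[service]())
      .then(
        (data): SubgraphResult<T> => ({ service, ok: true, data }),
        (err: unknown): SubgraphResult<T> => ({
          service,
          ok: false,
          error: err instanceof Error ? err.message : String(err),
        })
      )
  );
  // Every element of `pending` always resolves, so Promise.all cannot reject.
  return Promise.all(pending);
}
```

With this shape, the caller can return the data from A and report B as an error in the same response. `Promise.allSettled` would give a similar result with less code.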
+1
This issue is exacerbated by Kubernetes. When we do our bi-weekly PROD cluster roll (which takes about an hour), we see this error on about 30% of our requests. Our graphql gateway service averages about 25 pods and federates 10 microservices which average ~10 pods each. When Kubernetes terminates a microservice pod in the midst of federation, the gateway pod returns this 500 error and remains in an "unhealthy" state.
This also happens when we scale down after a spike in traffic as Kubernetes terminates the excess pods.
We realize that polling microservices in a dynamic environment like Kubernetes will always be prone to this type of error. One mitigation strategy might be a single, reflexive retry of the failing federation request. An additional mitigation strategy is for the failing gateway pod to "fail its health check" and trigger Kubernetes to restart the pod. We know these strategies won't eliminate the issue, but they would reduce it to a manageable level.
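The "fail its health check" idea could be sketched as a small counter of consecutive federation failures that a liveness endpoint consults. The class name `FederationHealth` and the default threshold of 3 are assumptions for illustration; how the status code is exposed to a Kubernetes probe is left to the HTTP server in use.

```typescript
// Hypothetical helper: tracks consecutive schema-load failures so that a
// liveness endpoint can report the pod as unhealthy after too many in a row.
class FederationHealth {
  private consecutiveFailures = 0;
  private readonly maxConsecutiveFailures: number;

  // The default threshold of 3 is an arbitrary assumption.
  constructor(maxConsecutiveFailures: number = 3) {
    this.maxConsecutiveFailures = maxConsecutiveFailures;
  }

  // Call after a successful schema load; resets the counter.
  recordSuccess(): void {
    this.consecutiveFailures = 0;
  }

  // Call after a failed schema load.
  recordFailure(): void {
    this.consecutiveFailures += 1;
  }

  isHealthy(): boolean {
    return this.consecutiveFailures < this.maxConsecutiveFailures;
  }

  // Status code a liveness endpoint could return: 200 while healthy, 503 otherwise.
  statusCode(): number {
    return this.isHealthy() ? 200 : 503;
  }
}
```

Once the endpoint starts returning 503, Kubernetes would restart the pod according to the probe's configured failure threshold.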
Of course, the real solution is to use a graphql registry. Unfortunately, the current Apollo enterprise pricing model puts this outside our budget. We understand the Apollo business model is to provide enough open-source functionality to get us hooked on the platform (we love you, Apollo, so mission accomplished) and for us to eventually upgrade to a paid platform when we reach scale (which we want to do). We also appreciate the fact that Apollo may not be ready to cannibalize their enterprise business by offering lower-tier or a la carte pricing for folks like us.
So, this leaves us with these alternatives:
And these observations:
Full disclosure -- we knew this day would come. Our bet was that Apollo would saturate the enterprise market and begin to go down-market before we scaled to a level where we needed a dedicated registry. Experience tells me there are quite a few companies like ours that are facing this challenge.
It sounds like you would get what you need via managed federation, which is available on the free tier of Apollo GraphQL (as well as the paid tiers)
Is that what you are looking for?
You can also provide your own way to pull together a configuration for the gateway using updateServiceDefinitions (see here). This was designed to be used for environments that are locked down.
Thanks for the response. I'll make some enquiries about this with our Apollo rep.
@davidpickavance, managed federation is what we want, but at an entry level price of $50k per year, it's outside our budget. It's bundled with a bunch of services we don't need and Apollo is not quite ready for a la carte pricing.
@ccmcbeck @saul-gush did you find a solution for this issue? Were you able to use updateServiceDefinitions or any other workaround? We are also facing the same issue of keeping the gateway up if one service is down.
@ayushnawani Nope, never found a solution, and I'm looking into what alternatives there could be. We still find problems where the gateway unexpectedly hangs, so we need to manually restart our Kubernetes pods. I am also worried about the speed: we are seeing the underlying Golang graphql services at 9ms, but the gateway is adding a good 100 to 300ms in some cases.
At Galley Solutions we found a way to solve this by overriding the ApolloGateway.loadServiceDefinitions() method, loading each service definition individually and then composing a schema with the available underlying services only:
async loadServiceDefinitions(
  config: RemoteGatewayConfig | ManagedGatewayConfig
): ReturnType<Experimental_UpdateServiceDefinitions> {
  if (isRemoteConfig(config)) {
    let isNewComposedSchema = true;
    const loadedServiceDefinitions: ServiceDefinition[] = [];
    const serviceDefinitionsLoadErrors: ServiceDefinitionLoadError[] = [];
    // Iterate through each service definition
    for (const serviceDefinition of config.serviceList) {
      // Create an individual service config with the current service only
      const individualServiceConfig = {
        ...config,
        serviceList: [serviceDefinition]
      };
      try {
        // Try to load the individual service definition
        const loadedServiceDefinition = await super.loadServiceDefinitions(
          individualServiceConfig
        );
        if (!isNil(loadedServiceDefinition.serviceDefinitions)) {
          // Store the individual loaded service definition
          loadedServiceDefinitions.push(
            loadedServiceDefinition.serviceDefinitions[0]
          );
          isNewComposedSchema =
            isNewComposedSchema || loadedServiceDefinition.isNewSchema;
        }
      } catch (error) {
        // This is the key. Store the error and the associated service definition,
        // but DO NOT THROW. So we can build a schema with the available services.
        // If we throw, the whole graph will be unavailable.
        serviceDefinitionsLoadErrors.push({
          serviceDefinition,
          error: error.message
        });
      }
    }
    if (!isEmpty(serviceDefinitionsLoadErrors)) {
      // Log all the service definitions and potentially do something else with them
      for (let serviceDefinitionLoadError of serviceDefinitionsLoadErrors) {
        logger.error(serviceDefinitionLoadError.error);
      }
    }
    // Return the composed schema made of individual loaded service definitions.
    const composedSchema = {
      serviceDefinitions: loadedServiceDefinitions,
      isNewSchema: isNewComposedSchema
    };
    return composedSchema;
  } else {
    return super.loadServiceDefinitions(config);
  }
}
@ferwasy can we add a retry in this as well?
@kodeine Yes, sure. There are several retry strategies that can be implemented.
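As one example of such a strategy, here is a minimal sketch of retrying an async task with exponential backoff. `withRetry`, `RetryOptions`, and the injectable `sleep` are hypothetical names for illustration, not part of the gateway; the doubling delay is just one possible policy.

```typescript
interface RetryOptions {
  attempts: number; // total number of tries, including the first one
  baseDelayMs: number; // delay before the first retry; doubles on each retry
  sleep?: (ms: number) => Promise<void>; // injectable, mainly for tests
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Run `task`, retrying on rejection with delays of base, 2*base, 4*base, ...
// Rethrows the last error once all attempts are used up.
async function withRetry<T>(
  task: () => Promise<T>,
  options: RetryOptions
): Promise<T> {
  const sleep = options.sleep ?? defaultSleep;
  const attempts = Math.max(1, options.attempts);
  let lastError: unknown;
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      // Only wait if another attempt is still to come.
      if (attempt < attempts - 1) {
        await sleep(options.baseDelayMs * 2 ** attempt);
      }
    }
  }
  throw lastError;
}
```

In the override above, the call to super.loadServiceDefinitions(individualServiceConfig) could presumably be wrapped in such a helper, so a service is only skipped after its retries are exhausted.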
@ayushnawani, we reverted to a static schema on one of our GraphQL gateways. It's not as bad as it sounds, but it's not optimal. We actually have 2 Backend For Frontend (BFF) GraphQL gateways:
But the solution from https://github.com/ferwasy is interesting.
@ccmcbeck we were also aware of that issue and created https://github.com/StarpTech/graphql-registry It'd be great to get feedback.
Apollo example: https://github.com/StarpTech/graphql-registry/tree/main/examples/apollo-federation
How does this handle dependent types from different services? Like say you have services A and B and some types in B depend on some types from A and A in this case is not running. Would you get an error for service B not being able to resolve all fields? or does it just ignore the dependent fields?
@ghiyaom this only works at the network level. In that case I'm afraid that the types that depend on service A won't be available.
We ran into this issue recently. In our case we have independent applications and services. We don't want all our services to become unreachable because of one of them. I mean, if there is an issue in one service, it shouldn't affect our other services.
So we decided to run health checks on our services before passing them to Apollo Gateway. I think it is a workaround, but at least it solves our problem. It has reduced this kind of error a lot.
But we still sometimes get this error. It is really interesting, because the service is working and its health check URL is reachable. However, the gateway cannot load the schema definitions.
Here is an example:
const services = [
  { name: 'service1', url: 'http://localhost:4001' },
  { name: 'service2', url: 'http://localhost:4002' },
  { name: 'service3', url: 'http://localhost:4003'},
  // more services
];
const healthCheckRequests = services.map(
  service =>
    new Promise(resolve => {
      fetch(`${service.url}/.well-known/apollo/server-health`)
        .then(res => res.json())
        .then(() => resolve(service))
        .catch(() => {
          // Todo: send notification or something like that
          console.log(` bff-${service.name} unreachable`);
          resolve(false);
        });
    }),
);
const serviceList = await Promise.all(healthCheckRequests).then(result => result.filter(Boolean));
const gateway = new ApolloGateway({
  serviceList,
});
Maybe we could add some kind of retry mechanism in our gateway.
Hi, for anyone still encountering this issue: I decided to look into the source code and made some changes which allow the server to run even when one of the servers is down.
I override IntrospectAndCompose. Basically, I copied the code of IntrospectAndCompose one-to-one and added a new function called checkSubgraphs.
public async checkSubgraphs(subgraphs: Service[]) {
  const results = await Promise.all(
    subgraphs.map(async (subgraph) => {
      const subgraphResult = await new Promise((res) =>
        axios
          .post(subgraph?.url, {
            query: `query { __typename }`,
            variables: {},
          })
          .then(() => res(true))
          .catch(() => res(false))
      );
      return { ...subgraph, success: subgraphResult };
    })
  );
  return results.filter((result) => result.success);
}
Then I updated updateSupergraphSdl:
private async updateSupergraphSdl() {
const activeServiceList = await this.checkSubgraphs(this.subgraphs!);
const result = await loadServicesFromRemoteEndpoint({
serviceList: activeServiceList,
getServiceIntrospectionHeaders: async (service) => {
return typeof this.config.introspectionHeaders === 'function'
Basically, this will cause the system to do a few things
is this feature planned for release ?
@dberardo-com there's no feature proposed here and we're not actively working on this issue. If someone wanted to draft a PR to improve the behavior of IntrospectAndCompose I'd be happy to discuss what kind of changes we'd entertain. As mentioned above, this can be implemented by users today by implementing your own SupergraphManager (like extending IntrospectAndCompose).
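For readers unfamiliar with that option, here is a rough sketch of what a custom manager might look like. It keeps serving the last supergraph SDL that loaded successfully when a later load fails. The types are local stand-ins: the sketch assumes a SupergraphManager exposes an `initialize` hook that receives an `update` callback and resolves to `{ supergraphSdl, cleanup }`. The `loadSdl` function, the class name, and the polling interval are hypothetical. Verify the exact hook signature against the @apollo/gateway version in use.

```typescript
// Local stand-ins for the parts of the gateway's hook types used here
// (assumed shape, not imported from @apollo/gateway).
type UpdateFn = (supergraphSdl: string) => void;

interface InitializeResult {
  supergraphSdl: string;
  cleanup?: () => Promise<void>;
}

class LastKnownGoodSupergraph {
  private lastGoodSdl: string | null = null;
  private update: UpdateFn | null = null;
  private timer: ReturnType<typeof setInterval> | null = null;
  private readonly loadSdl: () => Promise<string>;
  private readonly pollIntervalMs: number;

  // `loadSdl` is a hypothetical function that introspects and composes the
  // subgraphs and rejects when that fails.
  constructor(loadSdl: () => Promise<string>, pollIntervalMs: number = 10000) {
    this.loadSdl = loadSdl;
    this.pollIntervalMs = pollIntervalMs;
  }

  async initialize(options: { update: UpdateFn }): Promise<InitializeResult> {
    this.update = options.update;
    // The first load has nothing to fall back to, so a failure propagates.
    const sdl = await this.loadSdl();
    this.lastGoodSdl = sdl;
    this.timer = setInterval(() => {
      void this.refresh();
    }, this.pollIntervalMs);
    return {
      supergraphSdl: sdl,
      cleanup: async () => {
        if (this.timer !== null) clearInterval(this.timer);
      },
    };
  }

  // Resolves to true only when a new SDL was loaded and pushed to the gateway.
  async refresh(): Promise<boolean> {
    try {
      const sdl = await this.loadSdl();
      if (sdl === this.lastGoodSdl) return false;
      this.lastGoodSdl = sdl;
      if (this.update) this.update(sdl);
      return true;
    } catch {
      // Load failed: keep serving the last known good SDL.
      return false;
    }
  }
}
```

This trades freshness for availability: while a subgraph is unreachable, the gateway keeps the previous schema, so requests routed to that subgraph would still fail individually.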
are you referring to this strategy? https://github.com/apollographql/federation/issues/355#issuecomment-1488122731
is this possible to achieve in the context of nestjs integration: https://docs.nestjs.com/graphql/federation#federated-example-gateway-1 or does this need to be a sort of fork of the original apollo/gateway library ?
thanks
ok, figured out how to implement the solution by @jyling , it is a simple "extension" of the current IntrospectAndCompose class and use that one inside the gateway definition:
class CustomIntrospectAndCompose extends IntrospectAndCompose {
... code from: https://github.com/apollographql/federation/issues/355#issuecomment-1488122731
then
supergraphSdl: new CustomIntrospectAndCompose({
.... usual options
it seems to work as intended, very nice
there's no feature proposed here and we're not actively working on this issue
How come? Don't you see any value in this? I have seen this question popping up around the gql community @trevor-scheer
@dberardo-com sorry I didn't respond to your previous message, but I'm glad you arrived at a solution.
Generally speaking, our efforts are directed at improving router. We aren't actively developing and improving gateway (though it is still maintained) except when it also benefits router (i.e. query planner improvements affect both runtimes). As mentioned above, if someone wanted to open a PR to improve this I'd be happy to facilitate that process.
@dberardo-com how is it possible to override a private method like updateSupergraphSdl?
@moinologics try this one out and let me know if it worked: https://gist.github.com/dberardo-com/e8ecab26ab16c0753f9f441f820f3cb6
Hi there, I have been trying out federated gateway and I am worried about the entire service going down if one graph is not running.
If you have a number of micro graphql services underneath that can cause a lot of problems. I tried turning off / on services while starting up federated gateway. It should really pick up the services more gracefully.
For example: the below caused the entire graphql to result in a 500 internal error (instead of just reporting an error that one service is down and allowing all the other services to operate as normal). It should rather be an online/offline kind of thing.
Error checking for changes to service definitions: Couldn't load service definitions for "testgraphLocal" at http://test-graph:8099/query: request to http://test-graph:8099/query failed, reason: getaddrinfo ENOTFOUND test-graph This data graph is missing a valid configuration. Couldn't load service definitions for "testgraphLocal" at http://test-graph:8099/query: request to http://test-graph:8099/query failed, reason: getaddrinfo ENOTFOUND test-graph