Support for circuit breaking in subgraph requests #2251

Open · lennyburdette opened this issue 1 year ago

lennyburdette commented 1 year ago

Some customers are blocked on migrating from Gateway to Router because they've implemented client-side circuit breaking in their RemoteGraphQLDataSource subclasses.

**Describe the solution you'd like**
A circuit breaking section of the traffic shaping configuration. I'm not entirely sure what that configuration would look like, but Envoy has an example.
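To make the ask concrete, here is a purely hypothetical sketch of what such a section could look like, loosely modeled on Envoy's circuit-breaker thresholds. The `circuit_breaking` keys and values are invented for illustration and nested under the router's existing `traffic_shaping` structure:

```yaml
# Hypothetical -- the circuit_breaking keys below do not exist in the router today.
# Shape loosely modeled on Envoy's circuit breaker thresholds.
traffic_shaping:
  all:
    circuit_breaking:
      max_failures: 50      # failed/timed-out requests before the circuit opens
      failure_window: 10s   # sliding window over which failures are counted
      open_duration: 30s    # how long to reject requests before half-open probes
  subgraphs:
    products:               # per-subgraph override
      circuit_breaking:
        max_failures: 10
```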

**Describe alternatives you've considered**
Some customers can use their service mesh, but not all can.

**Additional context**
I wonder if we can make this more GraphQL-aware. Subgraphs are usually stateless, horizontally scalable services that depend on underlying data sources (REST APIs, databases, etc.). We could provide more benefit by protecting those data services without making the subgraphs implement circuit breakers themselves.

We could count and identify problems at a finer granularity than the whole subgraph (for example, per field or per entity fetch) and trip the circuit for isolated sections of the graph. This lessens the blast radius of the circuit breaker (otherwise errors in a single resolver could "take down" an entire subgraph).
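One speculative way that scoping could be expressed in configuration; the schema-coordinate key and everything under `circuit_breaking` below are invented for illustration and are not part of any real router configuration:

```yaml
# Invented for illustration -- not real router configuration.
# A breaker scoped to one field: tripping it nulls only that section of
# the graph instead of failing every fetch to the subgraph.
traffic_shaping:
  subgraphs:
    products:
      circuit_breaking:
        scopes:
          - coordinate: "Product.reviews"  # hypothetical schema-coordinate scope
            max_failures: 5
            open_duration: 15s
```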

smyrick commented 1 year ago

Just as a callout: we had one large enterprise customer who was using circuit breakers in the Gateway but decided to go forward with the Router anyway, with no circuit breaking. They were actually surprised to find that they still have no issues sending requests to subgraphs even when those subgraphs are erroring out.

The reason they used circuit breaking in the Gateway was not necessarily to protect the subgraph service, which is already throwing errors, but to manage the added latency you can expect when a large portion of requests to a subgraph fail. With a 5s timeout on those requests, most likely every request going to the subgraph will sit there and wait the full 5 seconds.

In the Gateway, this tied up valuable connection pools and event-loop time, but the Router's async, multi-threaded connection handling is far more performant and can actually scale out to absorb it.

In some instances this means you might even recover faster than with a circuit breaker, since you are no longer throttling traffic down to a low percentage.

THIS DOES NOT SOLVE THE ISSUE, but it does mean that the Router was able to scale out to handle the increased error rate and timeouts where the Gateway would have shut down.

theJC commented 1 year ago

> Reason we use CB in a Gateway/Router is not to protect the subgraph service

I think Fowler would disagree 😉

https://martinfowler.com/bliki/CircuitBreaker.html

> On their own, circuit breakers help reduce resources tied up in operations which are likely to fail. You avoid waiting on timeouts for the client, and a broken circuit avoids putting load on a struggling server.

The reason we implemented circuit breakers in our Gateway's remote data sources was to do both: reduce resource utilization on the Gateway server AND avoid continuing to dump gasoline on a dumpster fire when a subgraph is having a really bad day.

Fluxx commented 9 months ago

I will chime in with our use case where I think this pattern makes sense, albeit perhaps with more client control.

We have an internal GraphQL client with a complex, latency-sensitive query that hits multiple subgraphs. They'd like to guarantee their query resolves within a certain time t. They can obviously do that with their own HTTP request timeout, but that fails the entire query, even if just one field/subgraph took > t to respond.

A circuit breaker solution as described here (router config) would solve their problem, as we could configure the router to add a timeout of < t to known latency-sensitive subgraphs. But this comes at the expense of applying that rule to all queries to that subgraph, which affects any clients who are willing to wait > t for a field to resolve, and that is undesirable for other use cases.
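If I'm reading the traffic-shaping docs right, that blanket per-subgraph rule is roughly what the router's existing timeout settings already express (the subgraph name below is made up); the limitation is exactly that it applies to every query:

```yaml
# Per-subgraph timeout via traffic shaping; "reviews" is a made-up
# subgraph name. The rule applies to every query touching that subgraph,
# which is the drawback described above.
traffic_shaping:
  all:
    timeout: 30s      # default budget for all subgraph requests
  subgraphs:
    reviews:
      timeout: 500ms  # tighter budget for the latency-sensitive subgraph
```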

An ideal solution for us would be to allow clients to use directives to mark the fields they would like to time out, and for how long. The client knows that any field they mark with the directive may come back null because it timed out, but every field that did not time out will still resolve and the query will be returned to them.

The @defer-with-timeout idea described in https://github.com/apollographql/apollo-feature-requests/issues/406 would, I think, probably meet our needs here. Our client could mark any fields they are okay with timing out with @defer(timeout: t), and the non-deferred fields would return first, with the deferred fields incrementally being returned or timed out. I don't know enough about the implementation of @defer to know if it supports the notion of timing out. From what I recall, over HTTP @defer uses a multipart response, so to support timeouts the server would have to detect when a timeout has happened and finish the multipart request.

FWIW, our plan internally had been to implement our own custom @timeout directive that clients could use to say how long they are willing to wait for a field to resolve before it times out. Our servers would look for that directive and apply the timeout, and if the field does time out, it would return null and we'd add info to the errors portion of the response.
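As a sketch of what that could look like on the wire; the directive name, its argument, and the field names are all hypothetical:

```graphql
# Hypothetical @timeout directive -- not implemented anywhere today.
# A field that exceeds its budget resolves to null and contributes an
# entry to the "errors" array; the rest of the query is unaffected.
query ProductPage {
  product(id: "1") {
    name                          # no timeout: client waits as long as needed
    reviews @timeout(ms: 500) {   # nulled if resolving takes longer than 500ms
      rating
    }
  }
}
```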