Open · smyrick opened this issue 1 year ago
I agree that the underlying need is desirable, and this is something that has been mentioned a few times, though in a slightly different form.
Why would it ever make sense to "defer" the `inStock` field in the example above? Surely, that's because that field is somewhat costly to resolve. If it isn't, that is, if there is no meaningful performance difference between getting just `Product.id` or both `id` and `inStock`, then it doesn't really make sense to do such deferring.
So fundamentally, I think this is about allowing the query planner to know the cost of various fields. And it is true that the query planner currently has to make assumptions when it tries to find "the best" plan, among them that all fields cost the same and that doing a fetch is overall a lot more costly than resolving a field (with the result that the planner optimises first and foremost for the number of fetches).
But that's obviously not true, and if the planner had access to some cost information, it could do a better job. Here, in a way, `@entityDefer` is just saying that `inStock` is very costly and so it is worth getting it in parallel with the `reviews` (or to put it another way, the cost that `inStock` has on delaying the reviews fetch is noticeable enough to justify making one more fetch, in parallel with the reviews).
Anyway, all this to say that I'd rather introduce this as a `@cost` directive or something similar, and not necessarily have it be entirely binary. Amongst other things, I'll note that in the example of the description, if you just run this query:
```graphql
query getAllProductsNoReview {
  products {
    id
    inStock
  }
}
```
then it makes no sense to "defer" `inStock` (it's just a waste of resources), which is why I don't love the idea of presenting this in terms of "asking the planner to defer": it either forces the planner to do bad things, or it gets confusing to users why the planner sometimes ignores what they tell it to do. I prefer keeping it declarative: have subgraph authors provide cost information, and let the planner decide what is best based on that.
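For illustration, a `@cost` directive along these lines might look like the sketch below; the directive definition and its `weight` argument are hypothetical, not an existing federation feature:

```graphql
# Hypothetical syntax: a relative weight per field, not a hard "defer me" flag.
directive @cost(weight: Int! = 1) on FIELD_DEFINITION

type Product @key(fields: "id") {
  id: ID!
  # Expensive to resolve, so the planner may choose to fetch it in parallel
  # with other work; with cheap siblings it can simply stay in the same fetch.
  inStock: Boolean! @cost(weight: 10)
}
```

The planner could then weigh the cost of delaying sibling fields against the overhead of an extra fetch, rather than being told outright what to defer.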
Well put! That is exactly the reason for wanting to defer, so I agree a better approach is to mark the costly fields and still let the query planner find the best/most-efficient path.
@pcmanus to provide more context on the motivation/my discussion with shane - this came up for us at Yelp, which I've distilled down here https://gist.github.com/magicmark/cbda3eedf1255334caee357fde7680de
It sounds like `@cost` could be used similar to z-indexes: not exact milliseconds or anything, just relative weightings? (Although I suppose aggregate timing information could be dumped and used too...)
Having a strong guarantee to subgraph authors that "this big scary chunk of work will be parallelized" would be awesome - we tend to think in big blocks of network waterfalls, and try to make sure everything is squished together as much as possible.
I don’t think exposing costs of fields to the planner is the right approach here. In general this speaks to what I think is the biggest issue with Federation as I have experienced it. To begin with, let me set up an example based on @smyrick's:
Assume I have a monolithic GraphQL server with the following schema, and a response time annotated after each field:
```graphql
type Query {
  product(id: ID!): Product # 1s
}

type Product {
  id: ID! # 0s (key fields are usually synchronous)
  manufacturer: Company! # 1s
  countryOfOrigin: Country! # 2s
  inStock: Boolean! # 3s
}

type Company {
  id: ID! # 0s
  name: String! # 2s
  owner: Person! # 1s
}

type Person {
  id: ID! # 0s
  name: String! # 1s
}

type Country {
  id: ID! # 0s
  name: String! # 2s
}
```
And now I execute this query against that monolithic server:
```graphql
query GetProductDetails($id: ID!) {
  product(id: $id) {
    inStock
    manufacturer {
      name
      owner {
        name
      }
    }
    countryOfOrigin {
      name
    }
  }
}
```
The performance of this query is simply the maximum time it takes to resolve any given root-to-leaf field path:
| Field Path | Path Component Times | Total |
|---|---|---|
| `Query.product.inStock` | `Query.product` (1s) + `Product.inStock` (3s) | 4s |
| `Query.product.manufacturer.name` | `Query.product` (1s) + `Product.manufacturer` (1s) + `Company.name` (2s) | 4s |
| `Query.product.manufacturer.owner.name` | `Query.product` (1s) + `Product.manufacturer` (1s) + `Company.owner` (1s) + `Person.name` (1s) | 4s |
| `Query.product.countryOfOrigin.name` | `Query.product` (1s) + `Product.countryOfOrigin` (2s) + `Country.name` (2s) | 5s |
So, in this case the performance of the entire query is 5s, given by the most expensive leaf path, `Query.product.countryOfOrigin.name`. Because this is a monolith, `Product.inStock`, `Product.manufacturer` and `Product.countryOfOrigin` can all race in parallel efficiently.

This is very intuitive for the client: performance in general is no worse than the worst-performing leaf field on its own, and removing/adding fields that do not exceed that runtime is essentially free (consider, for example, adding or removing `Query.product.inStock` from the above query).
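The "max over leaf paths" rule can be checked with a few lines of Python; the timings are the hypothetical ones annotated in the schema above, and the helper name is mine:

```python
# Hypothetical per-field resolve times (seconds) from the example schema.
FIELD_TIMES = {
    "Query.product": 1, "Product.inStock": 3, "Product.manufacturer": 1,
    "Product.countryOfOrigin": 2, "Company.name": 2, "Company.owner": 1,
    "Person.name": 1, "Country.name": 2,
}

# Each leaf in the query, as the chain of fields resolved to reach it.
LEAF_PATHS = [
    ["Query.product", "Product.inStock"],
    ["Query.product", "Product.manufacturer", "Company.name"],
    ["Query.product", "Product.manufacturer", "Company.owner", "Person.name"],
    ["Query.product", "Product.countryOfOrigin", "Country.name"],
]

def monolith_query_time(paths, times):
    """In a monolith every sibling resolves in parallel, so total time is
    the most expensive root-to-leaf chain."""
    return max(sum(times[field] for field in path) for path in paths)

print(monolith_query_time(LEAF_PATHS, FIELD_TIMES))  # 5
```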
This makes for a very clear visual example as a Gantt chart:

Now say that the owner of the graph decides to federate their implementation, something like this:
```graphql
# Product Graph
type Query {
  product(id: ID!): Product # 1s
}

type Product {
  id: ID! # 0s (key fields are usually synchronous)
  manufacturer: Company! # 1s
  countryOfOrigin: Country! # 2s
  inStock: Boolean! # 3s
}

type Company @key(fields: "id") {
  id: ID! # 0s
}

type Country @key(fields: "id") {
  id: ID! # 0s
}

# Company Graph
type Company @key(fields: "id") {
  id: ID! # 0s
  name: String! # 2s
  owner: Person! # 1s
}

type Person @key(fields: "id") {
  id: ID! # 0s
}

# Person Graph
type Person @key(fields: "id") {
  id: ID! # 0s
  name: String! # 1s
}

# Country Graph
type Country @key(fields: "id") {
  id: ID! # 0s
  name: String! # 2s
}
```
We will assume there is no need for `__resolveReference`, and that there is zero cost hopping between servers.
Now, if the same client is to execute the same operation, computing the total runtime is a lot harder. First you need to consider the individual subgraph queries:
```graphql
query GetProductDetails($id: ID!) {
  product(id: $id) {
    inStock
    manufacturer {
      id
    }
    countryOfOrigin {
      id
    }
  }
}
```
Where the performance is given by:
| Field Path | Path Component Times | Total |
|---|---|---|
| `Query.product.inStock` | `Query.product` (1s) + `Product.inStock` (3s) | 4s |
| `Query.product.manufacturer.id` | `Query.product` (1s) + `Product.manufacturer` (1s) + `Company.id` (0s) | 2s |
| `Query.product.countryOfOrigin.id` | `Query.product` (1s) + `Product.countryOfOrigin` (2s) + `Country.id` (0s) | 3s |
For a total time in the subgraph of 4s.
```graphql
query ($representations: [_Any!]!) {
  _entities(representations: $representations) {
    ... on Company {
      name
      owner {
        id
      }
    }
  }
}
```
Where the performance is given by:
| Field Path | Path Component Times | Total |
|---|---|---|
| `Query._entities.name` | `Query._entities` (0s) + `Company.name` (2s) | 2s |
| `Query._entities.owner.id` | `Query._entities` (0s) + `Company.owner` (1s) + `Person.id` (0s) | 1s |
For 2s total.
```graphql
query ($representations: [_Any!]!) {
  _entities(representations: $representations) {
    ... on Person {
      name
    }
  }
}
```
Where the performance is given by:
| Field Path | Path Component Times | Total |
|---|---|---|
| `Query._entities.name` | `Query._entities` (0s) + `Person.name` (1s) | 1s |
For 1s total time.
```graphql
query ($representations: [_Any!]!) {
  _entities(representations: $representations) {
    ... on Country {
      name
    }
  }
}
```
Where the performance is given by:
| Field Path | Path Component Times | Total |
|---|---|---|
| `Query._entities.name` | `Query._entities` (0s) + `Country.name` (2s) | 2s |
For 2s.
The Country graph path and the Company → Person graph paths race each other:
| Field Path | Path Component Times | Total |
|---|---|---|
| `Query.product.inStock` | `Query.product.inStock` (Product graph, 4s) | 4s |
| `Query.product.manufacturer.name` | `Query.product.manufacturer.id` (Product graph, 4s) + `Company.name` (Company graph, 2s) | 6s |
| `Query.product.manufacturer.owner.name` | `Query.product.manufacturer.id` (Product graph, 4s) + `Company.owner.id` (Company graph, 2s) + `Person.name` (Person graph, 1s) | 7s |
| `Query.product.countryOfOrigin.name` | `Query.product.countryOfOrigin.id` (Product graph, 4s) + `Country.name` (Country graph, 2s) | 6s |
Our performance has got significantly worse (7s vs 5s) just by adding federation to the underlying graph. Check out the Gantt chart now:
Notice we have added additional synchronisation points all along the graph that are not necessary.
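The extra synchronisation points can be made concrete with a tiny scheduling model in Python: each fetch starts only once every fetch it depends on has fully finished. The durations are the per-subgraph totals computed above; the structure and helper names are mine:

```python
# Each fetch: (duration in seconds, fetches it must wait for).
# Durations are the per-subgraph critical paths from the tables above.
FETCHES = {
    "Product": (4, []),           # Query.product + inStock / manufacturer.id / countryOfOrigin.id
    "Company": (2, ["Product"]),  # needs Company ids from the Product response
    "Person":  (1, ["Company"]),  # needs Person ids from the Company response
    "Country": (2, ["Product"]),  # needs Country ids from the Product response
}

def finish_time(name, fetches):
    """A fetch starts at the latest finish time of its dependencies."""
    duration, deps = fetches[name]
    start = max((finish_time(d, fetches) for d in deps), default=0)
    return start + duration

total = max(finish_time(f, FETCHES) for f in FETCHES)
print(total)  # 7 (vs the monolith's 5s critical path)
```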
From the client’s perspective, the performance of a query is now incredibly opaque and confusing. Removing `Query.product.inStock` from this query results in the Product graph fetch becoming 1s faster, and even though it is not in the most expensive path, `Query.product.manufacturer.owner.name`, it results in a 1s faster query.
In fact, this leaks internal details about the subgraph structure of the services, even though we should be looking at an opaque single GraphQL API.
From the perspective of the GraphQL API’s owner it becomes hard to work out how to optimize this query too - the critical path is now given by the critical path of subgraphs and, within each subgraph, its own critical path. In my experience the vast majority of queries against federated graphs have a different critical path than they would otherwise, even if the overhead doesn’t look that bad at first glance. Tools like distributed tracing can give you a real view into the graph performance that looks very similar to the Gantt chart above.
To such a developer this actually suggests that putting more fields into the Product graph would be beneficial because that would maximise parallelisation, but I think this is unwise. Likewise, it is also possible to see that an alternative would be to give every field its own subgraph: at that point the synchronisation points become identical to the monolith case and we get back the optimal performance of the critical path. Most graphs sit somewhere between these two extremes, and so suffer from this issue.
If we mark up all the fields in the schema with a cost equivalent to their execution durations, then the planner has sufficient knowledge to split the queries to optimise the operation. In particular, it would probably make multiple requests to the Product graph:
```graphql
# query1
query GetProductDetails1($id: ID!) {
  product(id: $id) {
    inStock
  }
}

# query2
query GetProductDetails2($id: ID!) {
  product(id: $id) {
    manufacturer {
      id
    }
  }
}

# query3
query GetProductDetails3($id: ID!) {
  product(id: $id) {
    countryOfOrigin {
      id
    }
  }
}
```
The planner may decide that query 2 & 3 could be combined, but given the different costs I suspect they won’t be.
I do not think this is a good idea, because it results in the common path of the fields executing multiple times (`Query.product` is now executed three times, in three separate network calls to the service). This is of particular concern for mutations, where this would not be feasible at the first node in the query plan, which is where I'd guess most of these issues exist.

There is also the problem of what happens if the result for the common path is different in one of the query results than the others (say query3’s `Query.product` returns `null`).
If the server is using cost (either the same or different metrics) to estimate the expense of a query for purposes of rate limiting or execution size limiting, then it becomes non-trivial for the server to calculate the execution cost. It is even harder for the client to reason about, because they should not know about the internals of the implementation.
**`@defer` for Everything**

Here is my suggestion: for every request to a subgraph that supports `@defer`, the planner should issue a query like this (defer names omitted):
```graphql
query {
  __typename
  # One defer per downstream dependent key
  ... @defer {
    # fields
  }
  # One defer for all fields requested in the original request
  ... @defer {
    # fields
  }
}
```
So, for this Product query example, the router would issue the following query:
```graphql
query GetProductDetails($id: ID!) {
  __typename
  ... @defer {
    product(id: $id) {
      manufacturer {
        id
      }
    }
  }
  ... @defer {
    product(id: $id) {
      countryOfOrigin {
        id
      }
    }
  }
  ... @defer {
    product(id: $id) {
      inStock
    }
  }
}
```
This makes it the subgraph’s responsibility to handle the racing of the three selection sets. In particular, this allows it to share the single execution of `Query.product` between all three branches properly.
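Under the same toy timing model, the effect can be sketched by letting each step start as soon as the one field it depends on is ready, rather than when the whole upstream fetch finishes; the step names and helper are mine, with timings from the example:

```python
# Each step: (own duration in seconds, the single field it waits on, or None).
# With @defer, a downstream fetch starts as soon as its key field is ready,
# not when the entire upstream response completes.
STEPS = {
    "product":            (1, None),
    "inStock":            (3, "product"),
    "manufacturer.id":    (1, "product"),
    "countryOfOrigin.id": (2, "product"),
    "Company.name":       (2, "manufacturer.id"),
    "Company.owner.id":   (1, "manufacturer.id"),
    "Person.name":        (1, "Company.owner.id"),
    "Country.name":       (2, "countryOfOrigin.id"),
}

def ready_at(step, steps):
    """Time at which a step's result is available."""
    duration, dep = steps[step]
    return duration + (ready_at(dep, steps) if dep else 0)

print(max(ready_at(s, STEPS) for s in STEPS))  # 5, matching the monolith
```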
The subgraph will return the selections in the order they are ready, removing the synchronisation points and returning to the optimal 5s total performance time. The Gantt chart then looks like this:
I would expect the following query to be executed against the Company Graph:
```graphql
query ($representations: [_Any!]!) {
  __typename
  ... @defer {
    _entities(representations: $representations) {
      ... on Company {
        owner {
          id
        }
      }
    }
  }
  ... @defer {
    _entities(representations: $representations) {
      ... on Company {
        name
      }
    }
  }
}
```
The defer spec allows the implementing server to decide whether or not to respond to the `@defer` selections synchronously, so the implementing service can judge whether deferring is beneficial in the current circumstance.
Importantly, this is entirely transparent to clients of the graph, so there is no need for a client side directive to control this behaviour or to leak the implementation details of the graph to clients. As far as I can tell it could be implemented without a change to the federation spec as well.
Let me know your thoughts!
Update: I have created a POC JS router that supports calling subgraphs with defer - check it out here: https://github.com/meiamsome/federation-defer-poc
@meiamsome Thank you for the detailed explanation! That is exactly correct on the use case and problems you described, this will be really helpful for anyone else who wants to catch up.
For solutioning: what @pcmanus proposed was not about how the Router requests slow fields, but when. My initial comment proposed having this controlled by subgraph developers, explicitly marking certain fields with a new directive like `@serverDefer` or something to say these are the slow fields that the Router should request with `@defer`.
Instead, what if we considered a configurable option that could take cost estimates into account and defer subgraph requests when they go over a certain cost threshold: if subgraph request cost > 100, use `@defer` on the most expensive non-key fields.
In your case, doing `@defer` on every single field could roughly be implemented by setting that cost threshold to 1.
Maybe what we would additionally need, though, is not just a threshold for when to use defer at all, but also a max cost limit for any single subgraph request. So rather than splitting requests until every request was under 100, a max single-request cost of 1 would then basically defer everything.
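That threshold rule could be sketched roughly like this; the function name, the greedy splitting strategy, and the example costs are all hypothetical, just to make the idea concrete:

```python
def split_for_defer(field_costs, threshold):
    """Given {field: cost}, keep cheap fields in the main selection until the
    threshold is reached, and mark the remaining (most expensive) fields to be
    requested with @defer. Returns (main_fields, deferred_fields)."""
    main, deferred = [], []
    total = 0
    # Cheapest fields first, so the main selection stays under the threshold
    # and the most expensive non-key fields end up deferred.
    for field, cost in sorted(field_costs.items(), key=lambda kv: kv[1]):
        if total + cost <= threshold:
            main.append(field)
            total += cost
        else:
            deferred.append(field)
    return main, deferred

costs = {"id": 0, "manufacturer": 1, "countryOfOrigin": 2, "inStock": 3}
print(split_for_defer(costs, 3))  # (['id', 'manufacturer', 'countryOfOrigin'], ['inStock'])
print(split_for_defer(costs, 1))  # a threshold of 1 defers almost everything
```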
Let's say we have this schema across two subgraphs
**Subgraph Products**

**Subgraph Reviews**
I can write the following query and this all works as expected. The query planner is smart enough to split the products query into two separate queries and make an optimized call to the reviews subgraph, because it only needs `Product.id` to connect the two.

However, we have a user requirement that we don't defer the loading of the UI state into chunks and that we want to return everything in one response. Also, using this optimization requires clients to know to use `@defer`. Instead, if there were some schema directive we could use to indicate to the query planner that it should not wait for the entire response and should do the `@defer` optimization but still return only one response, we could better control this logic server side and give everyone the optimization even if they don't use client `@defer`.

Maybe something like `@subgraphDefer` or `@entityDefer`.
Query we want to make
So in the schema we would need something like this:
Keyword search: internal defer, entity defer, Router defer, schema defer