apollographql / federation


Defer internal subgraph requests on non-required fields #2653

Open smyrick opened 1 year ago

smyrick commented 1 year ago

Let's say we have this schema across two subgraphs

Subgraph Products

type Product @key(fields: "id") {
  id: ID!
  inStock: Boolean # really slow field
}

type Query {
  products: [Product] # fast if only asking for ids but slow to get inStock status
}

Subgraph Reviews

type Product @key(fields: "id") {
  id: ID!
  reviews: [Review]
}

type Review  @key(fields: "id") {
  id: ID!
  text: String
}

I can write the following query and this all works as expected. The query planner is smart enough to split the products query into two separate queries and make an optimized call to the reviews subgraph because it only needs Product.id to connect the two.

query GetAllProducts {
  products {
    id
    ... @defer { inStock }
    reviews { text }
  }
}
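(For reference, the subgraph fetches the planner produces for that deferred query might look roughly like the sketch below; this is illustrative only and the actual plan shape may differ.)

# Fetch 1: Products subgraph, keys and cheap fields only
{
  products {
    __typename
    id
  }
}

# Fetch 2: Reviews subgraph, using the keys from fetch 1 (runs in parallel with fetch 3)
query ($representations: [_Any!]!) {
  _entities(representations: $representations) {
    ... on Product {
      reviews { text }
    }
  }
}

# Fetch 3: Products subgraph again for the deferred field (runs in parallel with fetch 2)
query ($representations: [_Any!]!) {
  _entities(representations: $representations) {
    ... on Product {
      inStock
    }
  }
}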

However, we have a user requirement that the UI state is not loaded in chunks: everything must come back in one response. Getting this optimization today also requires clients to know to use @defer. If instead there were a schema directive telling the query planner to apply the @defer optimization internally (so it doesn't wait on the slow field before fetching from other subgraphs) while still returning a single response, we could control this logic server side and give every client the optimization even if they don't use @defer themselves.

Maybe something like @subgraphDefer or @entityDefer

Query we want to make

query GetAllProducts {
  products {
    id
    inStock # We don't want to wait here
    reviews { text }
  }
}

so in the schema we would need something like this

type Product @key(fields: "id") {
  id: ID!
  inStock: Boolean @entityDefer
}
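(If this were pursued, the directive itself would also need a definition; a minimal hypothetical declaration, since nothing like it exists in the federation spec today:)

directive @entityDefer on FIELD_DEFINITION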

Keyword search: internal defer, entity defer, Router defer, schema defer

pcmanus commented 1 year ago

I agree that the underlying need is real, and this is something that has been mentioned a few times, though in a slightly different form.

Why would it ever make sense to "defer" the inStock field in the example above? Surely because that field is somewhat costly to resolve. If it isn't, that is, if there is no meaningful performance difference between getting just Product.id versus both id and inStock, then such deferring doesn't really make sense.

So fundamentally, I think this is about letting the query planner know about the cost of various fields. It is true that the query planner currently has to make assumptions when it tries to find "the best" plan; among them are that all fields cost the same and that doing a fetch is overall a lot more costly than resolving a field (with the result that the planner optimises first and foremost for the number of fetches).

But that's obviously not always true, and if the planner had access to some cost information, it could do a better job. Here, in a way, @entityDefer is just saying that inStock is very costly and so it is worth getting it in parallel with the reviews (or, to put it another way, the delay that inStock imposes on the reviews fetch is noticeable enough to justify making one more fetch, in parallel with the reviews).

Anyway, all this to say that I'd rather introduce this as a @cost directive or something similar, and not necessarily have it be entirely binary. Amongst other things, I'll note that in the example from the description, if you just run this query:

query getAllProductsNoReview {
  products {
    id
    inStock
  }
}

then it makes no sense to "defer" inStock (it's just a waste of resources), which is why I don't love the idea of presenting this in terms of "asking the planner to defer": it either forces the planner to do bad things, or it gets confusing to users when the planner sometimes ignores what they told it to do. I prefer keeping it declarative: have subgraph authors provide cost information, and let the planner decide what is best based on that.
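As a rough, purely hypothetical sketch (no such directive is specified anywhere today), a cost annotation on the Products subgraph might look like this, with the weight being a relative unit rather than milliseconds:

# Hypothetical sketch: directive name and argument are placeholders, not part of any spec
directive @cost(weight: Int! = 1) on FIELD_DEFINITION

type Product @key(fields: "id") {
  id: ID!
  inStock: Boolean @cost(weight: 100) # costly enough that a parallel fetch may pay off
}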

smyrick commented 1 year ago

Well put! That is exactly the reason for wanting to defer, so I agree a better approach is to mark the costly fields and still let the query planner find the best/most efficient path.

magicmark commented 1 year ago

@pcmanus To provide more context on the motivation and my discussion with Shane: this came up for us at Yelp, which I've distilled down here: https://gist.github.com/magicmark/cbda3eedf1255334caee357fde7680de

It sounds like @cost could be used similarly to z-indexes, where it's not exact milliseconds, just relative weightings? (Although I suppose aggregate timing information could be dumped and used too...)

Having a strong guarantee for subgraph authors that "this big scary chunk of work will be parallelized" would be awesome - we tend to think in big blocks of network waterfalls, trying to make sure everything is squished together as much as possible.

meiamsome commented 1 year ago

I don't think exposing field costs to the planner is the right approach here. In general this speaks to what I think is the biggest issue with Federation as I have experienced it. To begin with, let me set up an example based on @smyrick's:

Assume I have a monolithic GraphQL server with the following schema, and a response time annotated after each field:

type Query {
  product(id: ID!): Product # 1s
}

type Product {
  id: ID! # 0s (Key fields are usually synchronous)
  manufacturer: Company! # 1s
  countryOfOrigin: Country! # 2s
  inStock: Boolean! # 3s
}

type Company {
  id: ID! # 0s
  name: String! # 2s
  owner: Person! # 1s
}

type Person {
  id: ID! # 0s
  name: String! # 1s
}

type Country {
  id: ID! # 0s
  name: String! # 2s
}

And now I execute this query against that monolithic server:

query GetProductDetails($id: ID!) {
  product(id: $id) {
    inStock
    manufacturer {
      name
      owner {
        name
      }
    }
    countryOfOrigin {
      name
    }
  }
}

The performance of this query is simply given by the most expensive path to any leaf field:

| Field Path | Path Component Times | Total |
| --- | --- | --- |
| Query.product.inStock | Query.product (1s) + Product.inStock (3s) | 4s |
| Query.product.manufacturer.name | Query.product (1s) + Product.manufacturer (1s) + Company.name (2s) | 4s |
| Query.product.manufacturer.owner.name | Query.product (1s) + Product.manufacturer (1s) + Company.owner (1s) + Person.name (1s) | 4s |
| Query.product.countryOfOrigin.name | Query.product (1s) + Product.countryOfOrigin (2s) + Country.name (2s) | 5s |

So in this case the performance of the entire query is 5s, given by the most expensive leaf path, Query.product.countryOfOrigin.name. Because this is a monolith, Product.inStock, Product.manufacturer and Product.countryOfOrigin can all race in parallel efficiently. This is very intuitive for the client: performance is in general no worse than the worst-performing leaf path on its own, and adding or removing fields that do not exceed that runtime is essentially free (consider, for example, adding or removing Query.product.inStock from the above query).
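In other words, assuming sibling fields resolve in parallel and per-field times simply add along a path:

$$T_{\text{query}} = \max_{p \,\in\, \text{leaf paths}} \sum_{f \in p} t_f = \max(4\text{s}, 4\text{s}, 4\text{s}, 5\text{s}) = 5\text{s}$$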

This makes for a very clear visual example as a Gantt chart:

Unfederated Gantt chart showing 5s of execution time

Now say that the owner of the graph decides to federate their implementation, something like this:

# Product Graph
type Query {
  product(id: ID!): Product # 1s
}

type Product {
  id: ID! # 0s (Key fields are usually synchronous)
  manufacturer: Company! # 1s
  countryOfOrigin: Country! # 2s
  inStock: Boolean! # 3s
}

type Company @key(fields: "id") {
  id: ID! #0s
}

type Country @key(fields: "id") {
  id: ID! #0s
}

# Company Graph
type Company @key(fields: "id") {
  id: ID! # 0s
  name: String! # 2s
  owner: Person! # 1s
}

type Person @key(fields: "id") {
  id: ID! # 0s
}

# Person Graph
type Person @key(fields: "id") {
  id: ID! # 0s
  name: String! # 1s
}

# Country Graph
type Country @key(fields: "id") {
  id: ID! # 0s
  name: String! # 2s
}

We will assume there is no need for __resolveReference, and that there is zero cost hopping between servers.

Now, if the same client is to execute the same operation, computing the total runtime is a lot harder. First you need to consider the individual subgraph queries:

Product Graph

query GetProductDetails($id: ID!) {
  product(id: $id) {
    inStock
    manufacturer {
      id
    }
    countryOfOrigin {
      id
    }
  }
}

Where the performance is given by:

| Field Path | Path Component Times | Total |
| --- | --- | --- |
| Query.product.inStock | Query.product (1s) + Product.inStock (3s) | 4s |
| Query.product.manufacturer.id | Query.product (1s) + Product.manufacturer (1s) + Company.id (0s) | 2s |
| Query.product.countryOfOrigin.id | Query.product (1s) + Product.countryOfOrigin (2s) + Country.id (0s) | 3s |

For a total time in the subgraph of 4s.

Company Graph

query ($representations: [_Any!]!) {
  _entities(representations: $representations) {
    ... on Company {
      name
      owner {
        id
      }
    }
  }
}

Where the performance is given by:

| Field Path | Path Component Times | Total |
| --- | --- | --- |
| Query._entities.name | Query._entities (0s) + Company.name (2s) | 2s |
| Query._entities.owner.id | Query._entities (0s) + Company.owner (1s) + Person.id (0s) | 1s |

For 2s total

Person Graph

query ($representations: [_Any!]!) {
  _entities(representations: $representations) {
    ... on Person {
      name
    }
  }
}

Where the performance is given by:

| Field Path | Path Component Times | Total |
| --- | --- | --- |
| Query._entities.name | Query._entities (0s) + Person.name (1s) | 1s |

For 1s total time

Country Graph

query ($representations: [_Any!]!) {
  _entities(representations: $representations) {
    ... on Country {
      name
    }
  }
}

Where the performance is given by:

| Field Path | Path Component Times | Total |
| --- | --- | --- |
| Query._entities.name | Query._entities (0s) + Country.name (2s) | 2s |

For 2s.

Overall Performance

The Country graph path and the Company → Person graph path race each other:

| Field Path | Path Component Times | Total |
| --- | --- | --- |
| Query.product.inStock | Query.product.inStock (Product graph, 4s) | 4s |
| Query.product.manufacturer.name | Query.product.manufacturer.id (Product graph, 4s) + Company.name (Company graph, 2s) | 6s |
| Query.product.manufacturer.owner.name | Query.product.manufacturer.id (Product graph, 4s) + Company.owner.id (Company graph, 2s) + Person.name (Person graph, 1s) | 7s |
| Query.product.countryOfOrigin.name | Query.product.countryOfOrigin.id (Product graph, 4s) + Country.name (Country graph, 2s) | 6s |

Our performance has got significantly worse (7s vs 5s) just by adding federation to the underlying graph. Check out the Gantt chart now:

Federated Gantt chart showing 7s of execution time
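Roughly, that 7s falls out of each downstream fetch having to wait for its entire upstream fetch to finish:

$$T_{\text{federated}} \approx T_{\text{Product}} + \max\big(T_{\text{Company}} + T_{\text{Person}},\; T_{\text{Country}}\big) = 4\text{s} + \max(2\text{s} + 1\text{s},\, 2\text{s}) = 7\text{s}$$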

Notice we have added additional synchronisation points all along the graph that are not necessary. From the client's perspective, the performance of a query is now incredibly opaque and confusing. Removing Query.product.inStock from this query makes the Product graph fetch 1s faster, and even though inStock is not on the most expensive path (Query.product.manufacturer.owner.name), the whole query becomes 1s faster.

In fact, this leaks internal details about the subgraph structure of the services, even though we should be looking at an opaque single GraphQL API.

From the perspective of the GraphQL API's owner, it also becomes hard to work out how to optimize this query: the critical path is now given by the critical path across subgraphs and, within each subgraph, that subgraph's own critical path. In my experience the vast majority of queries against federated graphs have a different critical path than they would otherwise, even if the overhead doesn't look that bad at first glance. Tools like distributed tracing can give you a real view into graph performance that looks very similar to the Gantt chart above.

To such a developer this actually suggests that putting more fields into the Product graph would be beneficial because it would maximise parallelisation, but I think this is unwise. Likewise, an alternative would be to give every field its own subgraph; at that point the synchronisation points become identical to the monolith case and we get back the optimal critical-path performance. Most graphs sit somewhere between these two extremes, and so suffer from this issue.

Addressing the cost suggestion

If we mark up all the fields in the schema with costs equivalent to their execution durations, then the planner has sufficient knowledge to split the queries to optimise the operation. In particular, it would probably make multiple requests to the Product graph:

# query1
query GetProductDetails1($id: ID!) {
  product(id: $id) {
    inStock
  }
}
# query2
query GetProductDetails2($id: ID!) {
  product(id: $id) {
    manufacturer {
      id
    }
  }
}
# query3
query GetProductDetails3($id: ID!) {
  product(id: $id) {
    countryOfOrigin {
      id
    }
  }
}

The planner may decide that queries 2 and 3 can be combined, but given their different costs I suspect they won't be.

I do not think this is a good idea because it results in the common path of the fields executing multiple times (Query.product is now executed three times, in three separate network calls to the service). This is of particular concern for mutations where this would not be feasible at the first node in the query plan, which is where I'd guess most of these issues exist.

There is also the problem of what happens if the result for the common path is different in one of the query results than in the others (say query3's Query.product returns null).

If the server is using cost (either the same or different metrics) to estimate the expense of a query for purposes of rate limiting or execution size limiting, then it becomes non-trivial for the server to calculate the execution cost. It is even harder for the client to reason about it because they should not know about the internals of the implementation.

Using @defer for Everything

Here is my suggestion:

For every request to a subgraph where both:

  1. The subgraph supports @defer
  2. The request is not a leaf node in the query plan

The planner should issue a query like this (defer names omitted):

query {
  __typename
  # One defer per downstream dependent key
  ... @defer {
    # fields
  }
  # One defer for all fields requested in the original request
  ... @defer {
    # fields
  }
}

So, for this Product query example, the router would issue the following query:

query GetProductDetails($id: ID!) {
  __typename
  ... @defer {
    product(id: $id) {
      manufacturer {
        id
      }
    }
  }
  ... @defer {
    product(id: $id) {
      countryOfOrigin {
        id
      }
    }
  }
  ... @defer {
    product(id: $id) {
      inStock
    }
  }
}

This makes it the subgraph’s responsibility to handle the racing of the three selection sets. In particular, this allows it to share the single execution of Query.product between all three branches properly.

The subgraph will return the selections in the order they are ready, removing the synchronisation points and returning to the optimal 5s total performance time. The Gantt chart then looks like this:

Proposed federated Gantt chart, showing 5s of execution time

I would expect the following query to be executed against the Company Graph:

query ($representations: [_Any!]!) {
  __typename
  ... @defer {
    _entities(representations: $representations) {
      ... on Company {
        owner {
          id
        }
      }
    }
  }
  ... @defer {
    _entities(representations: $representations) {
      ... on Company {
        name
      }
    }
  }
}

The defer spec leaves it to the implementing server to decide whether to respond to the @defer selections synchronously or incrementally, so each service can judge whether deferring is beneficial in the current circumstance.

Importantly, this is entirely transparent to clients of the graph, so there is no need for a client-side directive to control this behaviour or to leak the graph's implementation details to clients. As far as I can tell it could be implemented without a change to the federation spec as well.

Let me know your thoughts!

Update: I have created a POC JS router that supports calling subgraphs with defer - check it out here: https://github.com/meiamsome/federation-defer-poc

smyrick commented 1 year ago

@meiamsome Thank you for the detailed explanation! That is exactly right about the use case and the problems you described; this will be really helpful for anyone else who wants to catch up.

As for the solution: what @pcmanus proposed was not about how the Router requests slow fields, but when. My initial comment proposed having this controlled by subgraph developers, who would explicitly mark certain fields with a new directive like @serverDefer or something, to say these are the slow fields the Router should request with @defer.

Instead, what if we considered a configurable option that takes cost estimates into account and defers subgraph requests when they go over a certain cost threshold? For example: if a subgraph request's cost is greater than 100, use @defer on the most expensive non-key fields.

In your case, doing @defer on every single field could roughly be achieved by setting that cost threshold to 1.

Maybe what we would additionally need, though, is not just a threshold for when to use defer at all, but also a maximum cost per subgraph request. Rather than splitting requests until each one is under 100, a maximum single-request cost of 1 would then basically defer everything.
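To make that concrete with the original example, a purely hypothetical sketch: if id were costed at 0 and inStock at 120, with a threshold of 100, the Router's Products subgraph fetch might be rewritten as:

# Hypothetical: assumes both a cost annotation and a Router-side threshold, neither of which exists today
query GetAllProducts__products {
  __typename
  products {
    __typename
    id        # cheap: returned in the initial payload
  }
  ... @defer {
    products {
      inStock # pushes the fetch over the threshold, so the Router defers it internally
    }
  }
}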

magicmark commented 1 week ago

Related https://github.com/apollographql/federation/issues/3141