Overfetching on `union` types with conditional fragments in query operation

rhino88 commented 1 year ago

Issue Description

When including conditional fragments on a union field the resolvers for fragments appear to be requested unconditionally from the router.

In the example below, the resolver for Product#c is invoked when the conditional fragment for ... on ComponentC is not needed given a response without any instances of ComponentC from the card service. Removing the conditional fragment for ... on ComponentC prevents the Product#c value from being requested, as expected.

Our production router is currently using v1.20.0 but we also confirmed that this issue exists on v1.28.0. This also impacts the gateway as seen in the reproduction link.

Subgraph A Schema(Card)

``` extend schema @link( url: "https://specs.apollo.dev/federation/v2.3" import: ["@key"] ) type Query { card: ProductCard } type ProductCard @key(fields: "id") { id: ID! components: Component } type Product @key(fields: "id") { id: ID! } union Component = ComponentA | ComponentB | ComponentC type ComponentA { a: String product: Product } type ComponentB { b: String product: Product } type ComponentC { c: String product: Product } ```

Subgraph B Schema (Product)

``` extend schema @link( url: "https://specs.apollo.dev/federation/v2.3" import: ["@key", "@interfaceObject"] ) type Product @key(fields: "id") { id: ID! a: String! b: String! c: String! } ```

Query

``` query Foo { card { components { __typename ... on ComponentA { product { id a } } ... on ComponentB { product { id b } } ... on ComponentC { product { id c } } } } } ```

Query Plan

``` QueryPlan { Sequence { Fetch(service: "card") { { card { components { __typename ... on ComponentA { product { __typename id } } ... on ComponentB { product { __typename id } } ... on ComponentC { product { __typename id } } } } } }, Flatten(path: "card.components.product") { Fetch(service: "product") { { ... on Product { __typename id } } => { ... on Product { a b c } } }, }, }, } ```

Response

``` { "data": { "card": { "components": { "__typename": "ComponentA", "product": { "id": "product1", "a": "A1" } } } } } ```

Link to Reproduction

https://codesandbox.io/p/sandbox/eloquent-christian-kcs989

Reproduction Steps

Run the query above, and notice the C resolver -- BAD and B resolver -- BAD in the terminal output.

korinne commented 1 year ago

Thanks for raising this request! Our team has added the item to our backlog, but it does require a bit more design and thought before we can prioritize it. We will definitely keep this issue updated once it's been officially prioritized, and we are also open to community contributions.

XtofX commented 1 year ago

from our stand point, this issue would deserve a better priority that P4 as it affects one fundamental aspect of graphql (prevent overfetching). The resolution of this issue by Apollo is very important to us at Wayfair.

gwardwell commented 1 year ago

We're seeing a latency impact from the query planner "popping" fields from union fragments (which should be conditional on the type being present in the actual result set) up into a sequential fetch and over fetching. The reproduction looks something like this:

# Subgraph 1
type Query {
  card: ProductCard
}

type ProductCard @key(fields: "id") {
  id: ID!
  components: Component
}

type Product @key(fields: "newKey") {
  newKey: ID!
}

union Component = ComponentA | ComponentB | ComponentC

type ComponentA {
  a: String
  product: Product
}

type ComponentB {
  b: String
  product: Product
}

type ComponentC {
  c: String
  product: Product
}

# Subgraph 2
type Product @key(fields: "oldKey") @key(fields: "newKey") {
  oldKey: ID!
  newKey: ID!
  a: String!
}

# Subgraph 3
type Product @key(fields: "oldKey") {
  oldKey: ID!
  b: String!
}

# Subgraph 4
type Product @key(fields: "oldKey") {
  oldKey: ID!
  c: String!
}

# Query
query Foo {
  card {
    components {
      ... on ComponentA {
        __typename
        product {
          __typename
          a
        }
      }
      ... on ComponentB {
        __typename
        product {
          __typename
          b
        }
      }
      ... on ComponentC {
        __typename
        product {
          __typename
          c
        }
      }
    }
  }
}

Expected behavior

The expectation is that Router would:

Go to subgraph 1 to get blocks
Go to Subgraph 2 to convert newKey to oldKey
Conditionally go to Subgraph 2, 3, or 4 to get the a, b, and c fields based on the type returned in blocks
- This expectation is because, be definition, the union is of a limited number of unique types without overlap. Regardless of whether underlying fields in the union have overlap, the union members themselves do not and should be treated as unique objects. The expectation is that Router would craft a static query plan allowing it to pivot to the correct union member resolution since they represent unique outcomes.

Here's the expected query plan (or something similar to this is expected):

QueryPlan {
  Sequence {
    Fetch(service: "subgraph-1") {
      {
        card {
          components {
            __typename
            ... on ComponentA {
              __typename
              product {
                __typename
                newKey
              }
            }
            ... on ComponentB {
              __typename
              product {
                __typename
                newKey
              }
            }
            ... on ComponentC {
              __typename
              product {
                __typename
                newKey
              }
            }
          }
        }
      }
    },
    Flatten(path: "card.components.product") {
      Fetch(service: "subgraph-2") {
        {
          ... on Product {
            __typename
            newKey
          }
        } =>
        {
          ... on Product {
            __typename
            oldKey
          }
        }
      },
    },
    Parallel {
Flatten(path: "card.components.product") {
        Fetch(service: "subgraph-2") {
          {
            ... on Product {
              __typename
              newKey
            }
          } =>
          {
            ... on Product {
              __typename
              a
            }
          }
        },
      },
      Flatten(path: "card.components.product") {
        Fetch(service: "subgraph-3") {
          {
            ... on Product {
              __typename
              oldKey
            }
          } =>
          {
            ... on Product {
              __typename
              b
            }
          }
        },
      },
      Flatten(path: "card.components.product") {
        Fetch(service: "subgraph-4") {
          {
            ... on Product {
              __typename
              oldKey
            }
          } =>
          {
            ... on Product {
              __typename
              c
            }
          }
        },
      },
    },
  },
}

Actual behavior

Instead, Router does the following:

Goes to subgraph 1 to get blocks
Goes to subgraph 2 to convert newKey to oldKey. It is at this point overfetching occurs as Router pops field a out of the union and fetches it at the same time as the key conversion. In practice, this is resulting in a latency cost as field a represents a higher latency field that, in most cases, we do not want to fetch.
Goes to subgraph 3 or 4 based on the union return type.

Here's the query plan:

QueryPlan {
  Sequence {
    Fetch(service: "subgraph-1") {
      {
        card {
          components {
            __typename
            ... on ComponentA {
              __typename
              product {
                __typename
                newKey
              }
            }
            ... on ComponentB {
              __typename
              product {
                __typename
                newKey
              }
            }
            ... on ComponentC {
              __typename
              product {
                __typename
                newKey
              }
            }
          }
        }
      }
    },
    Flatten(path: "card.components.product") {
      Fetch(service: "subgraph-2") {
        {
          ... on Product {
            __typename
            newKey
          }
        } =>
        {
          ... on Product {
            __typename
            a
            oldKey
          }
        }
      },
    },
    Parallel {
      Flatten(path: "card.components.product") {
        Fetch(service: "subgraph-3") {
          {
            ... on Product {
              __typename
              oldKey
            }
          } =>
          {
            ... on Product {
              __typename
              b
            }
          }
        },
      },
      Flatten(path: "card.components.product") {
        Fetch(service: "subgraph-4") {
          {
            ... on Product {
              __typename
              oldKey
            }
          } =>
          {
            ... on Product {
              __typename
              c
            }
          }
        },
      },
    },
  },
}

Current Band-aid fixes

I have identified two band-aid fixes for this example. But, to be clear, the do not fix the problem. These are just two ways to try and force the query plan shape to change.

Option 1

If the oldKey in this situation is eliminated, then the query plan reflects the ideal state as there is no sequential fetch that causes a to pop out. It stops being always fetched. This is a challenging remedy as it requires architectural/data changes to support the new identifier.

Option 2

If field a was no longer part of Subgraph 2, but instead was moved to, say, Subgraph 5, then it would achieve the desired query plan effect. However, moving field a to, say, subgraph 3 would result in shifting the over fetching from every request to any request where the response included Component A or Component B. So it's not really a "fix" but just assertion of a small amount of influence over the query plan.

Summary

The crux of the issue is that router is overly eager to reduce subgraph requests at the cost of overfetching. This breaks a core tenant of GraphQL which is only getting what you ask for. The returned shape of the union does not dictate that field a is necessary, yet it is always fetched regardless. And this is only a single example of such behavior.

Ultimately, the Query planner needs to consider fragments on union members as unique units of the operation and should not pop fields out of those fragments. In some cases, yes it will result in additional requests that in some cases could have been avoided. But it's that or over fetching and, IMO, over fetching is the worse of two evils. This is until the query planner is able to include if/else conditions in it's static flow (which is another conversation for another day).

This is why I support elevating this higher than a P4 issue. This really does feel more like a P1 or P2 at the lowest because it will result in UI, UX, and very difficult to determine, quantify, and communicate schema design constraints. I'm already having conversations with teams about what features they need to cut, or what subgraphs need to be broken apart to try and influence the query plan to avoid this overfetching. That's not a concern that should be leaking into the domain separation.

gwardwell commented 11 months ago

FWIW, I haven't confirmed this but I imagine the problem could surface for interface fragments as well. That should probably also be considered when solving this problem.

rhino88 commented 1 day ago

The experimental_type_conditioned_fetching flag should fix this in router 1.46+

dariuszkuc commented 1 day ago

👋 looks like we forgot to close this issue with the #2949 PR.

This should be available since fed v2.7.3 and supported by router version 1.46+. Thanks for the reminder!

apollographql / federation