facebook / relay

Relay is a JavaScript framework for building data-driven React applications.
https://relay.dev
MIT License
18.4k stars 1.82k forks

Subsequent queries can cause previous data to appear "missing" due to normalization #2237

Open mike-marcacci opened 6 years ago

mike-marcacci commented 6 years ago

I have a fairly common, very reasonable use case that demonstrates what I believe to be a fundamental flaw in the relay cache/store update strategy.

Let's use these simplified types in the scenario:

type SomeObject implements Node {
    id: ID!
    name: String
    description: String
}

type SomeWrapper implements Node {
    id: ID!
    object: SomeObject
}

type Query {
    get(id: ID!): SomeWrapper
    search(term: String!): [SomeWrapper]
}

The important part here is that SomeWrapper references SomeObject.

In my use case, I have two QueryRenderers: one to fetch search results, and one to fetch all the details of a selected object. They must be independent queries for two reasons:

  1. the details query fetches MUCH more data than would be practical to return for every search result
  2. the search query should NOT be re-run each time the selected ID for the details query changes

Here's another heavily simplified description in GraphQL (in reality the queries are large, use the Connection spec, and are composed of many fragments):

query SearchQuery ($term: String!) {
    search (term: $term) {
        id
        object {
            id
            name
        }
    }
}

query DetailsQuery ($id: ID!) {
    get (id: $id) {
        id
        object {
            id
            name
            description
        }
    }
}

The important part here is that the description field is present in DetailsQuery but NOT in SearchQuery.

The problem comes when I have details pulled up for a Wrapper with ID aaa. It looks like this:

{
    "id": "aaa",
    "object": {
        "id": "111",
        "name": "Name 1",
        "description": "This is a HUGE piece of data."
    }
}

Now, let's say that I do a search and receive results which contain Wrapper with ID aaa. It looks like this:

[
    {
        "id": "aaa",
        "object": {
            "id": "111",
            "name": "Name 1"
        }
    }
]

Nothing has changed, and this is great. However, let's say that aaa has changed to reference Object with ID 222:

[
    {
        "id": "aaa",
        "object": {
            "id": "222",
            "name": "Name 2"
        }
    }
]

This becomes EXTREMELY problematic. Because 222 is new to relay, there is no cache for the description field, but the results for DetailsQuery are updated to reference it anyhow. While having a "description" go missing doesn't sound like a big deal, my real-world case has dozens of fields with nested queries. Thus, receiving this kind of search result essentially makes the entire screen go blank.
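To make the failure mode concrete, here is a minimal sketch, in plain TypeScript, of a normalized store repointing a reference. The record shapes and helper names (`publish`, `readDetails`) are invented for illustration and are not Relay's actual internals:

```typescript
// Store state after DetailsQuery: records keyed by id, references via { __ref }.
const store: Record<string, any> = {
  aaa: { id: "aaa", object: { __ref: "111" } },
  "111": { id: "111", name: "Name 1", description: "This is a HUGE piece of data." },
};

// Normalize a payload into the store, merging field-by-field per record.
function publish(payload: { id: string; object: { id: string; [k: string]: unknown } }) {
  const obj = payload.object;
  store[obj.id] = { ...store[obj.id], ...obj };         // merge the nested node
  store[payload.id] = { ...store[payload.id], id: payload.id, object: { __ref: obj.id } };
}

// Read DetailsQuery's data back out of the store.
function readDetails(id: string) {
  const wrapper = store[id];
  const object = store[wrapper.object.__ref];
  return {
    id: wrapper.id,
    object: { id: object.id, name: object.name, description: object.description },
  };
}

// The search response repoints aaa.object to the previously unseen node 222.
publish({ id: "aaa", object: { id: "222", name: "Name 2" } });

// DetailsQuery now reads description as undefined: node 222 was never
// fetched with that field, but aaa.object already points at it.
console.log(readDetails("aaa").object.description); // undefined
```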

The problem, then, is that relay seems to split the difference between two more predictable update strategies, where it would either:

  1. NOT update the DetailsQuery results, since it knows that its cache is incomplete
  2. make a follow-up query to the required node(id: ID!) query root for the fields it knows are missing

I believe I can achieve the former by using separate environments, but this feels quite dirty.

If I recall correctly, the latter was the behavior in Relay Classic. Is there any way to restore it in Relay Modern?

I would be very interested to learn the reasoning behind the current behavior, and am willing to help implement a better one if the relay team is open to the idea.

mike-marcacci commented 6 years ago

This is related to #1653

mike-marcacci commented 6 years ago

I'd like to bump this issue, as it's a pretty critical flaw, and without a good understanding of why the relay team chose this direction, it's difficult for somebody in the community to address it.

Since my issue advocates for a pretty big change, I want to make sure my ideas are compatible with your vision and goals for the project before I invest the work to make it happen. Thanks in advance!

devknoll commented 6 years ago

Hi Mike.

Could you provide a little more information about your scenario? When is ‘aaa.object’ being changed? Why is it changing? If it’s through a mutation, why isn’t the mutation fetching the new data? If it’s not a mutation, what’s changing it?

mike-marcacci commented 6 years ago

Hi @devknoll! Thanks so much for the followup, and sorry if the example above is a little hard to follow. The challenge here is that the node with ID aaa has changed on the server such that its field object references node 222 instead of 111; this change isn't caused by the client who experiences the problem, but by another user or server process. Relay hasn't ever asked the server for the description field of 222, but still updates the results of DetailsQuery with the incomplete results from the SearchQuery response, setting the description to null.

To be much more concise, the problem is that when there are multiple QueryRenderers on a page, simply receiving the results from one can produce a cache state that is incorrect for the others. Specifically, this occurs when one query returns a Node that is new to the store; any referencing Nodes are updated to point at the new Node, even though the necessary child fields have never been queried.

Hopefully that helps – please let me know if I can help further clarify anything!

mike-marcacci commented 6 years ago

Also, my question about the team's vision going forward comes from my observation that in Relay Modern, your goals included better predictability of which queries run and when they run, and to accomplish this, you moved from dynamic queries to fully-static ones. The fix for this, though, is probably going to require some kind of generated query to fill in the missing fields.

robrichard commented 6 years ago

@mike-marcacci can you just refetch your DetailsQuery after the second SearchQuery is executed? In your second proposal (making a follow up query), it sounds like the query you are expecting relay to generate would be almost identical to the DetailsQuery you already wrote.

mike-marcacci commented 6 years ago

Hi @robrichard, thanks so much for your comment! I think your suggestion is really great, and it's basically what I've been working on in a custom extension of QueryRenderer. This has the advantage of keeping everything static, while fixing the issue I describe for many use cases (particularly the simplified one I described above).

This strategy, though, doesn't work for 2 kinds of queries.

  1. Queries which are very deep or contain long lists: it would be impractical to re-run the entire query every time a descendant node is changed in this way.
  2. Queries with results that change frequently: we don't want to disorient our users by spontaneously making large, seemingly unrelated changes to the UI. (Of course, this might be addressed schema-side with more deterministic semantics, more durable cursors, etc.)

To me, the most "correct" thing to do is batch up node($id) queries with the unrequested fragments/fields. Of course, I haven't actually tried to do this, and I totally understand if those who had to build/maintain this in relay classic have discovered that it just isn't worth the complexity.

I'm going to add some tests that demonstrate the issue and make some general-purpose changes to QueryRenderer to re-run the static query. I'm still working through distinguishing between a field that has been queried and returned null and one that hasn't ever been requested, but I'll throw up a PR once it's working.
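The null-vs-unfetched distinction mentioned above can be illustrated with a toy record shape (invented here, not Relay's actual store format): an explicitly fetched null is stored as a null entry, while a never-requested field simply has no key at all.

```typescript
type StoreRecord = Record<string, unknown>;

// description was fetched and is genuinely null:
const fetchedNull: StoreRecord = { id: "111", name: "Name 1", description: null };

// description was never requested for this record:
const neverFetched: StoreRecord = { id: "222", name: "Name 2" };

// Classify a field by key presence, not just by value.
function fieldState(record: StoreRecord, field: string): "null" | "missing" | "present" {
  if (!(field in record)) return "missing"; // never written to the store
  return record[field] === null ? "null" : "present";
}

console.log(fieldState(fetchedNull, "description"));  // "null"
console.log(fieldState(neverFetched, "description")); // "missing"
console.log(fieldState(fetchedNull, "name"));         // "present"
```

The key point is that a plain `record.description == null` check cannot tell these two cases apart; key presence can.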

mike-marcacci commented 6 years ago

So I never finished writing my new QueryRenderer but am going to have to pick it back up. To make sure I'm conveying the severity of this issue: fields guaranteed by the API to be non-null can become undefined because of this, making the generated flow types completely unreliable. In my case, somebody can crash an entire app just by typing into a search field, if the result list contains an updated Node that's used elsewhere on the screen.

I'm going to put together a small app to demonstrate this issue, but I'm quite amazed that others don't appear to be running into this like crazy.

mike-marcacci commented 6 years ago

OK, I finally got around to building a full demo. Check out this repo and this video.

Again, I'd like to stress the seriousness of this, as I suspect it is crashing apps across the web under very hard-to-reproduce conditions (because it is triggered by cross-user concurrency). I also want to note that I've explored a few ideas for a fix, but that there are architectural consequences to this.

mike-marcacci commented 6 years ago

Here are my current thoughts on how this might be fixed; I wish I could have a whiteboard session with somebody more knowledgeable than me on relay's internals, but since we're just in an issue thread please take these as "first thoughts" rather than a well-constructed proposal:

Detection

This issue can arise in many different ways. While my example demonstrates it when a new Node is introduced to the store, it's possible for the Node to already exist with insufficient data and later get referenced where it doesn't belong. This makes it clear that detecting insufficient data needs to happen when the data is used.

Idea 1

The most obvious way to detect missing data is to, whenever the data is provided to a QueryRenderer, iterate through the query (with all its fragments) and check every Node for the existence of each requested field.

Pros:
Cons:

Idea 2

Another option is to flag nodes with the query and position whenever their data is returned by a particular query, using a "private" client-side field. Then, whenever data is provided to a QueryRenderer, iterate through every returned Node to check for the presence of the corresponding flag.

Pros:
Cons:

Resolution

While detecting missing data is reasonably straightforward, resolving it isn't.

Idea 1

When a QueryRenderer detects missing fields, it can re-run its query against the server. This would fill in the missing fields.

Pros:
Cons:

Idea 2

When a QueryRenderer detects missing fields, it can re-request content for the specific Node using the node(id: X) API. This is similar to how relay classic worked, but the queries could be preassembled at compile time.

Pros:
Cons:
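A rough sketch of what Resolution Idea 2 might look like: given a record and the fields a query needs, assemble a node(id:) refetch for only the missing ones. `buildNodeRefetch` and its output shape are hypothetical, not an existing Relay API.

```typescript
type StoreRecord = Record<string, unknown>;

// Build a node(id:) query requesting only the fields absent from the record.
// Returns null when the record already has everything the query needs.
function buildNodeRefetch(
  record: StoreRecord,
  typeName: string,
  requiredFields: string[],
): string | null {
  const missing = requiredFields.filter((f) => !(f in record));
  if (missing.length === 0) return null; // nothing to fetch
  return [
    `query Refetch($id: ID!) {`,
    `  node(id: $id) {`,
    `    ... on ${typeName} {`,
    missing.map((f) => `      ${f}`).join("\n"),
    `    }`,
    `  }`,
    `}`,
  ].join("\n");
}

// Node 222 arrived via SearchQuery without description:
const record222 = { id: "222", name: "Name 2" };
const refetch = buildNodeRefetch(record222, "SomeObject", ["id", "name", "description"]);
console.log(refetch); // asks only for the missing `description` field
```

In principle several such per-node queries could be batched into one request, which is roughly what I remember relay classic doing dynamically.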

After thinking about this for a bit, both Detection Idea 1 and Resolution Idea 1 seem like the simpler options, even though they aren't perfect. I would be very interested in hearing thoughts from somebody on the relay team, and if anybody would actually like to hash this out live, let me know and I can Skype or whatever.

juhaelee commented 6 years ago

I believe I am having a similar issue, not just with QueryRenderers but also with refetch containers: https://github.com/facebook/relay/issues/2425

sibelius commented 6 years ago

You can use this https://github.com/facebook/relay/issues/2100#issuecomment-330034633 as a workaround

juhaelee commented 6 years ago

@sibelius I have found that this workaround does not work for me. Is relay supposed to completely replace the node in the store when using the refetch container? Why would it not just fetch the data and merge it into the existing store?

Since relay is replacing the entire node, many fields become undefined in the store after the refetch. The only workaround I'm seeing is including all the fields within the refetch fragment. However, as @mike-marcacci pointed out, it's not practical to fetch all the details when there's a lot of data in certain situations

sibelius commented 5 years ago

Relay team is working on fragment ownership model that could probably solve this issue

> Added fragment ownership model to relay-runtime: fragments can now point to the query that owns them, which removes reliance on React Context and gives us flexibility to experiment with new apis.

josephsavona commented 5 years ago

> I'd like to bump this issue, as it's a pretty critical flaw and without a good understanding of why the relay team chose to go this direction, it's difficult for somebody in the community to address it.

First, thanks for filing this issue. We informally call this the "missing data" scenario (an incredibly imaginative name, I know): some component fetches a field on a related object (say, user.bestFriend.name), some update occurs where the relationship has changed (user.bestFriend is someone else now) but doesn't fetch the name field, and the first component observes that user.bestFriend.name is now missing. This can definitely be surprising if you aren't expecting it and seem like a flaw: however, this behavior is expected and by design.

The reason is consistency: the alternative (using the same example) is that when the update occurs and user.bestFriend has changed, the original component isn't notified and continues to show the previous friend. In other words: the main alternative is for the UI to be inconsistent.

In client-side caches there is a fundamental tradeoff between completeness (do we have all the data each component requested), consistency (do all components represent a consistent view of the world), and performance (do we achieve those properties in a reasonable amount of time). In general a system can achieve two out of three. For example, to prioritize completeness and performance you can simply fetch each view's data in full, up front, in a single round trip, and never update that data. To prioritize completeness and consistency you can refetch anytime data goes missing - at the cost of possibly infinite refetches if data keeps changing on the server (and note that if you care about consistency, you can't just apply the first update because that wouldn't be consistent - you have to wait to show the effects of an update until you know you're not missing any more data). Relay chooses a third route: prioritize consistency and performance - in our experience, cases of "missing data" due to relationship changes are relatively rare and also relatively easy to work around when they occur (for example, fetch the same fragment in both places to ensure data stays in sync).

That said, we haven't documented this tradeoff clearly anywhere so it can be surprising when it occurs. We should definitely document it somewhere - we'd appreciate feedback about where folks would intuitively expect to see this (in the architecture docs? but would enough people read those?). Also note that we're working on new react-relay APIs that use Relay's knowledge of which fields are missing (Snapshot.isMissingData is true whenever we encounter an unfetched field while reading a fragment). We're also working on better debug logging for when this case occurs.
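As a rough illustration of the isMissingData idea (this is an invented reader, not Relay's Snapshot implementation), a fragment read can flag any selected field that was never written to the record:

```typescript
type StoreRecord = Record<string, unknown>;

// Read a set of selected fields from a record; report whether any selected
// field was never fetched, similar in spirit to Snapshot.isMissingData.
function read(
  record: StoreRecord,
  selections: string[],
): { data: Record<string, unknown>; isMissingData: boolean } {
  const data: Record<string, unknown> = {};
  let isMissingData = false;
  for (const field of selections) {
    if (field in record) {
      data[field] = record[field];
    } else {
      data[field] = undefined;
      isMissingData = true; // an unfetched field was encountered
    }
  }
  return { data, isMissingData };
}

const node222 = { id: "222", name: "Name 2" }; // never fetched with description
console.log(read(node222, ["id", "name"]).isMissingData);                // false
console.log(read(node222, ["id", "name", "description"]).isMissingData); // true
```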

In summary: this is a fundamental aspect of Relay's design that we do not plan to change. We're very open to documentation changes to make this behavior more clear, and we're working on some implementation changes that make it more obvious when this occurs (and when it does, why).

josephsavona commented 5 years ago

> Relay team is working on fragment ownership model that could probably solve this issue

@sibelius You're correct that fragment ownership is related - it's part of the changes that make it easier to detect and debug this case - but it doesn't prevent this from occurring.

jckw commented 5 years ago

Just to confirm: this does make the auto-generated types nothing more than suggestions, right?

Since we're effectively allowing Partial types to be returned at any point because of the "missing data" problem, there aren't any guarantees, which is not much better than just using any (aside from the IDE benefits).

josephsavona commented 5 years ago

> Just to confirm: this does make the auto-generated types nothing more than suggestions, right?

Not quite: any can be anything; the situation here is about nullability. Generated types are mostly accurate. The primary reason they would diverge is that generated types reflect the nullability of fields in the schema, while due to the missing data scenario fields can in practice be missing at any time. In our schema most fields are nullable anyway, so this makes the generated types generally accurate. We're open to a PR to add a compiler option to generate all field values as nullable.
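Such a compiler option would roughly amount to generating every field as possibly missing, regardless of schema nullability. A hypothetical sketch with a TypeScript mapped type (`DeepNullable` is not a Relay API, just an illustration):

```typescript
// Make every field optional and nullable, recursively, so the type matches
// what a normalized store can actually deliver after the "missing data" case.
type DeepNullable<T> = {
  [K in keyof T]?: T[K] extends object ? DeepNullable<T[K]> : T[K] | null | undefined;
};

// Roughly what the compiler generates today from the schema:
interface SomeObject {
  id: string;
  name: string;
  description: string | null;
}

// What it would generate with the option: everything optional/nullable,
// forcing callers to guard every access.
type SafeSomeObject = DeepNullable<SomeObject>;

const fromStore: SafeSomeObject = { id: "222", name: "Name 2" }; // no description: still type-checks
console.log(fromStore.description?.length ?? 0); // 0
```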

taion commented 5 years ago

> In our schema most fields are non-nullable

Do you mean "nullable" here?

mike-marcacci commented 5 years ago

@jckw yes, this makes the auto-generated types incorrect. For example, you may have an auto-generated type like:

type Foo = {
  id: string;
  name: string;
  description?: string;
}

Both id and name are guaranteed to exist, while description may be null/undefined. Your static type checking will allow you to do something like myFoo.name.length because it trusts that myFoo.name is a string. However, relay's normalization behavior makes it possible for myFoo.name to be undefined at runtime, resulting in a crash.

The generated type is correct given the schema; it's relay itself that introduces this inconsistency.

mike-marcacci commented 5 years ago

I'm sure there are apps where this problem never crops up; however, for me this was a prevalent and serious issue that couldn't be fixed without major changes to relay itself. We ended up moving everything (across multiple react apps) to Jayden Seric's graphql-react plus the Apollo CLI's code generation tools. This library avoids all these problems by not using a normalized cache. While this does mean there's less that I "get for free," this approach was a more-than-welcome tradeoff for predictability and consistency.

mike-marcacci commented 5 years ago

> We're open to a PR to add a compiler option to generate all fields values as nullable.

This would definitely be an improvement, as it would at least make static typing correct. The problem, though, is that data is still missing and so even if your app doesn't crash, you don't have the fields you need.

I'm definitely interested in the idea of fragment ownership and how it may solve this issue; a normalized cache is a really, really hard problem.

josephsavona commented 5 years ago

Oops, typo - yes, I meant that most of the fields in our schema are nullable.

josephsavona commented 5 years ago

@mike-marcacci I'm glad that you found a solution that works for your use-case. As I mentioned in my earlier comment, there are different tradeoffs across completeness, consistency, and performance, and there isn't one approach that is appropriate for every app. Relay focuses on cases where consistency matters: if you don't need consistency then a simpler/lighter solution can be more appropriate.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

agg23 commented 3 years ago

This is still an ongoing discussion as Relay develops, and should not be closed.

josephsavona commented 3 years ago

As I noted above, Relay uses a normalized cache and prioritizes consistency and performance, which means that data can sometimes be incomplete. The behavior observed here - data for a previous query going "missing" when a subsequent query exposes relationship changes in the object graph - is an expected aspect of Relay's design that we consider acceptable and preferable to the alternatives.

I'll update the title to reflect this, though I agree that we should keep this issue open to track improvements to the docs and APIs that allow developers to mitigate this in the cases when it is problematic.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

trey-miller commented 4 months ago

Hi, I know this issue is a little old by now, but it's introducing insidious bugs in our code, where we rely on the types being accurate. It seems odd to me to delete defined field data just because some other query happens to load the same object without requesting that field. Why not keep the previous data for a field the subsequent query didn't request?

Relay knows which fields were requested, so if a field wasn't requested, why assume it's undefined instead of just, well, not requested, and keep using the old data for that field? Isn't it a bit presumptuous to assume the data has changed on the server and delete a field's data simply because a later query didn't request that specific field? ESPECIALLY when the assumption violates the schema, like assuming a field is undefined when the schema marks it as required.

On the one hand, you could end up showing stale data, but on the other (current) hand, you get subtle bugs causing catastrophic runtime errors...

EDIT: I wanted to clarify after re-reading the above discussion, that the scenario in my situation is simpler than the ones described above. In my scenario, the server data has not changed. The only thing that happened is a later query encountered the same objects as some previous query, but the fields requested were different between the two queries.

For example:

query FirstQuery {
  objects {
    edges {
      node {
        foo
        bar
      }
    }
  }
}

query SecondQuery {
  objects(someFilter: true) {
    edges {
      node {
        bar
      }
    }
  }
}

Say the FirstQuery component is showing a list of the results with their foo and bar values. Then SecondQuery comes in; the someFilter filter doesn't have much impact, so the response list is essentially the same. Relay's cache clobbers the first query's data and removes all foo values, so now the FirstQuery component is missing its foo data.

Now imagine further that there is code relying on foo being defined. Say it's a value passed as a required variable in a refetch: now you have a 422 error, because the schema says you must supply a defined value for this query variable, and our TypeScript compiler told us this field is always defined, because the Relay compiler told us it would be.

This is an example of a subtle runtime bug caused by this "missing data" behavior. How can we code if we can't rely on our types? Are we to test every single field in every single fragment for null/undefined now because it actually could be, not because the schema says so but because Relay decided to delete field data for a previous query simply because a subsequent query didn't query for that field?

It seems to defy a main point of Relay, where components are supposed to be able to declare the fields they need, without being affected by other components' field requests.
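Until the underlying behavior changes, one defensive sketch for the refetch case above (`refetchVariables` is a hypothetical helper, and `foo` here is the field from the example queries) is to verify the schema-required value actually survived the store before using it as a required variable:

```typescript
// Guard a schema-required field before building refetch variables.
// Returns null when the store has clobbered the field, so the caller can
// bail out instead of sending an invalid (422) request.
function refetchVariables(node: { foo?: string | null }): { foo: string } | null {
  if (node.foo == null) {
    return null; // field went "missing"; don't issue the refetch
  }
  return { foo: node.foo };
}

console.log(refetchVariables({ foo: "x" })); // returns { foo: "x" }
console.log(refetchVariables({}));           // returns null
```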


Edit 2: I wanted to share a quick follow-up. I discovered one thing that actually made my situation more prone to this problem: the field foo above actually represents a field called id in my types. By default, Relay assumes this is the DataID of the object purely because of its name, and uses it as the key for normalization in the cache. So it saw an object with no id value as a different object than the one with an actual value. I see now why I was encountering the issue even though the server-side data hadn't changed - as far as Relay could see, it did change. This knowledge about how the id field is special, and how it impacts things here, is helping me develop ways to work around the issue (most likely involving a custom getDataID). ✌️
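For anyone following along, the direction sketched here is a getDataID-shaped function that keys records on something other than a domain field that happens to be named id. Relay's Environment accepts a getDataID option with a similar (record, typeName) shape; the `globalID` field and the client-key format below are invented for illustration:

```typescript
let nextClientID = 0;

// Shape mirrors Relay's getDataID(record, typeName) Environment option.
function getDataID(record: Record<string, unknown>, typeName: string): string {
  // Key on a dedicated global identifier when present...
  if (typeof record.globalID === "string") return record.globalID;
  // ...and otherwise fall back to a client-generated key, instead of
  // letting a domain `id` field decide record identity, so objects
  // fetched without that field don't normalize as different records.
  return `client:${typeName}:${nextClientID++}`;
}

console.log(getDataID({ globalID: "abc", id: 42 }, "Thing")); // "abc"
console.log(getDataID({ id: 42 }, "Thing"));                  // "client:Thing:0"
```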