kyle-painter opened this pull request 3 months ago
Hi @kyle-painter!
Thank you for your pull request and welcome to our community.
In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.
In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.
Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with `CLA signed`. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.
If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!
For further context, I've only observed this issue in a local app environment (using a task scheduler that enforces batched updates) and had very little success debugging the magic series of inputs/outputs that leads to the error. I also attempted to create a minimal reproduction from the relay-examples repo, again with no luck.
To briefly summarise: we have a connection of rows in a table and we execute a mutation to create a new row. This row is optimistically appended to the connection using `@appendNode`, and when a network error occurs Relay will attempt the rollback.
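For reference, a mutation of that shape might look like the following (field and type names are hypothetical; `@appendNode` is Relay's declarative connection directive):

```graphql
# Hypothetical mutation: the created row is appended to every connection
# whose ID is passed in $connections, as an edge of type "RowEdge".
mutation CreateRowMutation($input: CreateRowInput!, $connections: [ID!]!) {
  createRow(input: $input) {
    row @appendNode(connections: $connections, edgeTypeName: "RowEdge") {
      id
      fieldA
    }
  }
}
```

The same directive applies to the optimistic response, which is what puts `edges:3` into the connection before the network result arrives.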
```
<Table ref="client:root:tableConnection">
  <Row ref="client:root:tableConnection:edges:0" />
  <Row ref="client:root:tableConnection:edges:1" />
  <Row ref="client:root:tableConnection:edges:2" />
</Table>
```
Mutation is executed
```
<Table ref="client:root:tableConnection">
  <Row ref="client:root:tableConnection:edges:0" />
  <Row ref="client:root:tableConnection:edges:1" />
  <Row ref="client:root:tableConnection:edges:2" />
  <Row ref="client:root:tableConnection:edges:3" /> // Row is appended after optimistic update
</Table>
```
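The failure mode can be sketched outside of Relay with a toy record store (all names hypothetical): the rollback deletes the optimistic record while the child's cached fragment data still references its ID.

```javascript
// Toy record store: the rollback deletes the optimistic record while a stale
// snapshot (the child's cached fragment data) still references its ID.
const store = new Map([
  ['client:root:tableConnection:edges:3', { id: 'edges:3', fieldA: 'optimistic' }],
]);

// The child component cached this reference during the optimistic render.
const staleSnapshot = { edgeRefs: ['client:root:tableConnection:edges:3'] };

// Rollback removes the record...
store.delete('client:root:tableConnection:edges:3');

// ...so a re-render that reads through the stale snapshot finds nothing,
// which is the condition that triggers Relay's "expected non-null data" warning.
const readBack = staleSnapshot.edgeRefs.map((ref) => store.get(ref));
console.log(readBack); // [ undefined ]
```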
Network error and rollback occurs. During the rollback a synchronous update is flushed, causing the table to be re-rendered. Interestingly, the table connection still includes 4 edges, but when trying to call `useFragment` from within the `Row` component Relay cannot find the record and logs the warning:
```
Warning: Relay: Expected to have been able to read non-null data for fragment `?` declared in `useFragment()`, since fragment reference was non-null. Make sure that that `useFragment()`'s parent isn't holding on to and/or passing a fragment reference for data that has been deleted.
```
A very obscure issue to try and debug. The rendering behaviour appeared to change based on which fields I specified within my `Row` fragment, i.e. with `Row.fieldA` the error occurs, and without `Row.fieldA` it behaves as expected.
What I did narrow down, however, is that the rollback works as expected when a successful GraphQL payload is returned, and that after removing the configured task scheduler it fails in the same fashion. This is how I landed on the above outcome. We could also consider moving the scheduler down specifically to where the rollback is executed during `cancel`, so the current state update can remain synchronous.
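Sketched very roughly (the real `cancel`/rollback path in the `OperationExecutor` is more involved than this), the suggestion is to route the rollback through the scheduler only when one is configured:

```javascript
// Hypothetical shape of the rollback path: when a scheduler is configured the
// rollback is deferred as a task; otherwise it stays synchronous (today's
// behavior for the error path).
function cancelOperation(scheduler, rollback) {
  if (scheduler != null) {
    scheduler.schedule(rollback);
  } else {
    rollback();
  }
}

// With no scheduler, rollback happens immediately:
let syncRolledBack = false;
cancelOperation(null, () => { syncRolledBack = true; });
console.log(syncRolledBack); // true

// With a (toy) scheduler, rollback is deferred until the queue is flushed:
const pending = [];
const toyScheduler = {
  schedule: (task) => { pending.push(task); return '0'; },
  cancel: () => {},
};
let deferredRolledBack = false;
cancelOperation(toyScheduler, () => { deferredRolledBack = true; });
console.log(deferredRolledBack); // false: queued, not yet run
pending.forEach((task) => task());
console.log(deferredRolledBack); // true
```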
Ok after a lot of debugging I was able to narrow down a reproduction which can affect both successful and failed network responses.
There are a couple of key components at play:

- The `useFragment` hook caches the optimistic node, so when it attempts to render the stale connection it can't serve the data from cache and fails to look the record up from the store.
- Configuring a batched task scheduler resolves this issue for the success path; however, it still crashes during a failed network response, as that path doesn't leverage the configured scheduler.

While these may be unorthodox conditions to meet, I do believe using the task scheduler to perform these rollbacks fundamentally makes sense (as with how payloads are currently processed).
- Successful payload
- Successful payload with batched task scheduler
- Failed network payload
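For context, a batched task scheduler along these lines can be supplied to the environment. This is a sketch with a manual `flush()` for clarity; in an app the flush would typically be driven by a microtask or `ReactDOM.unstable_batchedUpdates`, and the exact `TaskScheduler` shape should be checked against relay-runtime:

```javascript
// Sketch of a batched TaskScheduler in the { schedule, cancel } shape Relay
// expects. flush() runs every pending task in one pass so all subscribers
// observe the same store state; in production it would be passed roughly as:
//   new Environment({ network, store, scheduler })
function createBatchedScheduler() {
  const queue = new Map();
  let nextId = 0;
  return {
    schedule(task) {
      const id = String(nextId++);
      queue.set(id, task);
      return id;
    },
    cancel(id) {
      queue.delete(id);
    },
    // The "batched" part: drain the whole queue in insertion order at once.
    flush() {
      for (const [id, task] of [...queue]) {
        queue.delete(id);
        task();
      }
    },
  };
}

const scheduler = createBatchedScheduler();
const order = [];
scheduler.schedule(() => order.push('parent'));
const childId = scheduler.schedule(() => order.push('child'));
scheduler.schedule(() => order.push('sibling'));
scheduler.cancel(childId); // cancelled tasks never run
scheduler.flush();
console.log(order); // [ 'parent', 'sibling' ]
```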
cc @captbaritone for your review/feedback 😄
@kyle-painter Thanks for reporting this, digging in, and working on a fix. Sorry I haven't gotten back to you sooner. I'll spend some time today reviewing and looking at the case you've collected. As I'm sure you've noticed changes like this can be very hard to reason about. One thing we might want to do is put the change behind a feature flag so that we can roll it out and quickly opt back into the old behavior if we observe any issues. If we can confirm that we don't notice any issues, we can remove the feature flag.
Runtime feature flags are currently implemented as a global singleton here. You can add a new option and then conditionally use the scheduler depending upon what value is set.
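As a sketch of that approach (the flag name below is hypothetical, not an existing relay-runtime flag), the singleton gains a new boolean and the call site branches on it:

```javascript
// Toy stand-in for relay-runtime's RelayFeatureFlags singleton, with a
// hypothetical new flag gating the scheduler-based behavior.
const RelayFeatureFlags = {
  ENABLE_OPERATION_EXECUTOR_BATCHING: false, // hypothetical flag name
};

// Call site: only go through the scheduler when the flag is enabled.
function applyUpdate(scheduler, task) {
  if (RelayFeatureFlags.ENABLE_OPERATION_EXECUTOR_BATCHING && scheduler != null) {
    scheduler.schedule(task);
  } else {
    task(); // old (synchronous) behavior, preserved as the fallback
  }
}

const log = [];
const scheduler = {
  schedule: (task) => { log.push('scheduled'); task(); return '0'; },
  cancel: () => {},
};

applyUpdate(scheduler, () => log.push('ran')); // flag off: synchronous
RelayFeatureFlags.ENABLE_OPERATION_EXECUTOR_BATCHING = true;
applyUpdate(scheduler, () => log.push('ran')); // flag on: via scheduler
console.log(log); // [ 'ran', 'scheduled', 'ran' ]
```

If issues surface during rollout, flipping the flag back restores the previous behavior without a code revert.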
Thanks again for your investigation here and apologies for the lagging response.
@captbaritone has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Importing this to play around with it a bit. Still trying to make sure I can fully wrap my head around the implications here.
No problem @captbaritone I'll try and get the changes behind a feature flag some time this week.
So, the bad update happens on the rollback in response to the mutation error. During the rollback, the optimistically added list item is removed. This makes sense to me.
This is important as it allows a parent component to be notified of updates and re-rendered first, which will then render children as per the normal React flow which still reference the stale connection data.
I'm a little confused by this sentence. I would expect that if a parent (containing the connection) rerenders before the children it would observe the new state and we would get a correct reflow where it would rerender the new set of children with the removed list item not present. I was expecting the cause here to be the inverse. For example, if a child is earlier in the subscriptions array than its parent, and we notify in order and unbatched, we would rerender the component that was rendering the now deleted item (triggering the error) before the parent had a chance to rerender and cause that child to unmount.
Am I reading your summary correctly? If the parent really is rerendering first, but with the stale connection items, do you know what's leading the parent to rerender with the stale connection data?
I'm also trying to wrap my head around why React's auto-batching is not helping us here. My understanding is that promise resolve should be autobatched as of React 18. I tried upgrading your example project to React 18 and the error still seems to reproduce.
I can continue to dig on both of these fronts (thanks again for your detailed report with repro!) but I figured I'd ask directly in case you already had developed an understanding of the cause here during your investigation.
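For what it's worth, the inverse ordering hypothesized above (a child subscriber notified before its parent, unbatched) can be simulated with a toy store and subscriber list; everything here is hypothetical:

```javascript
// Toy store: subscribers are notified one by one in array order (unbatched).
const records = new Map([['item:1', { text: 'optimistic item' }]]);
const events = [];

// The child subscribed before the parent, so it is notified first.
const subscribers = [
  function child() {
    const rec = records.get('item:1');
    // With unbatched notification, the child re-reads a deleted record before
    // the parent has had a chance to unmount it: the useFragment warning case.
    events.push(rec === undefined ? 'child: missing data!' : 'child: ok');
  },
  function parent() {
    events.push('parent: rerendered without item:1');
  },
];

records.delete('item:1');        // rollback removes the optimistic record
subscribers.forEach((s) => s()); // notify in subscription order

console.log(events);
// [ 'child: missing data!', 'parent: rerendered without item:1' ]
```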
I would expect that if a parent (containing the connection) rerenders before the children it would observe the new state and we would get a correct reflow
In the example app, both the parent and the child retrieve the connection from the fragment separately. The parent in this instance does have the correct connection data, however I believe when the child reads the fragment it's serving the stale, cached connection.
```
TodoApp (TodoConnection)        # This is rendered first
 | - TodoList (TodoConnection)  # Rendered by the parent, serving stale connection data from cache
    | - Todo
    | - Todo
    | - Todo
```
If the `TodoList` were to be rendered first in this instance, then it would behave as you've suggested.
I tried upgrading your example project to React 18 and the error still seems to reproduce.
Hmm, I haven't attempted the upgrade myself. Did you update the app to use `ReactDOM.createRoot`? If not I can try giving this a go as well.
Ok, I've given it a test with `createRoot`, and automatic batching does appear to resolve the issue as well, which is good to know.
Ah yeah, I had forgotten that the implicit batching was dependent upon using createRoot. I can confirm that once I did that I also was unable to reproduce the issue.
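The difference automatic batching makes can be shown with a toy render counter (a sketch of the coalescing idea, not React internals): unbatched, each store notification triggers its own render pass; batched, the notifications are coalesced into a single pass.

```javascript
// Two notifications (optimistic write + rollback) delivered to a component.
function runNotifications(batched) {
  let renders = 0;
  const render = () => { renders += 1; };
  const notifications = [render, render];

  if (batched) {
    // Coalesce: mark dirty, render once at the end of the batch
    // (roughly what React 18's createRoot does automatically).
    let dirty = false;
    notifications.forEach(() => { dirty = true; });
    if (dirty) render();
  } else {
    // Legacy mode outside event handlers: render per notification, so a
    // component can render from an intermediate (inconsistent) state.
    notifications.forEach((notify) => notify());
  }
  return renders;
}

console.log(runNotifications(false)); // 2 render passes
console.log(runNotifications(true));  // 1 render pass
```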
Operations in Relay are flushed synchronously by default, which has been observed to cause potential race conditions in the client app. In practice this is very difficult to reproduce and is dependent on various factors that may affect the component render order of the application.
An effective method to mitigate these potential errors is to configure a task scheduler to batch updates; however, the `OperationExecutor` doesn't use the scheduler when processing optimistic responses or during rollback after a network error, meaning these events are still susceptible to obscure race condition errors. This PR uses the configured scheduler in these scenarios to help alleviate this class of error.

Example of an invalid render due to such an error:
Some additional internal feedback from @tomgasson regarding these changes:
On that note, it would be excellent to get additional feedback from the maintainers on whether this unexpected behaviour has been observed elsewhere, and whether there are other approaches you'd suggest to avoid/solve this scenario.