kyle-painter opened this pull request 3 months ago
Hi @kyle-painter!
Thank you for your pull request and welcome to our community.
In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.
In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.
Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with `CLA signed`. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.
If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!
For further context, I've only observed this issue in a local app environment (using a task scheduler that enforces batched updates) and had very little success debugging the magic series of inputs/outputs that leads to the error. I also attempted to create a minimal reproduction from the relay-examples repo, again with no luck.
To briefly summarise: we have a connection of rows in a table and we execute a mutation to create a new row. This row is optimistically appended to the connection using `@appendNode`, and when a network error occurs Relay will attempt the rollback.
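For reference, a mutation of that shape might look like the following (field and type names are hypothetical; `@appendNode` is Relay's declarative connection directive):

```graphql
# Hypothetical mutation: the created row is appended to every connection
# whose ID is passed in $connections, as an edge of type "RowEdge".
mutation CreateRowMutation($input: CreateRowInput!, $connections: [ID!]!) {
  createRow(input: $input) {
    row @appendNode(connections: $connections, edgeTypeName: "RowEdge") {
      id
      fieldA
    }
  }
}
```

The same directive applies to the optimistic response, which is what puts `edges:3` into the connection before the network result arrives.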
```
<Table ref="client:root:tableConnection">
  <Row ref="client:root:tableConnection:edges:0" />
  <Row ref="client:root:tableConnection:edges:1" />
  <Row ref="client:root:tableConnection:edges:2" />
</Table>
```
Mutation is executed
```
<Table ref="client:root:tableConnection">
  <Row ref="client:root:tableConnection:edges:0" />
  <Row ref="client:root:tableConnection:edges:1" />
  <Row ref="client:root:tableConnection:edges:2" />
  <Row ref="client:root:tableConnection:edges:3" /> // Row is appended after optimistic update
</Table>
```
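The failure mode can be sketched outside of Relay with a toy record store (all names hypothetical): the rollback deletes the optimistic record while the child's cached fragment data still references its ID.

```javascript
// Toy record store: the rollback deletes the optimistic record while a stale
// snapshot (the child's cached fragment data) still references its ID.
const store = new Map([
  ['client:root:tableConnection:edges:3', { id: 'edges:3', fieldA: 'optimistic' }],
]);

// The child component cached this reference during the optimistic render.
const staleSnapshot = { edgeRefs: ['client:root:tableConnection:edges:3'] };

// Rollback removes the record...
store.delete('client:root:tableConnection:edges:3');

// ...so a re-render that reads through the stale snapshot finds nothing,
// which is the condition that triggers Relay's "expected non-null data" warning.
const readBack = staleSnapshot.edgeRefs.map((ref) => store.get(ref));
console.log(readBack); // [ undefined ]
```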
Network error and rollback occurs. During the rollback a synchronous update is flushed, causing the table to be re-rendered. Interestingly, the table connection still includes 4 edges, but when trying to call `useFragment` from within the `Row` component Relay cannot find the record and logs the warning:
```
Warning: Relay: Expected to have been able to read non-null data for fragment `?` declared in `useFragment()`, since fragment reference was non-null. Make sure that that `useFragment()`'s parent isn't holding on to and/or passing a fragment reference for data that has been deleted.
```
A very obscure issue to try and debug. The rendering behaviour appeared to change based on which fields I specified within my `Row` fragment, i.e. with `Row.fieldA` the error occurs, and without `Row.fieldA` it behaves as expected.
What I did narrow down, however, is that the rollback works as expected when a successful GraphQL payload is returned, and that after removing the configured task scheduler it fails in the same fashion. This is how I landed on the above outcome. We could also consider moving the scheduler down specifically to where the rollback is executed during `cancel`, so the current state update can remain synchronous.
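Sketched very roughly (the real `cancel`/rollback path in the `OperationExecutor` is more involved than this), the suggestion is to route the rollback through the scheduler only when one is configured:

```javascript
// Hypothetical shape of the rollback path: when a scheduler is configured the
// rollback is deferred as a task; otherwise it stays synchronous (today's
// behavior for the error path).
function cancelOperation(scheduler, rollback) {
  if (scheduler != null) {
    scheduler.schedule(rollback);
  } else {
    rollback();
  }
}

// With no scheduler, rollback happens immediately:
let syncRolledBack = false;
cancelOperation(null, () => { syncRolledBack = true; });
console.log(syncRolledBack); // true

// With a (toy) scheduler, rollback is deferred until the queue is flushed:
const pending = [];
const toyScheduler = {
  schedule: (task) => { pending.push(task); return '0'; },
  cancel: () => {},
};
let deferredRolledBack = false;
cancelOperation(toyScheduler, () => { deferredRolledBack = true; });
console.log(deferredRolledBack); // false: queued, not yet run
pending.forEach((task) => task());
console.log(deferredRolledBack); // true
```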
Ok after a lot of debugging I was able to narrow down a reproduction which can affect both successful and failed network responses.
There are a couple of key components at play:

- The `useFragment` hook caches the optimistic node, so when it attempts to render the stale connection it can't serve the data from cache and fails to look the record up from the store.
- Configuring a batched task scheduler resolves this issue for the success path; however, it still crashes during a failed network response, as that path doesn't leverage the configured scheduler.

While these may be unorthodox conditions to meet, I do believe using the task scheduler to perform these rollbacks fundamentally makes sense (as with how payloads are currently processed).
- Successful payload
- Successful payload with batched task scheduler
- Failed network payload
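For context, a batched task scheduler along these lines can be supplied to the environment. This is a sketch with a manual `flush()` for clarity; in an app the flush would typically be driven by a microtask or `ReactDOM.unstable_batchedUpdates`, and the exact `TaskScheduler` shape should be checked against relay-runtime:

```javascript
// Sketch of a batched TaskScheduler in the { schedule, cancel } shape Relay
// expects. flush() runs every pending task in one pass so all subscribers
// observe the same store state; in production it would be passed roughly as:
//   new Environment({ network, store, scheduler })
function createBatchedScheduler() {
  const queue = new Map();
  let nextId = 0;
  return {
    schedule(task) {
      const id = String(nextId++);
      queue.set(id, task);
      return id;
    },
    cancel(id) {
      queue.delete(id);
    },
    // The "batched" part: drain the whole queue in insertion order at once.
    flush() {
      for (const [id, task] of [...queue]) {
        queue.delete(id);
        task();
      }
    },
  };
}

const scheduler = createBatchedScheduler();
const order = [];
scheduler.schedule(() => order.push('parent'));
const childId = scheduler.schedule(() => order.push('child'));
scheduler.schedule(() => order.push('sibling'));
scheduler.cancel(childId); // cancelled tasks never run
scheduler.flush();
console.log(order); // [ 'parent', 'sibling' ]
```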
cc @captbaritone for your review/feedback 😄
@kyle-painter Thanks for reporting this, digging in, and working on a fix. Sorry I haven't gotten back to you sooner. I'll spend some time today reviewing and looking at the case you've collected. As I'm sure you've noticed changes like this can be very hard to reason about. One thing we might want to do is put the change behind a feature flag so that we can roll it out and quickly opt back into the old behavior if we observe any issues. If we can confirm that we don't notice any issues, we can remove the feature flag.
Runtime feature flags are currently implemented as a global singleton here. You can add a new option and then conditionally use the scheduler depending upon what value is set.
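As a sketch of that approach (the flag name below is hypothetical, not an existing relay-runtime flag), the singleton gains a new boolean and the call site branches on it:

```javascript
// Toy stand-in for relay-runtime's RelayFeatureFlags singleton, with a
// hypothetical new flag gating the scheduler-based behavior.
const RelayFeatureFlags = {
  ENABLE_OPERATION_EXECUTOR_BATCHING: false, // hypothetical flag name
};

// Call site: only go through the scheduler when the flag is enabled.
function applyUpdate(scheduler, task) {
  if (RelayFeatureFlags.ENABLE_OPERATION_EXECUTOR_BATCHING && scheduler != null) {
    scheduler.schedule(task);
  } else {
    task(); // old (synchronous) behavior, preserved as the fallback
  }
}

const log = [];
const scheduler = {
  schedule: (task) => { log.push('scheduled'); task(); return '0'; },
  cancel: () => {},
};

applyUpdate(scheduler, () => log.push('ran')); // flag off: synchronous
RelayFeatureFlags.ENABLE_OPERATION_EXECUTOR_BATCHING = true;
applyUpdate(scheduler, () => log.push('ran')); // flag on: via scheduler
console.log(log); // [ 'ran', 'scheduled', 'ran' ]
```

If issues surface during rollout, flipping the flag back restores the previous behavior without a code revert.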
Thanks again for your investigation here and apologies for the lagging response.
@captbaritone has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Importing this to play around with it a bit. Still trying to make sure I can fully wrap my head around the implications here.
No problem @captbaritone I'll try and get the changes behind a feature flag some time this week.
So, the bad update happens on the rollback in response to the mutation error. During the rollback, the optimistically added list item is removed. This makes sense to me.
This is important as it allows a parent component to be notified of updates and re-rendered first, which will then render children as per the normal React flow which still reference the stale connection data.
I'm a little confused by this sentence. I would expect that if a parent (containing the connection) rerenders before the children it would observe the new state and we would get a correct reflow where it would rerender the new set of children with the removed list item not present. I was expecting the cause here to be the inverse. For example, if a child is earlier in the subscriptions array than its parent, and we notify in order and unbatched, we would rerender the component that was rendering the now deleted item (triggering the error) before the parent had a chance to rerender and cause that child to unmount.
Am I reading your summary correctly? If the parent really is rerendering first, but with the stale connection items, do you know what's leading the parent to rerender with the stale connection data?
I'm also trying to wrap my head around why React's auto-batching is not helping us here. My understanding is that promise resolve should be autobatched as of React 18. I tried upgrading your example project to React 18 and the error still seems to reproduce.
I can continue to dig on both of these fronts (thanks again for your detailed report with repro!) but I figured I'd ask directly in case you already had developed an understanding of the cause here during your investigation.
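For what it's worth, the inverse ordering hypothesized above (a child subscriber notified before its parent, unbatched) can be simulated with a toy store and subscriber list; everything here is hypothetical:

```javascript
// Toy store: subscribers are notified one by one in array order (unbatched).
const records = new Map([['item:1', { text: 'optimistic item' }]]);
const events = [];

// The child subscribed before the parent, so it is notified first.
const subscribers = [
  function child() {
    const rec = records.get('item:1');
    // With unbatched notification, the child re-reads a deleted record before
    // the parent has had a chance to unmount it: the useFragment warning case.
    events.push(rec === undefined ? 'child: missing data!' : 'child: ok');
  },
  function parent() {
    events.push('parent: rerendered without item:1');
  },
];

records.delete('item:1');        // rollback removes the optimistic record
subscribers.forEach((s) => s()); // notify in subscription order

console.log(events);
// [ 'child: missing data!', 'parent: rerendered without item:1' ]
```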
I would expect that if a parent (containing the connection) rerenders before the children it would observe the new state and we would get a correct reflow
In the example app, both the parent and the child retrieve the connection from the fragment separately. The parent in this instance does have the correct connection data, however I believe when the child reads the fragment it's serving the stale, cached connection.
```
TodoApp (TodoConnection)        # This is rendered first
 | - TodoList (TodoConnection)  # Rendered by the parent, serving stale connection data from cache
    | - Todo
    | - Todo
    | - Todo
```
If the `TodoList` were to be rendered first in this instance, then it would behave as you've suggested.
I tried upgrading your example project to React 18 and the error still seems to reproduce.
Hmm, I haven't attempted the upgrade myself. Did you update the app to use `ReactDOM.createRoot`? If not I can try giving this a go as well.
Ok, I've given it a test with `createRoot`, and automatic batching does appear to resolve the issue as well, which is good to know.
Ah yeah, I had forgotten that the implicit batching was dependent upon using createRoot. I can confirm that once I did that I also was unable to reproduce the issue.
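The difference automatic batching makes can be shown with a toy render counter (a sketch of the coalescing idea, not React internals): unbatched, each store notification triggers its own render pass; batched, the notifications are coalesced into a single pass.

```javascript
// Two notifications (optimistic write + rollback) delivered to a component.
function runNotifications(batched) {
  let renders = 0;
  const render = () => { renders += 1; };
  const notifications = [render, render];

  if (batched) {
    // Coalesce: mark dirty, render once at the end of the batch
    // (roughly what React 18's createRoot does automatically).
    let dirty = false;
    notifications.forEach(() => { dirty = true; });
    if (dirty) render();
  } else {
    // Legacy mode outside event handlers: render per notification, so a
    // component can render from an intermediate (inconsistent) state.
    notifications.forEach((notify) => notify());
  }
  return renders;
}

console.log(runNotifications(false)); // 2 render passes
console.log(runNotifications(true));  // 1 render pass
```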
Operations in Relay are flushed synchronously by default, which has been observed to cause potential race conditions in the client app. In practice this is very difficult to reproduce and is dependent on various factors that may affect the component render order of the application.
An effective method to mitigate these potential errors is to configure a task scheduler to batch updates; however, the `OperationExecutor` doesn't use the scheduler when processing optimistic responses or during rollback after a network error, meaning these events are still susceptible to obscure race condition errors. This PR uses the configured scheduler in these scenarios to help alleviate this class of error.

Example of an invalid render due to such an error:
Some additional internal feedback from @tomgasson regarding these changes:
On that note, it would be excellent to get additional feedback from the maintainers on whether this unexpected behaviour has been observed elsewhere, and whether there are other approaches you'd suggest to avoid/solve this scenario.