resetStore and refetchQueries can cause to refetch unmounted/inactive queries in StrictMode

Cellule commented 4 months ago

Issue Description

We recently upgraded to version 3 Right now we use resetStore for a particular scenario. All graphql requests sent have an "organizationId" header associated. Depending on that "organizationId" the user will see different content. When the user changes to another organization we navigate to an almost empty page for the transition (to unmount all organization specific queries) Then we call resetStore to refetch top-level queries that "might" be affected by the change of context. Once everything resolves, we navigate back to the normal app.

This used to work fine for the last 5 years, then we updated to apollo client version 3.

After some digging we found out that after calling resetStore some queries that are unmounted (confirmed 100%) end up being refetched. I debugged through the resetStore method, but I found no calls to refetch those queries. So right now I have no idea what could cause those unmounted queries to be requested once again.

The biggest problem I have right now is that because those queries are called from the wrong context, it causes another mechanism to automatically redirect to the previous organization (so direct links work in any organization) I have a lot of fail safe in place and it's fine 90% of the time, but when the network is slower those fail safe are not enough and it leads to the user no being able to change organization.

Any guidance on what could cause an unmounted query to hit the network once more after calling resetStore would be appreciated Or tips on how to find out

Link to Reproduction

Reproduction Steps

Go to "Home"
Go to "Home2"
Click "Reset Store" button
You'll notice the query "AllPeople" (only mounted in Home not Home2) will be sent when it should not

`@apollo/client` version

3.10.4

Cellule commented 4 months ago

Here are my findings so far (no repro still -_-)

I have a query on some page and it is set as "dirty" at some point https://github.com/apollographql/apollo-client/blob/d914d689abd9a3364cfa9c6bef8617d97df67a1c/src/core/QueryInfo.ts#L229-L234

Then the query is unmounted, but remains "dirty"

Later we call resetStore which calls reFetchObservableQueries https://github.com/apollographql/apollo-client/blob/d914d689abd9a3364cfa9c6bef8617d97df67a1c/src/core/ApolloClient.ts#L594-L603

In reFetchObservableQueries we call broadcastQueries https://github.com/apollographql/apollo-client/blob/d914d689abd9a3364cfa9c6bef8617d97df67a1c/src/core/QueryManager.ts#L969

In broadcastQueries we call notify for ALL queries even inactive ones https://github.com/apollographql/apollo-client/blob/d914d689abd9a3364cfa9c6bef8617d97df67a1c/src/core/QueryManager.ts#L1072-L1075

In notify for my problematic query, because it is "dirty" we call the listener which ends up calling reobserverCacheFirst on that query causing to send it to the server https://github.com/apollographql/apollo-client/blob/d914d689abd9a3364cfa9c6bef8617d97df67a1c/src/core/QueryInfo.ts#L251-L272

So it seems like at some point we should ignore "inactive" queries, aka observableQuery.hasObservers() is false

I think the right place to do this check would be in QueryInfo.shouldNotify https://github.com/apollographql/apollo-client/blob/d914d689abd9a3364cfa9c6bef8617d97df67a1c/src/core/QueryInfo.ts#L289-L292

Otherwise, maybe we should NOT call broadcastQueries in the first place from within reFecthObservableQueries ?

Cellule commented 4 months ago

Alright after going through this whole effort, I managed to make my minimal repro work!!

jerelmiller commented 4 months ago

@Cellule thanks for the bug report!

The reproduction is super helpful! I've cloned your reproduction and started looking into this to understand myself. Its close to the weekend and I'm not sure I'll be able to finish until early next week, but thanks so much for your thorough analysis. This will definitely save me a ton of time. I'll get back to you as soon as I can!

jerelmiller commented 4 months ago

@Cellule looking at this further, your diagnosis seems correct. I had originally assumed there was a bug holding onto the ObservableQuery in the this.queries map inside QueryManager, but that doesn't appear to be the case (hence my assumption in https://github.com/apollographql/apollo-client/issues/11894#issuecomment-2197707409).

I can now see that the only reason you're getting the refetch is because the ObservableQuery instance from the first page hasn't yet been garbage collected, so the notify call that you pointed out in your deep dive is triggering the refetch.

I've updated your reproduction to show that once ObservableQuery is garbage collected, we're able to see only a single network request (try clicking the "reset store" button after you see the log message about it getting garbage collected): https://codesandbox.io/p/devbox/fervent-scooby-c5xwk3. This at least confirms that we aren't holding onto that ObservableQuery instance somewhere else in the client 🎉.

I've discussed this with the team to make sure adding a check for hasObservers wouldn't have any other adverse effects, and we couldn't think of any. I think your suggested fix makes sense. I'd like to dig into the proper place for this, but otherwise I think we should move forward with this fix. If you're interested in submitting a PR, I'd be happy to review, otherwise I'll try and get something up within the next couple days.

Thanks again for your thorough deep dive through the code and the time getting a reproduction! This made my life so much easier ❤️.

Cellule commented 4 months ago

@jerelmiller Great to hear! I was about to open a PR then figured I should write a test for this. However, I can't figure out how to reach the state where there's an inactive ObservableQuery in the QueryManager. Actually, the more I review the code, the less I understand how it's possible to have an inactive query in the query manager 😅

jerelmiller commented 4 months ago

Haha right?!?! Thats why I was so confused why you were able to reproduce this because technically it shouldn't be possible 😆.

From what I understand, its the fact that the ObservableQuery instance still exists in memory and hasn't been garbage collected yet, so its able to respond to updates when broadcastQueries is run. The best way I could think to try and approach a test would be:

If you want to test this at the React level with useQuery:

Render a component with one query using useQuery.
After the fetch is complete, unmount that component and mount another component with another useQuery (a ternary would probably work just fine here). Make sure these are different queries so that query deduplication doesn't kick in giving you a false positive.
Execute client.reFetchObservableQueries
assert that only one of the two queries is executed.

If you opt for this approach, we have a handy Profiler utility that makes it easier to assert render-by-render. Here is an example of its usage with useQuery: https://github.com/apollographql/apollo-client/blob/1a9c68a4e6bc033498ce620681ae4a3720820ba4/src/react/hooks/__tests__/useQuery.test.tsx#L4177-L4432

If you want to test this with core APIs:

Create an observable with client.watchQuery
Subscribe to it, let the fetch finish, then unsubscribe from it
Create a 2nd observable with client.watchQuery. Ensure its a different query so that query deduplication doesn't kick in giving you a false positive.
Subscribe to it, let the fetch finish
Execute client.reFetchObservableQueries
Assert that only one of the two queries is executed.

If you opt for this approach, we have a handy ObservableStream utility that captures each emitted result that you can assert on in a more synchronous way. Here is an example of its usage: https://github.com/apollographql/apollo-client/blob/1a9c68a4e6bc033498ce620681ae4a3720820ba4/src/core/__tests__/ObservableQuery.ts#L2393-L2509

If you're struggling with either of these, let me know and I'd be happy to take a stab at creating a failing test for this to use as a base for implementing the change.

Cellule commented 4 months ago

So I'm not 100% sure why, but my current investigation leads me to this bit of code where InternalState.useObservableQuery will call this.client.watchQuery twice for the same query (can be seen on page load of my repro). The first one created seems to be abandoned, but remains in the list of query of the QueryManager.

https://github.com/apollographql/apollo-client/blob/116723d88b5600126cb266060ca58ce989d3988e/src/react/hooks/useQuery.ts#L531-L569

So in the end, it seems it's not an issue of query being cleaned up left in the QueryManager. It's a separate query that was never properly registered that is somehow in the list. When resetStore is called, that "never alive" query is revived. So maybe a proper fix would be to avoid this duplicate in the first place since I suspect there might be lots of "uninitialized" queries

jerelmiller commented 4 months ago

That sounds very familiar to https://github.com/apollographql/apollo-client/pull/11837 where I noticed something similar. Are you using React's strict mode by chance? I believe the the issue here was that the strict mode caused useState to run twice, which meant useInternalState was creating 2 instances of InternalState, which ran watchQuery twice: https://github.com/apollographql/apollo-client/blob/116723d88b5600126cb266060ca58ce989d3988e/src/react/hooks/useQuery.ts#L125

Take a look at the PR description in https://github.com/apollographql/apollo-client/pull/11837. Perhaps you might be seeing something similar. Funny enough, the fix for that one was also doing a check for hasObservers 😆

Cellule commented 4 months ago

It does look like StrictMode is the offender in the repro above where useRef in useInternalState is empty on both renders causing to create an ObservableQuery twice It would point to bad cleanup in useInternalState.

However, I'm hitting this issue in production where StrictMode should do nothing. Let me see if I can repro without StrictMode

Cellule commented 4 months ago

My biggest fear is that there's a memory leak somehow. I haven't been able to confirm it either in a repro nor in our production application, so I suppose that for now the initial patch should do the trick. I'll try to monitor afterward in case memory leak occurs

As for testing, I managed to write a test that fails without the fix and passes with it, it's not the cleanest approach, but let's discuss this on the PR directly

jerelmiller commented 4 months ago

Thanks for the PR! I'll take a look when I'm back on Monday. Have a great weekend!

Cellule commented 1 month ago

@jerelmiller Someone on my team, @lorenzopicoli , found another related issue. In one of our mutations, we're using the refetchQueries option

const [mutation] = useMutation<ICreateAssetMutationResp, ICreateAssetMutationArgs>(
    createAssetMutation,
    {
      ignoreResults: true,
      refetchQueries: [
        // needed because a sub-asset can be created at any level in the sub-assets page
        // called if in /assets/:assetId/subs, to refresh the descendants count in treeView
        "AssetParentTreeView",
        // called if in /assets/:assetId, to refresh the descendants count in sub-asset tree-view
        "AssetDetailsQuery",
      ],
    }
)

I went down through the code and ended up in QueryManager.getObservableQueries https://github.com/Cellule/apollo-client/blob/823b26e1073ae9c50d7c70078ea99e758975dab6/src/core/QueryManager.ts#L852-L900 It seems if include is an array of strings, it'll return queries matching the name even if they're not active. I feel this is a mistake, but I can't 100% confirm. I'll add a proposed fix in my PR for you to take a look

Cellule commented 1 month ago

I dug a bit more into it. From what I can see here's what happen (exacerbated by StrictMode, but possible without)

useQuery is called
client.watchQuery is called creating an observableQuery
React.useEffect() (actually useSyncExternalStore) is called to begin subscribing to the subscription
Something causes the component to unmount immediately (like StrictMode). The useEffect to start the subscription is never called Apparently this is possible when mounting/unmounting component relatively fast
Another query mount and goes through the process normally
InMemoryCache calls broadcastWatch after getting results from the "second" query, this causes the first query to become dirty
And bad state reached

Cellule commented 1 month ago

So clearly StrictMode is being a huge offender of this problem. I can see that the useState initializer function runs twice in StrictMode https://github.com/apollographql/apollo-client/blob/80471816396fa023fa2dc3b768ceb3da5b22edf4/src/react/hooks/useQuery.ts#L179-L205

https://react.dev/reference/react/useState#caveats Since there's no proper "cleanup" for the observable that is created in useState, we always end up with 2 observable for every query. According to this https://github.com/facebook/react/issues/20090#issuecomment-715926549 Creating a watcher/listener in useState is an anti-pattern and that's why it causes such problems in StrictMode

I'll try to come up with a better repro without StrictMode, but I do believe we should figure out and fix that issue in StrictMode as well to respect the rules of React

jerelmiller commented 1 month ago

@Cellule thanks for that confirmation. I suspected strict mode was the cause of at least one of the cases and agree we should try and do a better job here to avoid he double initialization. Do let me know if you're able to reproduce it without strict mode though. I'd be curious how that is allowed to happen, but my guess is the underlying cause is still very similar (e.g. multiple ObservableQuery instances are created)

Cellule commented 1 month ago

Now I'm starting to wonder if the main issue was fixed (aka dead queries in production) When I initially encountered this issue we were on apollo-client 3.10.4, we are currently on 3.10.8 and I can't seem to repro I reviewed the diff between those 2 versions and it does look like something changed around how useQuery handles its state v3.10.4...v3.10.8 Or more specifically from this PR https://github.com/apollographql/apollo-client/pull/11851

You probably have more context around that particular change, do you think this solved it ? If yes, then maybe we just have to focus on fixing the useState(createInternalState) in StrictMode

jerelmiller commented 1 month ago

@Cellule ok thats good news! That change was made to provide better compatibility with the new React Compiler. We were violating a few rules of React that would have made it difficult to integrate with it. Very possible the move away from the ref and into state helps here.

Cellule commented 1 month ago

Great, I updated the title of the issue to better represent the issue.

apollographql / apollo-client