Open ghost opened 1 year ago
Note: I've replaced the "steps to reproduce" above with references to a project that reproduces the problem. Specifically about 1 time in 20, one of the result-handling closures gets dropped without being called, and when that happens there always seems to be two operations with the same sequence number.
I haven't worked out all the steps in between, but it seems that (by inspection of the code) there is a race condition in requestId()
in OperationMessageIdCreator.swift, and when that results in repeated sequence numbers, surprising consequences occur.
@wilsonmhpn 👋🏻 - thanks for the detailed bug report and ability to reproduce. We'll take a look into this and get back to you soon.
Summary
It seems like it's possible to have a collision on Operation message IDs in OperationMessageIdCreator.swift:
The increment is atomic, but it if
requestId()
is accessed via different threads, I believe the same value of sequenceNumberCounter could be copied out twice before any increment occurs.We've seen sequence number collisions actually occur.
One consequence of this is unexpected behavior regarding the result handling closures passed to the
fetch
andperform
calls. We were seeing closures being deallocated without being called, and traced the dealloc down to this line in WebSocketTransport.swift:If two different operations are given the same identifier, this would result in the second displacing the first in the
self.subscribers
map, resulting in deallocation of the first operation's resultHandler (assuming that map is the only place it is referenced, and that seems to be the case).Version
1.0.6
Steps to reproduce the behavior
(Re-doing these steps, since I managed to replicate this in a project small enough to post.)
The code is at [link to apolloquestion0002 removed, outdated]- though it takes many restarts of the app to encounter it.
To start the server, go to the
apolloquestion0002/server
directory and runThen open the
apolloquestion0002.xcodeproj
project underapolloquestion0002/client/
, reset the package caches, and run. No user interaction is required, it will connect to port 4000 on localhost where it expects the Apollo server to be (as such it needs to be run in the simulator, on the same machine where the server is running).If the "CLOSURE WAS NOT CALLED" line at [link to apolloquestion0002 removed, outdated] is hit, that indicates that a closure to
perform
orfetch
was discarded without ever being called. If you examine the object dumps in the log, in these cases you should see that the sequence numbers for the two operations are the same.(I haven't delved into precisely how the repeated sequence number causes the closure to be dropped along the path through the code where the server is available; but assuming sequence numbers should never be repeated, I think showing that the numbers are repeated is sufficient to demonstrate a problem.)
So, the majority of the time, the two Operations have two different sequence numbers, and both of their result handling closures are called, and everything works. But once in a while a closure is de-allocated without being called, and the log output looks like this:
Note that
sequenceNumber
is 1 for two different operations.There is a comment in the code at [link to apolloquestion0002 removed, outdated] where I think I see a way to mitigate this. Even though the
fetch
andperform
calls are inside a closure towithCheckedContinuation
, applying@MainActor
to the surrounding function seems to result in the Apollo calls happening on the main thread. If all Apollo calls happen on the same thread then that would avoid the race condition in OperationMessageIdCreator.swiftrequestId()
. I'm not 100% sure how or why@MainActor
onasyncFetch
seems to propagate down through thewithCheckedContinuation
though...Logs
No response
Anything else?
No response