Thread-related collision on sequence number results in Operations releasing resultHandlers without calling

ghost commented 1 year ago

Summary

It seems like it's possible to have a collision on Operation message IDs in OperationMessageIdCreator.swift:

  public func requestId() -> String {
    let id = sequenceNumberCounter
    _ = $sequenceNumberCounter.increment()

    return "\(id)"
  }

The increment is atomic, but it if requestId() is accessed via different threads, I believe the same value of sequenceNumberCounter could be copied out twice before any increment occurs.

We've seen sequence number collisions actually occur.

One consequence of this is unexpected behavior regarding the result handling closures passed to the fetch and perform calls. We were seeing closures being deallocated without being called, and traced the dealloc down to this line in WebSocketTransport.swift:

self.subscribers[identifier] = resultHandler

If two different operations are given the same identifier, this would result in the second displacing the first in the self.subscribers map, resulting in deallocation of the first operation's resultHandler (assuming that map is the only place it is referenced, and that seems to be the case).

Version

1.0.6

Steps to reproduce the behavior

(Re-doing these steps, since I managed to replicate this in a project small enough to post.)

The code is at [link to apolloquestion0002 removed, outdated]- though it takes many restarts of the app to encounter it.

To start the server, go to the apolloquestion0002/server directory and run

nvm use v16.14.0 
node app.mjs

Then open the apolloquestion0002.xcodeproj project under apolloquestion0002/client/, reset the package caches, and run. No user interaction is required, it will connect to port 4000 on localhost where it expects the Apollo server to be (as such it needs to be run in the simulator, on the same machine where the server is running).

If the "CLOSURE WAS NOT CALLED" line at [link to apolloquestion0002 removed, outdated] is hit, that indicates that a closure to perform or fetch was discarded without ever being called. If you examine the object dumps in the log, in these cases you should see that the sequence numbers for the two operations are the same.

(I haven't delved into precisely how the repeated sequence number causes the closure to be dropped along the path through the code where the server is available; but assuming sequence numbers should never be repeated, I think showing that the numbers are repeated is sufficient to demonstrate a problem.)

So, the majority of the time, the two Operations have two different sequence numbers, and both of their result handling closures are called, and everything works. But once in a while a closure is de-allocated without being called, and the log output looks like this:

asyncFetch withCheckedContinuation called on <NSThread: 0x600000e14640>{number = 4, name = (null)}
asyncPerform withCheckedContinuation called on <NSThread: 0x600000e37840>{number = 9, name = (null)}
Dump for fetchCancellable:
Dump for performCancellable:
▿ Optional(ApolloWebSocket.WebSocketTask<SwiftCodeGeneratedByApollo.ThingByIdQuery>)
  ▿ some: ApolloWebSocket.WebSocketTask<SwiftCodeGeneratedByApollo.ThingByIdQuery> #0
    ▿ sequenceNumber: Optional("1")
      - some: "1"
[etc, etc...]
(lldb) c
Process 6076 resuming
▿ Optional(ApolloWebSocket.WebSocketTask<SwiftCodeGeneratedByApollo.CreateThingMutation>)
  ▿ some: ApolloWebSocket.WebSocketTask<SwiftCodeGeneratedByApollo.CreateThingMutation> #0
    ▿ sequenceNumber: Optional("1")
      - some: "1"
[etc, etc...]
asyncPerform perform closure called on <_NSMainThread: 0x600000e5c2c0>{number = 1, name = main}
deinit on fetchCancellable BUT CLOSURE WAS NOT CALLED
SWIFT TASK CONTINUATION MISUSE: asyncFetch() leaked its continuation!
deinit on performCancellable AND CLOSURE WAS CALLED
2023-01-24 16:46:34.154505-0600 apolloquestion0002[6076:82022] SWIFT TASK CONTINUATION MISUSE: asyncFetch() leaked its continuation!

Note that sequenceNumber is 1 for two different operations.

There is a comment in the code at [link to apolloquestion0002 removed, outdated] where I think I see a way to mitigate this. Even though the fetch and perform calls are inside a closure to withCheckedContinuation, applying @MainActor to the surrounding function seems to result in the Apollo calls happening on the main thread. If all Apollo calls happen on the same thread then that would avoid the race condition in OperationMessageIdCreator.swift requestId(). I'm not 100% sure how or why @MainActor on asyncFetch seems to propagate down through the withCheckedContinuation though...

Logs

No response

Anything else?

No response

ghost commented 1 year ago

Note: I've replaced the "steps to reproduce" above with references to a project that reproduces the problem. Specifically about 1 time in 20, one of the result-handling closures gets dropped without being called, and when that happens there always seems to be two operations with the same sequence number.

I haven't worked out all the steps in between, but it seems that (by inspection of the code) there is a race condition in requestId() in OperationMessageIdCreator.swift, and when that results in repeated sequence numbers, surprising consequences occur.

calvincestari commented 1 year ago

@wilsonmhpn 👋🏻 - thanks for the detailed bug report and ability to reproduce. We'll take a look into this and get back to you soon.

apollographql / apollo-ios