amplitude / experiment-ios-client


ExperimentClient Semaphore Deadlock #35

Open ThePragmaticArt opened 11 months ago

ThePragmaticArt commented 11 months ago

Issue Report: Deadlock Risk in Experiment Client

Description:

The sample code below, which interacts with the experiment client, exposes a deadlock that our project is currently hitting. The SDK writes variants to user defaults while holding storageLock; that write posts UserDefaults.didChangeNotification synchronously, and a subscriber that reads a variant inside the notification handler then waits on storageLock again from the same thread. Because the semaphore is already held and cannot be signaled until the handler returns, the second wait never completes.

Steps to Reproduce:

  1. Obtain the experiment client via ExperimentClient.client.
  2. Initiate a call (for example, fetch) that writes to user defaults and therefore posts the didChange notification.
  3. Subscribe to UserDefaults.didChangeNotification and read a variant inside the notification block.
  4. Inside the notification block, the variant read waits on storageLock while the SDK still holds it, so the wait never returns.
  5. Outside the notification block, any further read through storageLock also blocks, because the lock is never signaled.

Code Illustration of Issue:

import Combine
// (plus the Experiment SDK import)
var cancellables = Set<AnyCancellable>()

// Step 1: Obtain the experiment client
let experimentClient = ExperimentClient.client

// Step 2: Trigger a user defaults write and the resulting didChange notification
experimentClient.fetch(user: nil, completion: nil)

// Step 3: Subscribe to user defaults change notifications
NotificationCenter.default
    .publisher(for: UserDefaults.didChangeNotification)
    .sink { notification in
        print("Notification Marker 1")
        // Step 4: This read waits on storageLock, which the SDK may still be holding
        let value = experimentClient.variant("test_flag").value
        print("Notification Marker 2")
    }.store(in: &cancellables)

print("Marker 1")
// Step 5: Another read that goes through storageLock
let value = experimentClient.variant("test_flag").value
print("Marker 2")

Code Illustration of Semaphore Issue Points:

private func sourceVariants() -> [String: Variant] {
    switch config.source {
    case .LocalStorage:
        // Re-entrant wait: deadlocks if the calling thread already holds storageLock,
        // e.g. a UserDefaults.didChangeNotification subscriber reading a variant.
        storageLock.wait()
        defer { storageLock.signal() }
        return storage.getAll()
    case .InitialVariants:
        return config.initialVariants
    }
}
private func storeVariants(_ variants: [String: Variant], _ options: FetchOptions?) {
    storageLock.wait()
    defer { storageLock.signal() }
    if (options?.flagKeys == nil) {
        storage.clear()
    }
    var failedKeys: [String] = options?.flagKeys ?? []
    for (key, variant) in variants {
        failedKeys.removeAll { $0 == key }
        self.storage.put(key: key, value: variant)
    }
    for (key) in failedKeys {
        self.storage.remove(key: key)
    }
    // Presumably persists to UserDefaults, which posts didChangeNotification
    // synchronously on this thread while storageLock is still held.
    storage.save()
    self.debug("Stored variants: \(variants)")
}
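
Putting the two code paths together, here is a minimal, SDK-independent sketch of the pattern (the names lock and store() are illustrative, not SDK API). Assuming UserDefaults posts didChangeNotification synchronously on the writing thread, as documented, running this on a single thread hangs at the wait() inside the sink:

import Foundation
import Combine

let lock = DispatchSemaphore(value: 1)       // stand-in for storageLock
var cancellables = Set<AnyCancellable>()

// Reader that takes the lock, analogous to variant() -> sourceVariants()
NotificationCenter.default
    .publisher(for: UserDefaults.didChangeNotification)
    .sink { _ in
        lock.wait()                          // blocks forever: store() below still holds the lock
        defer { lock.signal() }
        _ = UserDefaults.standard.string(forKey: "test_flag")
    }.store(in: &cancellables)

// Writer that holds the lock across a UserDefaults write, analogous to storeVariants()
func store() {
    lock.wait()
    defer { lock.signal() }
    // didChangeNotification is posted synchronously inside set(_:forKey:), so the
    // sink above runs right here, on this thread, before the defer can signal.
    UserDefaults.standard.set("on", forKey: "test_flag")
}

store()   // deadlocks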

Expected Behavior:

Reading a variant should never deadlock, regardless of where the read happens. In the scenario above, the synchronous UserDefaults.didChangeNotification causes the same thread to wait on storageLock while the SDK already holds it, so the second wait never returns.

Suggested Solution:

To remove the deadlock risk, decouple the read and write paths from each other, narrow the scope of the lock, and avoid holding it across calls with external side effects such as the UserDefaults write in storage.save(). A recursive lock, or a dedicated serial queue that serializes storage access, would keep reads and writes thread safe without letting a re-entrant read hang the calling thread. See the sketch below.
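
As one possible shape of such a fix (a hypothetical sketch, not the SDK's actual internals; VariantStore and its queue are made-up names), a storage wrapper could serialize access on a private serial queue and perform the UserDefaults write outside the critical section, so a didChange subscriber can never re-enter a held lock:

import Foundation

typealias Variant = String   // placeholder for the SDK's Variant type, for illustration only

// Hypothetical storage wrapper: a serial queue replaces the semaphore, and persistence
// happens outside the critical section so the synchronous didChangeNotification cannot
// fire while the lock is held.
final class VariantStore {
    private let queue = DispatchQueue(label: "experiment.storage")
    private var variants: [String: Variant] = [:]

    func getAll() -> [String: Variant] {
        return queue.sync { variants }                   // short, side-effect-free critical section
    }

    func put(_ newVariants: [String: Variant]) {
        var snapshot: [String: Variant] = [:]
        queue.sync {
            variants.merge(newVariants) { _, new in new }
            snapshot = variants
        }
        // The UserDefaults write (and its synchronous didChangeNotification) happens here,
        // after the queue block returns, so a subscriber calling getAll() will not deadlock.
        persist(snapshot)
    }

    private func persist(_ snapshot: [String: Variant]) {
        UserDefaults.standard.set(snapshot, forKey: "experiment-variants")   // illustrative key
    }
}

A recursive lock (e.g. NSRecursiveLock) would also tolerate a same-thread re-entrant acquire, but moving side-effectful calls out of the critical section addresses the root cause rather than the symptom.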

Current Impact:

We are hitting this deadlock in our project today; it is not just a theoretical risk. The suggested changes above would improve thread safety and prevent this class of concurrency issue.

bgiori commented 11 months ago

Thanks for this detailed ticket @ThePragmaticArt! Sorry for the delay in the response. I've added this to our backlog.

tarunatsortly commented 1 month ago

Hi @bgiori, do we have any timeline for fixing this issue?