fix: race condition in asynchronous runner

guibou commented 3 months ago

This closes bug #82.

The blockUntilDone and jobQueueWorker had a race condition. When the latest job was dequeueJob from the queue and before the worker pick the semaphore with waitQSem, the blockUntilDone function was able to terminate (e.g. the queue was empty and all semaphore were available).

The result of this race condition was that before the latest job completion, the runner will terminate and this would have three possible outcomes:

Either an AsyncCancel exception in the job, when workers were canceled, leading to a failure of the latest test (The exception was caught by the test runner). This exception would then fill the status of the job, allowing the correct termination of the waiter job. If the exception was received by the worker "outside" of the "test" logic, it would bubble up and be rethrown in the runner thread, killing it which would cause the second outcome,
Either an Main: thread blocked indefinitely in an MVar operation when the waiter job, still waiting for the end of a job or the fastFail, realised that the fastFail was no longer writable (runner job was dead).

Note that these are the observable results of this race condition, but there is more fun, the sequential block was actually NOT guaranteed to be run in perfect isolation (for the same reason: early termination of the blockUntilDone logic).

In order to fix this, picking a job in the queue and notifying that a worker is working should be atomic. (e.g. dequeueJob and waitQSem in jobQueueWorker should be atomic). This was not trivial / simple to do by using the QSem, whith is not composable with the dequeueJob (e.g. IO based synchro vs STM ones) so instead this commit replaces the N semaphore by one TVar Int accounting for the number of currently running jobs. The atomicity of the dequeue and notification is now ensured by an STM transaction.

Note that this implementation lost the feature of knowing which worker thread is used or not: it was previously a Vector of QSem, each cell of the vector containing the information if the worker was used or not. This is replaced by a global count of worker used. If this feature is requested in the future, the TVar Int could be replaced by a TVar (Vector Bool) or any similar bitfield handled by STM.

A note about blockUntilDone: the wait for the empty job queue and waiting for the number of used worker to be 0 is now done in the same STM transaction. In the current context it does not change anything (a running job is not able, as much as we know, to push jobs in the queue), so the queue cannot be refilled once empty and there is only one producer of the queue: the main walk, goForest. However having both in the same transaction does not impact the performance and may protect a future refactoring allowing new jobs to be generated on the fly.

guibou commented 3 months ago

Note: I'm sorry I don't really know what test to write. This is a race condition which chances of occurence are reduced by the number of it in the test suite. So it would need a dedicated test suite with only one test. And in this context, chances of having the race condition happening are limited, even when runned with --continous, it takes sometime up to 30K iterations before failing.

NorfairKing commented 3 months ago

@guibou This is amazing and I'll look through it more closely when I get some time. For a test, maybe you could have a test that runs sydTest inside the test with 10000 workers and only one it that does True `shouldBe` True?

NorfairKing commented 3 months ago

@guibou I tried writing a failing test case but even this didn't fail:

spec :: Spec
spec = do
  it "Can run an empty test suite with many runners." $ do
    let workers = 100
    let settings = defaultSettings {settingThreads = Asynchronous workers}
    let runs = 100000
    replicateM_ runs $ sydTestWith settings $ do
      it "is a simple assertion" $ do
        True `shouldBe` True

guibou commented 3 months ago

@NorfairKing I still don't know what to do for the unit test.

I'm unsure having a lot of workers increases the chances of a failure, because it will increase the time that the blockUntilDone takes to walk the list of QSem. My current bet is that we want really few workers (so 1).

Actually, this seems to work:

spec :: Spec
spec = describe "WTF" $ do
  do
    let workers = 100
    let settings = defaultSettings {settingThreads = Asynchronous 1}
    let runs = 1000000
    it "WTF" $ do
      replicateM_ runs $ do
        sydTestWith settings $ do
          it "WTF" $ do
            True `shouldBe` True

NorfairKing commented 3 months ago

@guibou Let's add that as a regression test then. Does it still error when we turn off output? If so we can do that to make the output not flip out.

NorfairKing commented 3 months ago

Ok this reproduces the same problem:

spec :: Spec
spec = do
  it "Can run an empty test suite with many runners." $ do
    let settings = defaultSettings {settingThreads = Asynchronous 1}
    let runs = 100000
    let simpleSpec :: Spec
        simpleSpec =
          it "is a simple assertion" $ do
            True `shouldBe` True
    specForest <- execTestDefM settings simpleSpec
    replicateM_ runs $ do                       
      void $ runSpecForestAsynchronously settings 1 specForest

NorfairKing commented 3 months ago

@guibou I put the regression test on branch race-condition and your fix on the branch race-condition-fix. It looks like your fix works! Can you please get stack test --pedantic --no-rerun and nix flake check passing? Then I can merge. Please add your company to the contributors list. This lets you use sydtest commercially.

guibou commented 3 months ago

@NorfairKing

Sorry for the delay, nix flake check was taking days because of the mongodb build. I finally removed mongodb from the shell and was able to run the commands.

Regarding stack test, I'm surprised to not see the GHC warning during my development session in HLS
I've pushed the fix on the branch on my repo and opened a new MR (I don't know how to push directly on the current MR), https://github.com/NorfairKing/sydtest/pull/86

NorfairKing commented 3 months ago

@guibou merging now.

I've pushed the fix on the branch on my repo and opened a new MR (I don't know how to push directly on the current MR),

You can just push to the same branch and github takes care of it.

NorfairKing commented 3 months ago

Merged manually.

NorfairKing / sydtest

fix: race condition in asynchronous runner #83