Closed · guibou closed this 3 months ago
Note: I'm sorry, I don't really know what test to write. This is a race condition whose chances of occurrence are reduced by the number of `it`s in the test suite, so it would need a dedicated test suite with only one test. And in that context, the chances of the race condition happening are limited: even when run with `--continuous`, it sometimes takes up to 30K iterations before failing.
@guibou This is amazing and I'll look through it more closely when I get some time.
For a test, maybe you could have a test that runs `sydTest` inside the test with 10000 workers and only one `it` that does ``True `shouldBe` True``?
@guibou I tried writing a failing test case but even this didn't fail:
```haskell
spec :: Spec
spec = do
  it "Can run an empty test suite with many runners." $ do
    let workers = 100
    let settings = defaultSettings {settingThreads = Asynchronous workers}
    let runs = 100000
    replicateM_ runs $ sydTestWith settings $ do
      it "is a simple assertion" $ do
        True `shouldBe` True
```
@NorfairKing I still don't know what to do for the unit test.
I'm unsure whether having a lot of workers increases the chances of a failure, because it will increase the time that `blockUntilDone` takes to walk the list of `QSem`s. My current bet is that we want really few workers (so `1`).
Actually, this seems to work:
```haskell
spec :: Spec
spec = describe "WTF" $ do
  do
    let workers = 100
    let settings = defaultSettings {settingThreads = Asynchronous 1}
    let runs = 1000000
    it "WTF" $ do
      replicateM_ runs $ do
        sydTestWith settings $ do
          it "WTF" $ do
            True `shouldBe` True
```
@guibou Let's add that as a regression test then. Does it still error when we turn off output? If so we can do that to make the output not flip out.
Ok this reproduces the same problem:
```haskell
spec :: Spec
spec = do
  it "Can run an empty test suite with many runners." $ do
    let settings = defaultSettings {settingThreads = Asynchronous 1}
    let runs = 100000
    let simpleSpec :: Spec
        simpleSpec =
          it "is a simple assertion" $ do
            True `shouldBe` True
    specForest <- execTestDefM settings simpleSpec
    replicateM_ runs $ do
      void $ runSpecForestAsynchronously settings 1 specForest
```
@guibou I put the regression test on the branch `race-condition` and your fix on the branch `race-condition-fix`.
It looks like your fix works!
Can you please get `stack test --pedantic --no-rerun` and `nix flake check` passing? Then I can merge.
Please add your company to the contributors list. This lets you use sydtest commercially.
@NorfairKing
Sorry for the delay: `nix flake check` was taking days because of the `mongodb` build. I finally removed `mongodb` from the shell and was able to run the commands.

As for `stack test`, I'm surprised not to have seen the GHC warning during my development session in HLS.

@guibou merging now.
I've pushed the fix to the branch on my repo and opened a new MR (I don't know how to push directly to the current MR).
You can just push to the same branch and GitHub takes care of it.
Merged manually.
This closes bug #82.

The `blockUntilDone` and `jobQueueWorker` functions had a race condition. When the latest job was taken with `dequeueJob` from the queue, and before the worker picked the semaphore with `waitQSem`, the `blockUntilDone` function was able to terminate (e.g. the queue was empty and all semaphores were available).

The result of this race condition was that the `runner` would terminate before the latest job completed, which had three possible outcomes:

- An `AsyncCancelled` exception in the job, when workers were canceled, leading to a failure of the latest test (the exception was caught by the test runner). This exception would then fill the status of the job, allowing the correct termination of the `waiter` job.
- If the exception was received by the worker "outside" of the "test" logic, it would bubble up and be rethrown in the `runner` thread, killing it, which would cause the second outcome: `Main: thread blocked indefinitely in an MVar operation`, when the `waiter` job, still waiting for the end of a job or for the fastFail, realised that the `fastFail` was no longer writable (the `runner` job was dead).

Note that these are the observable results of this race condition, but there is more fun: the `sequential` block was actually NOT guaranteed to be run in perfect isolation (for the same reason: early termination of the `blockUntilDone` logic).

In order to fix this, picking a job from the queue and notifying that a worker is working should be atomic (i.e. `dequeueJob` and `waitQSem` in `jobQueueWorker` should be atomic). This was not trivial to do with `QSem`, which is not composable with `dequeueJob` (`IO`-based synchronisation vs `STM`), so instead this commit replaces the N semaphores with one `TVar Int` accounting for the number of currently running jobs. The atomicity of the dequeue and the notification is now ensured by an `STM` transaction.

Note that this implementation loses the feature of knowing which worker thread is in use or not: previously there was a `Vector` of `QSem`, each cell of the vector holding the information of whether that worker was in use. This is replaced by a global count of workers in use. If this feature is requested in the future, the `TVar Int` could be replaced by a `TVar (Vector Bool)` or any similar bitfield handled by `STM`.

A note about `blockUntilDone`: waiting for the job queue to be empty and waiting for the number of used workers to reach `0` are now done in the same STM transaction. In the current context this does not change anything (as far as we know, a running job is not able to push jobs into the queue, so the queue cannot be refilled once empty, and there is only one producer for the queue: the main walk, `goForest`). However, having both in the same transaction does not impact performance and may protect a future refactoring that allows new jobs to be generated on the fly.
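The atomic dequeue-and-count scheme can be sketched in miniature. This is a hypothetical illustration, not sydtest's actual code: the `TQueue (IO ())` job queue and the names `takeJob`, `worker`, and `blockUntilDone` below are invented for the sketch. The key point is that taking a job and bumping the running-jobs counter happen in one `STM` transaction, and the done-check reads "queue empty" and "no running jobs" in a single transaction as well:

```haskell
-- Hypothetical sketch of the fix; names and types are invented for
-- illustration and are not sydtest's real implementation.
import Control.Concurrent (forkIO)
import Control.Concurrent.STM
import Control.Monad (forever, replicateM_)

-- Dequeue a job AND bump the running-jobs counter in one STM transaction,
-- closing the window in which a job is neither queued nor counted as running.
takeJob :: TQueue (IO ()) -> TVar Int -> STM (IO ())
takeJob queue running = do
  job <- readTQueue queue -- retries (blocks) while the queue is empty
  modifyTVar' running (+ 1)
  pure job

-- A worker loop: atomically claim a job, run it, then decrement the counter.
worker :: TQueue (IO ()) -> TVar Int -> IO ()
worker queue running = forever $ do
  job <- atomically $ takeJob queue running
  job
  atomically $ modifyTVar' running (subtract 1)

-- Wait until the queue is empty AND no job is running, both checked in a
-- single transaction, so we never observe "empty queue" while a job is
-- in flight between dequeue and completion.
blockUntilDone :: TQueue (IO ()) -> TVar Int -> IO ()
blockUntilDone queue running = atomically $ do
  empty <- isEmptyTQueue queue
  n <- readTVar running
  check (empty && n == 0)

main :: IO ()
main = do
  queue <- newTQueueIO
  running <- newTVarIO 0
  done <- newTVarIO (0 :: Int)
  _ <- forkIO (worker queue running)
  -- Enqueue five trivial jobs, then wait for full completion.
  atomically $ replicateM_ 5 $ writeTQueue queue (atomically (modifyTVar' done (+ 1)))
  blockUntilDone queue running
  readTVarIO done >>= print -- all five jobs finished before blockUntilDone returned
```

In the old `QSem`-based design, the analogue of `takeJob` was two separate steps in two synchronisation worlds (`dequeueJob` in `STM`, `waitQSem` in `IO`), leaving the gap in which `blockUntilDone` could see an empty queue and all semaphores free.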