Closed mjpieters closed 2 years ago
The much simpler solution would be to verify that for each task id that the task state is paused, queued or stashed, and any that are not in that state will not be forwarded to the task handler.
I've implemented the simple approach in #378 (specifically, the last commit in the series), I can no longer get this specific test to fail.
Describe the bug
When the daemon message handler receives a message that requires the task handler to do something (like start a process), the message is forwarded via a channel, as the task handler runs in a separate thread.
At no point is there any verification that the message has been handled by the task handler, before the message handler decides what response to give to the handled message.
This can lead to weird outcomes, where the message handler states that starting a task failed, but the task handler later on starts the task anyway.
Steps to reproduce the bug
Check out my Flaky Test PR, #378, or apply the changes from the Dev support: add test-log to help debug tests commit in that PR to add
test-log
to the project + thetest(...)
attribute update for thedaemon::edit::test_edit_flow
test.Run the test in a tight loop:
At some point the test fails, but logs will show that the Start message failed because the task is locked for editing:
Between these to log entries, the message handler has sent the start message on to the task handler, and immediately produces a response from the current state.
In the case of a test failure, an
Edit
message is received by the daemon and also handled, putting the task into an unlocked, queued state.However, the task handler has not yet obtained the mutex lock during all of this time. Even though the edit flow has been completed, the task handler finally can lock the state, finds the task in a queued state and so starts the task:
Expected behavior
The message handler should only produce a response once the task handler has processed the message too, otherwise it can produce responses that are not related to the actual state of the task handler.
Additional context
I can reproduce this issue on my Macbook Pro with M1 Pro CPU, I've not yet seen it when I run the test in a loop in a docker container running Linux. There is nothing in the implementation from preventing this bug from cropping up in Linux, however.