[BUG] message handler doesn't verify if task handler has processed message

mjpieters commented 2 years ago

Describe the bug

When the daemon message handler receives a message that requires the task handler to do something (like start a process), the message is forwarded via a channel, as the task handler runs in a separate thread.

At no point is there any verification that the message has been handled by the task handler, before the message handler decides what response to give to the handled message.

This can lead to weird outcomes, where the message handler states that starting a task failed, but the task handler later on starts the task anyway.

Steps to reproduce the bug

Check out my Flaky Test PR, #378, or apply the changes from the Dev support: add test-log to help debug tests commit in that PR to add test-log to the project + the test(...) attribute update for the daemon::edit::test_edit_flow test.

Run the test in a tight loop:

while RUST_LOG=debug cargo nextest run --workspace --hide-progress-bar --status-level=fail \
    daemon::edit::test_edit_flow
do :; done

At some point the test fails, but logs will show that the Start message failed because the task is locked for editing:

[2022-10-17T16:49:20Z DEBUG pueue_lib::network::protocol] Received message: Start(
        StartMessage {
            tasks: TaskIds(
                [
                    0,
                ],
            ),
            children: false,
        },
    )
[2022-10-17T16:49:20Z DEBUG pueue_lib::network::protocol] Sending message: Failure(
        "The command failed for tasks: 0",
    )

Between these to log entries, the message handler has sent the start message on to the task handler, and immediately produces a response from the current state.

In the case of a test failure, an Edit message is received by the daemon and also handled, putting the task into an unlocked, queued state.

However, the task handler has not yet obtained the mutex lock during all of this time. Even though the edit flow has been completed, the task handler finally can lock the state, finds the task in a queued state and so starts the task:

[2022-10-17T16:49:20Z INFO  pueue_daemon_lib::task_handler::spawn_task] Started task: ls -ahl

Expected behavior

The message handler should only produce a response once the task handler has processed the message too, otherwise it can produce responses that are not related to the actual state of the task handler.

Additional context

I can reproduce this issue on my Macbook Pro with M1 Pro CPU, I've not yet seen it when I run the test in a loop in a docker container running Linux. There is nothing in the implementation from preventing this bug from cropping up in Linux, however.

mjpieters commented 2 years ago

The much simpler solution would be to verify that for each task id that the task state is paused, queued or stashed, and any that are not in that state will not be forwarded to the task handler.

mjpieters commented 2 years ago

I've implemented the simple approach in #378 (specifically, the last commit in the series), I can no longer get this specific test to fail.

Nukesor / pueue