ArroyoSystems / arroyo

Distributed stream processing engine in Rust
https://arroyo.dev
Apache License 2.0
3.81k stars 220 forks

Fix race condition that could stall scheduling #712

Closed mwylde closed 3 months ago

mwylde commented 3 months ago

During scheduling, the controller sends the task assignments to the workers, then waits for the tasks to start up. Each worker engine then constructs its graph and starts the "local nodes" (i.e., the ones that it is responsible for running).

Each operator on startup follows these steps:

  1. Notify the controller of task startup
  2. Call on_start
  3. Wait for all other operators to have started
  4. Call run

If any of these steps panics, a TaskFailed message is sent to the controller.
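The startup sequence above can be sketched with the standard library alone. This is a simplified illustration, not Arroyo's actual code: the `Operator` struct, `ControlMsg::TaskStarted`, `ControlMsg::TaskFinished`, `start_operator`, and `run_pipeline` are hypothetical names, and threads plus `std::sync::Barrier` stand in for whatever tasks and synchronization the engine really uses. Only `TaskFailed`, `on_start`, and `run` come from the description.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};
use std::sync::mpsc::{channel, Sender};
use std::sync::{Arc, Barrier};
use std::thread;

/// Messages sent to the "controller". Only `TaskFailed` is named in the
/// description; the other variants are assumptions for this sketch.
#[derive(Debug, PartialEq)]
pub enum ControlMsg {
    TaskStarted(usize),
    TaskFailed(usize),
    TaskFinished(usize),
}

/// Hypothetical operator: closures make it easy to inject a panic.
pub struct Operator {
    pub on_start: Box<dyn FnOnce() + Send>,
    pub run: Box<dyn FnOnce() + Send>,
}

fn start_operator(id: usize, op: Operator, barrier: Arc<Barrier>, tx: Sender<ControlMsg>) {
    // Step 1: notify the controller of task startup.
    tx.send(ControlMsg::TaskStarted(id)).unwrap();

    let Operator { on_start, run } = op;
    let result = catch_unwind(AssertUnwindSafe(move || {
        on_start(); // Step 2: call on_start.
        // Step 3: wait for all other operators to have started.
        // In this sketch, a panic in step 2 means this thread never reaches
        // the barrier, so the remaining operators would block here forever.
        barrier.wait();
        run(); // Step 4: call run.
    }));

    // A panic in any step is reported to the controller as TaskFailed.
    let msg = match result {
        Ok(()) => ControlMsg::TaskFinished(id),
        Err(_) => ControlMsg::TaskFailed(id),
    };
    tx.send(msg).unwrap();
}

/// Starts every operator on its own thread and returns the messages the
/// controller would have received, in arrival order.
pub fn run_pipeline(ops: Vec<Operator>) -> Vec<ControlMsg> {
    let barrier = Arc::new(Barrier::new(ops.len()));
    let (tx, rx) = channel();
    let handles: Vec<_> = ops
        .into_iter()
        .enumerate()
        .map(|(id, op)| {
            let (barrier, tx) = (barrier.clone(), tx.clone());
            thread::spawn(move || start_operator(id, op, barrier, tx))
        })
        .collect();
    drop(tx); // Close the channel once all operator threads are done.
    for handle in handles {
        handle.join().unwrap();
    }
    rx.iter().collect()
}
```

With this shape, a panic in `run` is reported cleanly because every operator has already passed the barrier. A panic in `on_start` is the dangerous case, since a plain barrier has no way to learn that one participant will never arrive.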

However, if an operator panicked in step 2 at the wrong time, the pipeline could end up stuck while the controller thought it was healthy in the running state.

Why?

For the problem to occur, all three issues are required.

This PR fixes the first and third issues, and ensures that a pipeline will either get into a true running state or fail and get restarted by the controller.

Fixing the second issue—for example by allowing the barrier to be canceled on panic—is left as a future improvement.
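As a rough idea of what a cancelable barrier could look like, here is a minimal single-use sketch built on `Mutex` and `Condvar`. It is not part of this PR and not Arroyo code: `CancelableBarrier`, `WaitResult`, and the `cancel` method are hypothetical names, and the sketch assumes blocking threads rather than whatever task model the engine uses.

```rust
use std::sync::{Condvar, Mutex};

/// Outcome of waiting on the barrier (hypothetical type for this sketch).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum WaitResult {
    /// All participants arrived.
    Released,
    /// A participant gave up (e.g. it panicked) before everyone arrived.
    Canceled,
}

struct State {
    remaining: usize, // participants that have not arrived yet
    canceled: bool,
}

/// Single-use barrier that can be canceled while participants are waiting.
pub struct CancelableBarrier {
    state: Mutex<State>,
    cv: Condvar,
}

impl CancelableBarrier {
    pub fn new(participants: usize) -> Self {
        CancelableBarrier {
            state: Mutex::new(State { remaining: participants, canceled: false }),
            cv: Condvar::new(),
        }
    }

    /// Blocks until every participant has arrived or the barrier is canceled.
    pub fn wait(&self) -> WaitResult {
        let mut s = self.state.lock().unwrap();
        if s.canceled {
            return WaitResult::Canceled;
        }
        s.remaining = s.remaining.saturating_sub(1);
        if s.remaining == 0 {
            self.cv.notify_all();
            return WaitResult::Released;
        }
        // Loop to tolerate spurious wakeups.
        while s.remaining > 0 && !s.canceled {
            s = self.cv.wait(s).unwrap();
        }
        if s.remaining == 0 { WaitResult::Released } else { WaitResult::Canceled }
    }

    /// Wakes all waiters with `Canceled`. Has no effect once the barrier has
    /// already released, so every participant observes the same outcome.
    pub fn cancel(&self) {
        let mut s = self.state.lock().unwrap();
        if s.remaining > 0 {
            s.canceled = true;
            self.cv.notify_all();
        }
    }
}
```

An operator's panic handler could call `cancel()` so that the other operators return from `wait()` with `Canceled` and exit, instead of blocking indefinitely.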