ArroyoSystems / arroyo

Distributed stream processing engine in Rust
https://arroyo.dev
Apache License 2.0
3.44k stars 188 forks source link

Pipeline clusters #662

Closed mwylde closed 2 weeks ago

mwylde commented 2 weeks ago

This PR adds a new way to run Arroyo, currently called "pipeline clusters" in lieu of a better name.

In this mode, the user pre-configures Arroyo with a query, which the system will run on startup. Other queries can be schedule via the API or UI, but the initial one is managed by the process—so it will be automatically stopped (with a checkpoint) when the process is stopped.

As a user, it looks like this:

$ arroyo run --help
Run a query as a local pipeline cluster

Usage: arroyo run [OPTIONS] [QUERY]

Arguments:
  [QUERY]  The query to run [default: -]

Options:
  -n, --name <NAME>                Name for this pipeline
      --database <DATABASE>        Path to a database file to save to or restore from
  -p, --parallelism <PARALLELISM>  Number of parallel subtasks to run [default: 1]
  -h, --help                       Print help

$ arroyo run query.sql
2024-06-18T17:20:40.113741Z  INFO arroyo::run: Job transitioned to Scheduling
2024-06-18T17:20:40.321154Z  INFO arroyo::run: Job transitioned to Running
2024-06-18T17:20:40.423451Z  INFO arroyo::run: Pipeline running... dashboard at http://localhost:53093/pipelines/pl_PmdK8u3ydK
{"AVG(price)":64651.984675480766}
{"AVG(price)":64641.925167410714}
{"AVG(price)":64640.880196049526}
^C2024-06-18T17:20:56.872535Z  INFO arroyo::run: Stopping pipeline with a final checkpoint...
2024-06-18T17:20:57.295913Z  INFO arroyo::run: Job transitioned to Stopped

Users can also provide the query as an environment variables (ARROYO__QUERY) which may be helpful when running as a docker container, for example in ECS.

This PR also adds a new sink—StdOut—which simply prints outputs to stdout. It's the default sink for pipeline clusters, and provides an interactive experience in the console.