ArroyoSystems / arroyo

Distributed stream processing engine in Rust
https://arroyo.dev
Apache License 2.0
3.44k stars 188 forks source link

Add Sqlite support for the control plane #626

Closed mwylde closed 1 month ago

mwylde commented 1 month ago

As of Arroyo 0.10, the system now ships as a single binary and can run the entire system with a single command. However, it still relies on an external dependency: postgres.

The Arroyo control plane (API and controller) rely on a database to store and set the expected configuration for the system (the list of pipelines that should exist + their expected configurations). Specifically, the expected state is set declaratively by the user via calls to the API, then the controller reconciles those desired states with the actual state it observes from its environment. It also writes back those observations of the actual state back to the DB, where they can be served to the user via the API and web ui. Beyond the reconciliation system, the database is also used to store other user configuration like connection profile, tables, and global UDFs.

Historically, Arroyo has relied on postgres as the database backing these systems. Postgres is a great database; it's extremely reliable, scalable, and powerful. However, it can be a bit complex and expensive to operate. It's also not a great fit for embedding—like we do for the arroyo-single container, which spins up a local postgres database. For development and small production deployments we want something smaller and simpler.

Enter sqlite. Sqlite is a small, fast, sql database that's designed for embedding. With support for sqlite as the backing store for the control plane we could ship a fully self-contained arroyo system that can easily run anywhere—in docker, on a laptop, lightweight orchestration services, etc.

--

This PR introduces optional support for sqlite, removing the need to run postgres. When considering the implementation, I wanted to ensure that we did not need to duplicate all of our SQL logic, or require wrapping code for every query. However, our existing SQL infrastructure is very postgres-specific: we use cornucopia, which generates Rust code based on postgres queries by talking to an actual postgres database. We think this is a great approach and a very productive way to manage SQL without relying on an ORM. However, we need a way to support sqlite as well.

After considering various options (migrating to an ORM like Diesel or SeaORM, wrapping every query in a abstraction that can call out to either database, or building our own lightweight database abstraction layer) I decided to leverage the existing code generation capabilities of cornucopia, forking it to add the ability to generate code to query SQLite as well.

This has a number of upsides; we get to reuse our existing code and infrastructure, and there's no per-query or per-call engineering overhead to support both databases. The main downside is all queries must be compatible with both postgres and sqlite, which means we can't take advantage of some of the more powerful features of each. There is also a risk that we will add queries that do not behave the same in both databases, which must be mitigated with careful testing (we do however automatically validate that queries are valid in both databases as part of the build process).

--

For users, the default behavior remains to use postgres. To enable SQLIte, run the control plane with DATABASE=sqlite; by default the database will be placed in ${config_dir}/arroyo/config.sqlite but an alternative path can be specified with DATABASE_PATH.