aquametalabs / aquameta

Web development platform built entirely in PostgreSQL
GNU General Public License v3.0
1.1k stars 52 forks source link

bundle: Version control for schema #288

Open erichanson opened 10 months ago

erichanson commented 10 months ago

Right now bundle just versions "regular" data, actual rows in normal tables. Tracking schema rows (meta.table, meta.column, meta.schema, meta.function, meta.type, etc.) is the next frontier. Past experiments have have shown it's possible to at least commit meta rows and then check them out, but certainly not with any level of robustness or fault tolerance.

Making this change necessitates some clarifications of the information model:

New Columns

In the current model, a commit is a set of:

When a row is added, it gets a rowset_row_field for every column in the row's table, basically whatever is present in the db at the time.

Now, let's say the user does a alter table public.x add column y. What happens?

a) Untracked Meta Row

Constraints will have a pg_depend-level dependency on the column (I believe) and the column will have a dependency on its table.

b) New Fields on Old Rows

After column creation, a new field appears on each row in the table. What are these?? Are they "untracked_fields"? (Not a thing). Are they just another field change, from void to existence? It is a different category of change, that does not fit into the above model.

c) Field-level Versioning vs Row-level Versioning

This distinction is at the heart of all the questions above.

It's tempting to try to pursue "field-level" tracking, and add a new dimension to the information model for "untracked_field" and "field_added". I don't believe this is impossible, but it's two steps (leaps) forward. Ideally, multiple bundles could version in an extremely granular way, version control a single field, or a single column. One bundle defines a table and some columns, but another bundle adds a column to that table and tracks its fields. This is beyond what even git does AFAICT; the git equivalent would be multiple git repositories operating in the same directory, version controlling different files simultaneously. This is the Semantic Web model, it's all just triples, a commit is a collection of meta.field_ids, period.

The simpler path is to, for this next step at least, just say that a commit saves whatever fields are present on the row at commit time. In head_db_stage, we can at least detect and display that a row has new fields, or fields deleted.

It's going to be important to keep an eye on the distinction between a field whose value has been set to NULL, and a field whose column has been deleted.

Some new functions get_commit_fields(commit_id), get_commit_rows(commit_id), get_db_fields(commit_id) and get_db_rows(commit_id) are much more axiomatic and precise (and fast) functions for addressing this space. The get_commit_* functions return exactly which rows and fields were in the previous commit, disregarding the working copy entirely. get_db_rows() returns rows from the supplied commit, along with whether or not each row exists in the db. get_db_fields() returns all fields present in the database on the supplied commit's rows; rows from the supplied commit that have been deleted are not present.

These functions are the foundation for refactoring the entire space with an eye on schema changes.

Column Renames

In terms of meta, they're a deleted row and an added row in meta.column. Maybe we can do something smarter. One cool idea is to detect them by seeing if the oid is the same.

Deleted Columns

Again, if we're doing field-level versioning, this is a new category of change, a field_delete. If not, just include whatever columns are present in the db at commit-time, in which case this is easy.

Object Dependencies

See #283.