cmutel commented 3 months ago

Overview

Based on user needs and experience of highly qualified partners, we want to refactor our database schema. We want to add the following:

Embrace the database as a graph, and move Database, location ids, and all the LCIA stuff to a single graph. Projects remain a separate object with their own storage.
Embrace event sourcing, to allow for time travel. In this model, we never alter existing records, but treat the database itself as an append-only log. This is not only helpful in reversing mistaken changes, but also provides an audit record.
Embrace branching. Building on the work in bw_aggregation and multifunctional, we can have multiple versions of a database available for calculations. This allows users to test out changes before changing the main branch.
Fix some ugliness in the current labeling. We have grown organically over time, but the table labels haven't always kept up.
Build for deployment on multiple databases (starting with SQLite and Postgres), and for multiuser distributed data generation.

Table schema

Graphs have nodes and edges [^1] - we can just call them that.

CREATE TABLE node (
    revision_id bigint PRIMARY KEY,
    persistent_id bigint NOT NULL,
    transaction_id bigint NOT NULL,
    branch_id int DEFAULT 0,
    deleted boolean NOT NULL,
    payload jsonb DEFAULT '{}'::jsonb,
    node_type text NOT NULL
);

CREATE TABLE public.edge (
    revision_id bigint PRIMARY KEY,
    persistent_id bigint NOT NULL,
    source_id bigint NOT NULL,
    target_id bigint NOT NULL,
    transaction_id bigint,
    branch_id int DEFAULT 0,
    deleted boolean NOT NULL,
    payload jsonb DEFAULT '{}'::jsonb,
    edge_type text NOT NULL
);

That's hard to understand without context; let's look at the columns.

persistent_id and revision_id: Both of these are snowflake ids generated client-side. The persistent_id stays the same for the object over all revisions; revision_id changes on every write (including delete events). revision_id is also used as the primary key, but not for foreign keys, which need a reference to the object, not to a specific revision of that object. To get the latest version of the data (per branch), we would then need to either load everything and sort client-side (sounds ugly but apparently works in production, as the average number of edits per node is low), or do a group by / limit query to get the latest version of each node or edge which hasn't been deleted.
source_id and target_id: Our graph is directed, though the meaning of edge direction depends on edge type. These two columns both provide data and guarantee referential integrity for edges; each is a foreign key to a node persistent_id.
transaction_id: Snowflake id and foreign key to transaction table, which stores metadata about transactions. In most cases there won't be much (i.e. no commit message or similar), but occasionally users will do a larger set of changes and wrap it in a transaction, or merge a branch to "main".
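
The issue does not specify a snowflake layout, so the following is only a minimal sketch of what a client-side generator could look like. It assumes the classic 41/10/12-bit split (milliseconds since a custom epoch, worker id, per-millisecond sequence); `SnowflakeGenerator`, `EPOCH_MS`, and the bit widths are hypothetical names and choices, not part of the proposal.

```python
import threading
import time

# Assumed layout: 41 bits of milliseconds since a custom epoch,
# 10 bits of worker id, 12 bits of per-millisecond sequence.
EPOCH_MS = 1_577_836_800_000  # 2020-01-01T00:00:00Z, an arbitrary choice
WORKER_BITS = 10
SEQUENCE_BITS = 12
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1


class SnowflakeGenerator:
    """Hypothetical client-side generator of time-ordered 63-bit ids."""

    def __init__(self, worker_id: int):
        if not 0 <= worker_id < (1 << WORKER_BITS):
            raise ValueError("worker_id does not fit in WORKER_BITS")
        self.worker_id = worker_id
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    def next_id(self) -> int:
        with self._lock:
            now = int(time.time() * 1000) - EPOCH_MS
            if now < self._last_ms:
                # Clock went backwards: reuse the last timestamp to stay monotonic.
                now = self._last_ms
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & MAX_SEQUENCE
                if self._sequence == 0:
                    # Sequence exhausted for this millisecond: borrow the next one.
                    now += 1
            else:
                self._sequence = 0
            self._last_ms = now
            return (
                (now << (WORKER_BITS + SEQUENCE_BITS))
                | (self.worker_id << SEQUENCE_BITS)
                | self._sequence
            )


generator = SnowflakeGenerator(worker_id=1)
ids = [generator.next_id() for _ in range(10_000)]
```

Because the timestamp occupies the high bits, ids from one generator sort in creation order, which is what makes "highest revision_id" a usable stand-in for "latest revision". Ids from different workers are only ordered to within clock skew.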

The transaction table is simple:

CREATE TABLE transaction (
    id bigint PRIMARY KEY,
    transaction_type text NOT NULL,
    message text NULL,
    branch_id bigint DEFAULT 0
);

This table could change in the future; we're not sure yet what the real user stories are.

branch_id: Foreign key to branch table, which gives metadata about branches. We aren't sure about the user stories here either, so start with something simple:

CREATE TABLE branch (
    id SERIAL PRIMARY KEY,
    label text NOT NULL
);

We need to create a default row in this table with the id 0 and label "main" on table creation.

Branches are normally for investigating alternatives, or doing a big update. We should allow branches to be detached; this can be useful if the differences between the branch and "main" grow big enough that the user is describing a different product.

deleted: Flag indicating if this object has been deleted. To reverse a change or set of changes, don't use the deleted column, but just write the data at its known good state as a new event.
payload: The good stuff. Most users won't know anything about the database schema, but will only see this data, and the interface we choose to expose. This stands in contrast to the current schema, which has what are in effect generated columns, but managed at the app instead of database level.

Each payload is the complete data, not a diff. Diffs can be generated on demand if needed.
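
As a rough illustration of generating a diff on demand from two complete payloads, here is a minimal sketch. The function name `payload_diff` and the `added` / `removed` / `changed` output shape are assumptions for this example only, and it compares top-level keys without recursing into nested values.

```python
from typing import Any

_MISSING = object()  # sentinel, so that a stored None still counts as a value


def payload_diff(old: dict[str, Any], new: dict[str, Any]) -> dict[str, dict[str, Any]]:
    """Compare two complete payloads and report top-level differences."""
    diff: dict[str, dict[str, Any]] = {"added": {}, "removed": {}, "changed": {}}
    for key in old.keys() | new.keys():
        before = old.get(key, _MISSING)
        after = new.get(key, _MISSING)
        if before is _MISSING:
            diff["added"][key] = after
        elif after is _MISSING:
            diff["removed"][key] = before
        elif before != after:
            diff["changed"][key] = {"old": before, "new": after}
    return diff


# Two full revisions of the same (made-up) node payload
revision_1 = {"name": "steel", "unit": "kg", "location": "CH"}
revision_2 = {"name": "steel", "unit": "t", "comment": "switched to tonnes"}
changes = payload_diff(revision_1, revision_2)
```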

node_type and edge_type: Generated columns drawn from payload, and used only to make indexing work better. Validation is done via pydantic at the app level.

We want to stick with peewee, at least for now, so generated columns will need to be custom field classes.

This schema is for Postgres, and would require some small changes for SQLite (e.g. peewee doesn't yet support jsonb, just json).
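
To give a feel for the SQLite side, here is a sketch of the node table using only the standard-library `sqlite3` module (no peewee). It stores the payload as JSON text and derives `node_type` with a database-level generated column. The payload key `type` is an assumption, and this relies on SQLite 3.31+ (generated columns) built with the JSON functions, which is the case for most Python distributions.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE node (
        revision_id INTEGER PRIMARY KEY,
        persistent_id INTEGER NOT NULL,
        transaction_id INTEGER NOT NULL,
        branch_id INTEGER DEFAULT 0,
        deleted INTEGER NOT NULL,
        payload TEXT DEFAULT '{}',
        -- Assumption: the node type lives under the "type" key of the payload
        node_type TEXT GENERATED ALWAYS AS (json_extract(payload, '$.type')) VIRTUAL
    )
    """
)
# Virtual generated columns can be indexed like ordinary columns
conn.execute("CREATE INDEX node_type_index ON node (node_type)")

conn.execute(
    "INSERT INTO node (revision_id, persistent_id, transaction_id, deleted, payload) "
    "VALUES (?, ?, ?, 0, ?)",
    (1, 100, 10, json.dumps({"type": "process", "name": "steel production"})),
)
row = conn.execute(
    "SELECT persistent_id, node_type FROM node WHERE node_type = 'process'"
).fetchone()
```

SQLite's INTEGER holds signed 64-bit values, so snowflake ids fit without changes.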

We also have some indices:

CREATE INDEX "edge_omni_index" ON public.edge USING btree (persistent_id, branch_id, transaction_id DESC);
CREATE INDEX "edge_source_index" ON public.edge USING btree (edge_type, source_id DESC);
CREATE INDEX "edge_target_index" ON public.edge USING btree (edge_type, target_id DESC);

These will be adjusted (and indices added for nodes) over time.
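
The "group by" lookup mentioned in the description of persistent_id and revision_id, together with reverting by appending the known good state, might look like the sketch below. It uses the standard-library `sqlite3` module and a trimmed-down node table for brevity. It assumes that a higher revision_id always means a later revision, and it only sees rows written directly on the requested branch (no inheritance from "main"); both are simplifications. The helper names `append` and `latest` are hypothetical.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE node (
        revision_id INTEGER PRIMARY KEY,
        persistent_id INTEGER NOT NULL,
        branch_id INTEGER DEFAULT 0,
        deleted INTEGER NOT NULL,
        payload TEXT DEFAULT '{}'
    )
    """
)


def append(revision_id, persistent_id, payload, deleted=False, branch_id=0):
    """Write a new event; existing rows are never updated or removed."""
    conn.execute(
        "INSERT INTO node VALUES (?, ?, ?, ?, ?)",
        (revision_id, persistent_id, branch_id, int(deleted), json.dumps(payload)),
    )


LATEST = """
    SELECT n.persistent_id, n.payload
    FROM node AS n
    JOIN (
        SELECT MAX(revision_id) AS revision_id
        FROM node
        WHERE branch_id = ?
        GROUP BY persistent_id
    ) AS newest ON newest.revision_id = n.revision_id
    WHERE n.deleted = 0
"""


def latest(branch_id=0):
    """Return {persistent_id: payload} for the newest non-deleted revisions."""
    return {pid: json.loads(raw) for pid, raw in conn.execute(LATEST, (branch_id,))}


append(1, 100, {"name": "steel", "amount": 1.0})
append(2, 100, {"name": "steel", "amount": 99.0})  # mistaken edit
append(3, 200, {"name": "coal"})
append(4, 200, {}, deleted=True)                   # delete event
append(5, 100, {"name": "steel", "amount": 1.0})   # revert: known good state, new event

current = latest()
total_events = conn.execute("SELECT COUNT(*) FROM node").fetchone()[0]
```

Filtering on `deleted` happens after picking the newest revision, so an object whose latest event is a delete disappears from the result instead of falling back to an older revision.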

Brightway `Databases`

A Brightway Database is a subgraph with some additional metadata. The label and metadata can be stored as a node; it is TBD if we need to explicitly add edges to indicate the "belongs to" relationship, or if we can follow the current paradigm and give a reference to the Database in each process or product node (current code looks like {'database': 'Foo'}).

ORM

There is a lot of room to play around here, but I would like to reduce the number of breaking changes as much as possible. This means that we should continue to have two Python objects, for node and edge (also called Activity and Exchange in the current code; this should be deprecated). However, we can return specific objects depending on the node type. It seems clear that returning a Database node is different than a Product node, and these returned objects have different pydantic validators, methods, etc.

I would like to try using the classes defined in bw_interface_schemas; in particular, being very explicit about the difference between Process and ProcessWithReferenceProduct would help our users and avoid confusion.
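
One way to return specific objects depending on the node type is a small registry keyed on `node_type`. The sketch below uses plain dataclasses as stand-ins; the class names, the `register` decorator, and `node_from_row` are all hypothetical, and the real implementation would presumably build on the bw_interface_schemas classes and pydantic validation instead.

```python
from dataclasses import dataclass, field
from typing import Any

NODE_CLASSES: dict[str, type] = {}


def register(node_type: str):
    """Class decorator mapping a node_type string to a Python class."""
    def decorator(cls):
        NODE_CLASSES[node_type] = cls
        return cls
    return decorator


@dataclass
class Node:
    """Generic fallback for node types without a dedicated class."""
    persistent_id: int
    payload: dict[str, Any] = field(default_factory=dict)


@register("database")
class DatabaseNode(Node):
    def label(self) -> str:
        return self.payload["name"]


@register("product")
class ProductNode(Node):
    def unit(self) -> str:
        return self.payload["unit"]


def node_from_row(persistent_id: int, node_type: str, payload: dict[str, Any]) -> Node:
    """Pick the class from node_type, falling back to the generic Node."""
    cls = NODE_CLASSES.get(node_type, Node)
    return cls(persistent_id, payload)


db = node_from_row(1, "database", {"name": "Foo"})
steel = node_from_row(2, "product", {"name": "steel", "unit": "kg", "database": "Foo"})
other = node_from_row(3, "something_else", {})
```

Every returned object is still a `Node`, which keeps the two-object (node and edge) API intact while allowing type-specific methods and validators.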

Tasks

The following basic steps are needed before we can evaluate whether to go forward with the complete refactor:

[x] Update bw_processing to allow for 64-bit identifiers (@cmutel can do this)
[ ] Create test implementation of the given schema in a branch with tests against both Postgres and SQLite
[ ] Test implementation should include Database (but not locations or LCIA) in the graph, and maintain the current Database and databases API.

Edit history

v1: Initial proposal.
v2: Clarified that payload is complete data, not a diff; added idea of detaching branches.

[^1]: We could argue about the labels, but these are good enough, and understandable for our users.

cmutel commented 3 months ago

@selimyoussry Feel free to add comments or questions!

cmutel commented 3 months ago

Our current Database.process function assumes that we can create separate iterators for technosphere and biosphere matrices; this should work with peewee JSON support. However, the implicit production should go away. We can automatically add a production exchange with amount 1 when creating a ProcessWithReferenceProduct, but this doesn't work with Process (we don't know what the reference product would be).

cmutel commented 3 months ago

bw_processing 1.0 has been released; it now defaults to 64-bit indices.

will7200 commented 2 months ago

@cmutel

After looking into this, I am fighting peewee considerably trying to get a unified schema between Postgres and SQLite. Peewee doesn't support a good way of overloading function types based on the database like sqlalchemy does. Nor does it support a good way to implement a column with differing functionality based on the database.

Some approaches I can currently take:

Implement both schemas separately. The way the schemas get consumed will have to be changed so that they can be switched.
Implement a single schema, but build a custom field that bridges JSON support between them by using the database that is attached to the current model. This has the benefit that we only define the schema once.

If you have any other ideas that might be worth exploring let me know. I have more of a sqlalchemy background than with peewee, so my mind keeps telling me to make the switch already lol.

Also, it looks like jsonb (blob) support has been available in SQLite since version 3.45.0, but since we support older Python versions I'll stick with text for now.

cmutel commented 2 months ago

@will7200 Trying for a unified schema won't work with peewee - we have come to that conclusion as well.

I am open to switching to sqlalchemy. It is a much more reasonable long-term choice. We actually want to reduce complexity in the database with this change, so could end up removing code instead of rewriting it, at least sometimes.

I also think it is fine to only target 3.11+ (I can't find a table linking SQLite/Python release versions). Even using 3.12+ would be fine. I think we can also prioritize Postgres development speed and elegance for now.

We definitely want to move towards the database being accessible via API. Please don't start this now, but keep it in mind. Will almost certainly be FastAPI and Pydantic, and Pydantic will be used for validation of the json payloads. Indeed, we could already enable validation when creating new objects (though maybe skip it for bulk inserts).

brightway-lca / brightway2-data

Database event sourcing refactor #181
