brightway-lca / brightway2-data

Tools for the management of inventory databases and impact assessment methods. Part of the Brightway LCA framework.
https://docs.brightway.dev/
BSD 3-Clause "New" or "Revised" License
11 stars 24 forks source link

Database event sourcing refactor #181

Open cmutel opened 3 months ago

cmutel commented 3 months ago

Overview

Based on user needs and experience of highly qualified partners, we want to refactor our database schema. We want to add the following:

Table schema

Graphs have nodes and edges [^1] - we can just call them that.

CREATE TABLE node (
    revision_id bigint PRIMARY KEY,
    persistent_id bigint NOT NULL,
    transaction_id bigint NOT NULL,
    branch_id int DEFAULT 0,
    deleted boolean NOT NULL,
    payload jsonb DEFAULT '{}'::jsonb,
    node_type text NOT NULL,
);

CREATE TABLE public.edge (
    revision_id bigint PRIMARY KEY,
    persistent_id bigint NOT NULL,
    source_id bigint NOT NULL,
    target_id bigint NOT NULL,
    transaction_id bigint,
    branch_id int DEFAULT 0,
    deleted boolean NOT NULL,
    payload jsonb DEFAULT '{}'::jsonb,  
    edge_type text NOT NULL,
);

That's hard to understand without context; let's look at the columns.

The transaction table is simple:

CREATE TABLE transaction (
    id bigint PRIMARY KEY,
    transaction_type text NOT NULL,
    message text NULL,
    branch_id bigint DEFAULT 0,
);

This table could change in the future; we're not sure yet what the real user stories are.

CREATE TABLE branch (
    id SERIAL PRIMARY KEY,
    label text NOT NULL,
);

We need to create a default row in this table with the id 0 and label "main" on table creation.

Branches normally are for investigating alternatives, or doing a big update. We should allow branches to be detached; this can be useful if the differences between the branch and "main" grow big enough the the user is describing a different product.

Each payload is the complete data, not a diff. Diffs can be generated on demand if needed.

We want to stick with peewee, at least for now, so generated columns will need to be custom field classes.

This schema is for Postgres, and would required some small changes for SQLite (e.g. peewee doesn't yet support jsonb, just json).

We also have some indices:

CREATE INDEX "edge_omni_index" ON public.edge USING btree (id, branch, transaction_id DESC);
CREATE INDEX "edge_source_index" ON public.edge USING btree (edge_type, source_id DESC);
CREATE INDEX "edge_target_index" ON public.edge USING btree (edge_type, target_id DESC);

These will be adjusted (and indices added for nodes) over time.

Brightway Databases

A Brightway Database is a subgraph with some additional metadata. The label and metadata can be stored as a node; it is TBD if we need to explicitly add edges to indicate the "belongs to" relationship, or if we can follow the current paradigm and give a reference to the Database in each process or product node (current code looks like {'database': 'Foo'}).

ORM

There is a lot of room to play around here, but I would like to reduce the number of breaking changes as much as possible. This means that we should continue to have two Python objects, for node and edge (also called Activity and Exchange in the current code; this should be deprecated). However, we can return specific objects depending on the node type. It seems clear that returning a Database node is different than a Product node, and these returned objects have different pydantic validators, methods, etc.

I would like to try using the classes defined in bw_interface_schemas; in particular, being very explicit about the difference between Process and ProcessWithReferenceProduct would help our users and avoid confusion.

Tasks

The following basic steps are needed before we can evaluate whether to go forward with the complete refactor:

Edit history

[^1]: We could argue about the labels, but these are good enough, and understandable for our users.

cmutel commented 3 months ago

@selimyoussry Feel free to add comments or questions!

cmutel commented 3 months ago

Our current Database.process function assumes that we can create separate iterators for technosphere and biosphere matrices; this should work with peewee JSON support. However, the implicit production should go away. We can automatically add a production exchange with amount 1 when creating a ProcessWithReferenceProduct, but this doesn't work with Process (we don't know what the reference product would be).

cmutel commented 3 months ago

bw_processing 1.0 released which now defaults to 64-bit indices.

will7200 commented 2 months ago

@cmutel

After looking into this i am fighting peewee considerable trying to get a unified schema between Postgres and Sqlite3. Peewee doesn't support a good way for overloading function types based on database like sqlalchemy does. Nor does it support a good way to implement a column with differing functionality based on the database.

Some approaches that I can current go about this:

  1. I can implement both schemas separately. The way the schema get consumed will have to be changed so that they can be switched
  2. Implement a single schema, but build a custom field that bridges JSON support between them by using the database that is attached to the current model. Has the benefit that we only define the schema once.

If you have any other ideas that might be worth exploring let me know. I have more of a sqlalchemy background than with peewee, so my mind keeps telling me to make the switch already lol.

Also looks like jsonb (blob) support in the sqlite has been implemented since version 3.45.0 but since we support older python versions I'll stick with text for now.

cmutel commented 2 months ago

@will7200 Trying for a unified schema won't work with peewee - we have come to that conclusion as well.

I am open to switching to sqlalchemy. It is a much more reasonable long-term choice. We actually want to reduce complexity in the database with this change, so could end up removing code instead of rewriting it, at least sometimes.

I also think it is fine to only target 3.11+ (I can't find a table linking SQLite/Python release versions). Even using 3.12+ would be fine. I think we can also prioritize Postgres development speed and elegance for now.

We definitely want to move towards the database being accessible via API. Please don't start this now, but keep it in mind. Will almost certainly be FastAPI and Pydantic, and Pydantic will be used for validation of the json payloads. Indeed, we could already enable validation when creating new objects (though maybe skip it for bulk inserts).