Open cmutel opened 3 months ago
@selimyoussry Feel free to add comments or questions!
Our current `Database.process` function assumes that we can create separate iterators for technosphere and biosphere matrices; this should work with peewee JSON support. However, the implicit production should go away. We can automatically add a production exchange with amount 1 when creating a `ProcessWithReferenceProduct`, but this doesn't work with `Process` (we don't know what the reference product would be).
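To make the difference between the two node types concrete, here is a minimal sketch using plain dataclasses. The class layouts, the `Edge` fields, the `edges_on_create` helper, and the choice to point the production edge from the process to itself are all assumptions for illustration, not the actual `bw_interface_schemas` or `bw2data` API.

```python
from dataclasses import dataclass


@dataclass
class Edge:
    source_id: int
    target_id: int
    edge_type: str
    amount: float


@dataclass
class Process:
    id: int
    name: str


@dataclass
class ProcessWithReferenceProduct(Process):
    reference_product: str


def edges_on_create(node):
    """Return the edges to write alongside a newly created node (hypothetical helper)."""
    if isinstance(node, ProcessWithReferenceProduct):
        # The reference product is known, so the production edge can be explicit.
        # Assumption: the edge points from the process to itself.
        return [Edge(node.id, node.id, "production", 1.0)]
    # For a plain Process we cannot know the reference product, so add nothing.
    return []
```

With this shape, creating a `ProcessWithReferenceProduct` yields exactly one production edge with amount 1, while a plain `Process` yields none and the caller has to add production edges explicitly.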
`bw_processing` 1.0 has been released, and it now defaults to 64-bit indices.
@cmutel
After looking into this, I am fighting peewee considerably while trying to get a unified schema between Postgres and SQLite3. Peewee doesn't offer a good way to overload function types based on the database, like `sqlalchemy` does. Nor does it offer a good way to implement a column whose behavior differs by database.
Some approaches I could currently take:
If you have any other ideas that might be worth exploring let me know. I have more of a sqlalchemy background than with peewee, so my mind keeps telling me to make the switch already lol.
Also, it looks like `jsonb` (`blob`) support in SQLite has been implemented since version 3.45.0, but since we support older Python versions I'll stick with `text` for now.
@will7200 Trying for a unified schema won't work with peewee - we have come to that conclusion as well.
I am open to switching to sqlalchemy. It is a much more reasonable long-term choice. We actually want to reduce complexity in the database with this change, so could end up removing code instead of rewriting it, at least sometimes.
I also think it is fine to only target 3.11+ (I can't find a table linking SQLite/Python release versions). Even using 3.12+ would be fine. I think we can also prioritize Postgres development speed and elegance for now.
We definitely want to move towards the database being accessible via API. Please don't start this now, but keep it in mind. Will almost certainly be FastAPI and Pydantic, and Pydantic will be used for validation of the json payloads. Indeed, we could already enable validation when creating new objects (though maybe skip it for bulk inserts).
Overview
Based on user needs and experience of highly qualified partners, we want to refactor our database schema. We want to add the following:
`Database`, location ids, and all the LCIA stuff to a single graph. Projects remain a separate object with their own storage.
Table schema
Graphs have nodes and edges [^1] - we can just call them that.
That's hard to understand without context; let's look at the columns.
`persistent_id` and `revision_id`: Both of these are snowflake ids generated client-side. The `persistent_id` stays the same for the object over all revisions; `revision_id` is changed on every write (and also on a `delete` event). It is also used as the primary key, but not for foreign keys, which need a reference to the object, not to a specific revision of that object. To get the latest version of the data (per branch), we would then need to either load everything and sort client-side (sounds ugly but apparently works in production as the average number of edits per node is low), or do a group by / limit query to get the latest version of each node or edge which hasn't been deleted.
`source_id` and `target_id`: Our graph is directed, though the meaning of edge direction depends on edge type. These two columns both provide data and guarantee referential integrity for edges; each is a foreign key to a node `persistent_id`.
`transaction_id`: Snowflake id and foreign key to the transaction table, which stores metadata about transactions. In most cases there won't be much (i.e. no commit message or similar), but occasionally users will do a larger set of changes and wrap it in a transaction, or merge a branch to "main".
The transaction table is simple:
This table could change in the future; we're not sure yet what the real user stories are.
`branch_id`: Foreign key to the branch table, which gives metadata about branches. We aren't sure about the user stories here either, so we start with something simple:
We need to create a default row in this table with the `id` 0 and label "main" on table creation.
Branches normally are for investigating alternatives, or doing a big update. We should allow branches to be detached; this can be useful if the differences between the branch and "main" grow big enough that the user is describing a different product.
`deleted`: Flag indicating if this object has been deleted. To reverse a change or set of changes, don't use the `deleted` column, but just write the data at its known good state as a new event.
`payload`: The good stuff. Most users won't know anything about the database schema, but will only see this data, and the interface we choose to expose. This stands in contrast to the current schema, which has what are in effect generated columns, but managed at the app instead of database level.
Each `payload` is the complete data, not a diff. Diffs can be generated on demand if needed.
`node_type` and `edge_type`: Generated columns drawn from `payload`, and used only to make indexing work better. Validation is done via pydantic at the app level.
We want to stick with peewee, at least for now, so generated columns will need to be custom field classes.
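As a rough illustration of what a database-level generated column looks like, here is a minimal sketch using SQLite through Python's standard `sqlite3` module rather than peewee or Postgres. The `type` key inside the payload, the table layout, and the index name are assumptions for illustration only; it also assumes an SQLite build with generated columns (3.31+) and the JSON functions available.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE node (
        revision_id INTEGER PRIMARY KEY,
        payload     TEXT NOT NULL,
        -- Derived from the payload by the database, never written by the app.
        node_type   TEXT GENERATED ALWAYS AS (json_extract(payload, '$.type')) VIRTUAL
    )
    """
)
# The generated column exists mainly so that it can be indexed.
conn.execute("CREATE INDEX node_type_idx ON node (node_type)")

payloads = [
    {"type": "process", "name": "steel production"},
    {"type": "product", "name": "steel"},
]
conn.executemany(
    "INSERT INTO node (revision_id, payload) VALUES (?, ?)",
    [(i, json.dumps(p)) for i, p in enumerate(payloads, start=1)],
)

# Filter on the generated column instead of parsing JSON in the app.
product_rows = conn.execute(
    "SELECT revision_id, node_type FROM node WHERE node_type = ?", ("product",)
).fetchall()
```

The app only ever writes `payload`; the database keeps `node_type` in sync, which is the behavior a custom peewee field class would have to declare in the table DDL.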
This schema is for Postgres, and would require some small changes for SQLite (e.g. `peewee` doesn't yet support `jsonb`, just `json`).
We also have some indices:
These will be adjusted (and indices added for nodes) over time.
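The "group by" option for fetching the latest revision of each node could look roughly like the sketch below. It uses an in-memory SQLite database and a cut-down, hypothetical `node` table with small integers standing in for snowflake ids. It assumes that a larger `revision_id` always means a later write, and it only considers rows written on the requested branch; how a branch inherits from "main" is not settled above and is not modelled here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE node (
        revision_id   INTEGER PRIMARY KEY,  -- new id on every write
        persistent_id INTEGER NOT NULL,     -- stable across revisions
        branch_id     INTEGER NOT NULL,
        deleted       INTEGER NOT NULL DEFAULT 0,
        payload       TEXT NOT NULL
    )
    """
)
rows = [
    (1, 100, 0, 0, '{"name": "steel"}'),
    (2, 100, 0, 0, '{"name": "steel, low-alloyed"}'),  # edit of node 100
    (3, 200, 0, 0, '{"name": "coal"}'),
    (4, 200, 0, 1, '{"name": "coal"}'),                # delete event for node 200
    (5, 100, 1, 0, '{"name": "steel, branch edit"}'),  # edit on another branch
]
conn.executemany("INSERT INTO node VALUES (?, ?, ?, ?, ?)", rows)

LATEST_SQL = """
SELECT n.persistent_id, n.revision_id, n.payload
FROM node AS n
JOIN (
    SELECT MAX(revision_id) AS revision_id
    FROM node
    WHERE branch_id = :branch
    GROUP BY persistent_id
) AS latest ON n.revision_id = latest.revision_id
WHERE n.deleted = 0
ORDER BY n.persistent_id
"""


def latest_nodes(connection, branch_id):
    """Return the newest non-deleted revision of each node on one branch."""
    return connection.execute(LATEST_SQL, {"branch": branch_id}).fetchall()
```

Note that the `deleted` filter is applied after picking the newest revision, so a node whose latest event is a delete disappears from the result instead of falling back to an older revision.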
Brightway `Databases`
A Brightway `Database` is a subgraph with some additional metadata. The label and metadata can be stored as a `node`; it is TBD if we need to explicitly add edges to indicate the "belongs to" relationship, or if we can follow the current paradigm and give a reference to the `Database` in each process or product node (current code looks like `{'database': 'Foo'}`).
ORM
There is a lot of room to play around here, but I would like to reduce the number of breaking changes as much as possible. This means that we should continue to have two Python objects, for `node` and `edge` (also called `Activity` and `Exchange` in the current code; this should be deprecated). However, we can return specific objects depending on the node type. It seems clear that returning a `Database` node is different than a `Product` node, and these returned objects have different pydantic validators, methods, etc.
I would like to try using the classes defined in bw_interface_schemas; in particular, being very explicit about the difference between `Process` and `ProcessWithReferenceProduct` would help our users and avoid confusion.
Tasks
The following basic steps are needed before we can evaluate whether to go forward with the complete refactor:
`Database` (but not locations or LCIA) in the graph, and maintain the current `Database` and `databases` API.
Edit history
`payload` is complete data, not a diff; added idea of detaching branches.
[^1]: We could argue about the labels, but these are good enough, and understandable for our users.