Sage-Bionetworks / schematic_db

A python package that manages data backends for schematic
0 stars 1 forks source link

Schematic DB

This package is used to manage backends in the FAIR data flow.

It is assumed that manifests are stored in a project at Synapse, and have been validated via Schematic, or the Data Curator App.

Use as a package

Installation

Schematic DB's is published to PyPI whenever there is a new release.

Note that Schematic DB has 3 extra packages [mysql, postgres, synapse]. These serve as the database backend for Schematic DB. You will need one of these to run this package and can be installed like:

pip install schematic_db[synapse]

If using Schematic DB as part of a package you will want setup the pyproject.toml like schematic-db = {version = "x.y.z", extras = ["synapse"]}

Usage

Schema object

The Schema class interacts with the Schematic API. It is used to create a database schema from a schema in json ld form.

It is assumed that you have setup a Synapse project where your manifests exist.

To create the schema:

from schematic_db.schema import Schema, SchemaConfig
config = SchemaConfig(
        schema_url = "https://raw.githubusercontent.com/Sage-Bionetworks/Schematic-DB-Test-Schemas/main/test_schema.jsonld"
    )
schema = Schema(config)

Manifest Store object

The ManifestStore class is used to interact with the asset store in Synapse to download manifests needed to update the database.

It is assumed that you have setup a Synapse project where your manifests exist, and the appropriate fileview.

There are two types of ManifestStore classes. The APIManifestStore class uses the Schematic API. The SynapseManifestStore uses the python Synapse client. This is both experimental and deprecated.

To create the manifest store:

from schematic_db.manifest_store import APIManifestStore, SynapseManifestStore  ManifestStoreConfig
config = ManifestStoreConfig(
        schema_url = "https://raw.githubusercontent.com/Sage-Bionetworks/Schematic-DB-Test-Schemas/main/test_schema.jsonld",
        synapse_project_id = "syn1",
        synapse_asset_view_id = "syn2",
        synapse_input_token = "xxx"
    )
manifest_store = APIManifestStore(config)

Relational database objects

The various database objects are how Schematic DB interacts with the underlying database

For a SQL based database:

from schematic_db.rdb.sql_alchemy_database import SQLConfig
from schematic_db.rdb.mysql import MySQLDatabase
from schematic_db.rdb.postgres import PostgresDatabase

config = SQLConfig(
        username="username",
        password="pass",
        host="host",
        name="test_schema",
    )

database = PostgresDatabase(config)
database = MySQLDatabase(config)

For a Synapse based database:

from schematic_db.rdb.synapse_database import SynapseDatabase

database =  SynapseDatabase(
    project_id="syn1,
    auth_token="xxx,
)

Building the database

The RDBBuilder class is responsible for building the database schema. Assuming you've built a Schema object, and RelationalDatabase object as described above, to build the database:

from schematic_db.rdb_builder.rdb_builder import RDBBuilder

rdb_builder = RDBBuilder(rdb=database, schema=schema)
rdb_builder.build_database()

Updating the database

The RDBUpdater class is responsible for updating the database when there are new or updated manifests. It does NOT update the schema. Assuming you've built a ManifestStore object, and RelationalDatabase object as described above, to update the database:

from schematic_db.rdb_updater.rdb_updater import RDBUpdater

RDBUpdater(rdb=database, manifest_store=manifest_store)
rdb_updater.update_database()

Dropping tables

Assuming that a database object exists created as shown above, tables can be dropped from the database:

database.drop_table(table_name)

This only works if the table in question has no dependencies in the database schema

The above works in Synapse databases, but there is an additional method:

synapse_database.drop_table_and_dependencies(table_name)

This will drop the table in question, and all tables that depend on it in the schema.

Note: When a table is dropped in Synapse, that means that the schema, all rows, and all annotations are deleted, but the entity remains. This is in order to preserve the synapse id and table name.

If you want to delete the entity as well:

synapse_database.delete_table(table_name)

Deleting row data

Assuming that a database object exists created as shown above, tables can be dropped from the database:

database.delete_table_rows(table_name, dataframe)

The dataframe must contain the primary key of the table, the values in the primary key column will determine what rows are deleted.

Note: This is a cascading delete. Any rows that reference rows in the dataframe will also be deleted. Note: This is also recursive. Rows in tables that reference rows in tables that reference the rows in the dataframe and so on, will also be deleted.

Configuring your database schema

In general Schematic DB will use your data model to determine the database schema. If you want to change that behavior you can use a DatabaseConfig object to do this.

Create the object like


from schematic_db.schema import Schema, SchemaConfig
from schematic_db.database_config import DatabaseConfig

data = [
    {
        "name": "object1",
        "primary_key": "att1",
        "foreign_keys": [
            {
                "column_name": "att2",
                "foreign_table_name": "object2",
                "foreign_column_name": "att1",
            },
            {
                "column_name": "att3",
                "foreign_table_name": "object3",
                "foreign_column_name": "att1",
            },
        ],
        "columns": [
            {
                "name": "att2",
                "datatype": "str",
                "required": True,
                "index": True,
            },
            {
                "name": "att3",
                "datatype": "int",
                "required": False,
                "index": False,
            },
        ],
    },
    {"name": "object2", "primary_key": "att1"},
    {"name": "object3", "primary_key": "att1"},
]
database_config = DatabaseConfig(data)

config = SchemaConfig(
        schema_url = "https://raw.githubusercontent.com/Sage-Bionetworks/Schematic-DB-Test-Schemas/main/test_schema.jsonld"
    )
schema = Schema(config=config, database_config=database_config)

Local Development

Setup

Schematic DB can make use of various db(and non db) backends. Currently these include:

If you intend to use either MySQL or Postgres you will need to install those, and can follow the above links.

Clone the schematic_db package repository.

git clone https://github.com/Sage-Bionetworks/schematic_db.git

Install poetry (version 1.2 or later) using either the official installer or pipx. If you have an older installation of Poetry, we recommend uninstalling it first.

Start the virtual environment by doing:

poetry shell

To install all the packages you will have needed to install both MySQL and Postgres(See above)

Install all the dependencies by doing:

poetry install --all-extras

If you just want to install dependencies for one type of database (ie. Synapse):

poetry install -E synapse

Tests

Secrets

Create a secrets file from template:

cp tests/data/example_secrets.yml tests/data/secrets.yml

tests/data/secrets.yml is in the .gitignore file so it can not be commited by git. There are sections for each type of database. Update the section(s) that are relevant to the code you will be working on.

If you want to work on and test code relating to Synapse Databases you will need a project in Synapse to test on. You will need to create an empty Synapse project. The project_id field is the Synapse id for the project. The username and auth_token must be for an account with edit and delete permission for the project.

Formatting tests

Before making a pull request you will want to make sure sure your code is formatted the correct way. Github will run the following tests:

mypy --disallow-untyped-defs .
black .
pylint schematic_db/* tests/*

You will want to make sure any changes passes the above tests before making a pull request.

Unit tests

Before making a pull request you will want to make sure sure your changes haven't broken any existing tests. The github workflow will do:

pytest

Architecture

Documentation

Schematic db uses pdoc to generate documentation from docstrings. This exists here

Classes

Database Schema

The DatabaseSchema class is used to store data for the database in form agnostic to the type. It is cerated by a Schema object, and used by the various relational database classes.

Schema

The Schema class is used to interact with the Schematic API to create a generic representation of the database from the schema. This is stored as a DatabaseSchema object. The Schema is used by the RDBBuilder class.

Manifest Store

The ManifestStore is an Abstract Base Class. This means it is meant to be inherited from and not used. It provides no functionality, just as interface. Inheriting from it implies that child class must implement it's methods with the same signature. The child classes(see blow) interact with the asset store in Synapse to download manifests needed to update the database. The ManifestStore is used by the RDBUpdater class.

The APIManifestStore class uses the Schematic API. The SynapseManifestStore uses the python Synapse client.

Schema Graph

The SchemaGraph class is used to interact with the Schematic API to create the schema in graph form, and is used by several other classes.

Synapse

The Synapse class if a facade for the SynapseClient library. It is used by several classes.

Relational Database

The RelationalDatabase is an Abstract Base Class. This means it is meant to be inherited from and not used. It provides no functionality, just as interface. Inheriting from it implies that child class must implement it's methods with the same signature. This is done so that all the database classes work the same.

The SQLAlchemyDatabase is not meant to be used, only be inherited from. It provides generic SQLALchemy functionality.

The MySQL and Postgres classes both inherit from the SQLAlchemyDatabase and so inherit it's code and the RelationDatabase interface.

The SynapseDatabase inherits from the RelationDatabase, so it implements the RelationDatabase interface. It uses the Synapse class. This is both experimental and deprecated.

Query Store

The QueryStore is an Abstract Base Class. This means it is meant to be inherited from and not used. It provides no functionality, just as interface. Inheriting from it implies that child class must implement it's methods with the same signature. This is done so that all the query store classes work the same. The query store is used by the RDBQueryer class.

The SynapseQueryStore inherits from the QueryStore, so it implements its interface. It uses the Synapse class.

RDB Builder

The RDBBuilder class is responsible for building the database schema. It uses a Schema object to create a DatabaseSchema. This is used to build each table using a RelationalDatabase object.

RDB Updater

The RDBUpdater class is responsible for updating the database when there are new or updated manifests. It does NOT update the schema. It uses a ManifestStore to download the manifests and a RelationalDatabase object to update the database.

RDB Queryer

The RDBQueryer class is used to query the database and store the results. It uses a RelationalDatabase object to query the database, and a QueryStore object to store the result.