USF-IMARS / imars-etl

:cloud: Tools for `extract` and `load` for IMaRS ETL (Extract, Transform, Load) operations
0 stars 0 forks source link

implement real ORM #7

Open 7yl4r opened 6 years ago

7yl4r commented 6 years ago

It's about time we get a real "Object Relational Mapper" in place. Probably SQLAlchemy.

tableschema looks pretty interesting too though...

7yl4r commented 6 years ago

It's going to be weird to figure out how this works alongside puppet. Also it may make sense to define the engine in an entirely different project. :thinking:

7yl4r commented 6 years ago

oh, okay... tableschema is a generalized standard that can leverage sqlalchemy ref:

import datapackage

# create DataPackage from TableSchema || Tablular Data Package json :
url = 'https://raw.githubusercontent.com/frictionlessdata/example-data-packages/master/periodic-table/datapackage.json'
dp = datapackage.Package(url)

# create the database connection (using SQLAlchemy)
from sqlalchemy import create_engine

# Load and save table to SQL
engine = create_engine('sqlite:///periodic-table-datapackage.db')
dp.save(storage='sql', engine=engine)

# test
list(engine.execute('SELECT * from data')))
7yl4r commented 6 years ago

So really we are just going to replace product_metadata.sql in imars_puppet with SQLAlchemy python or TableSchema json. Let's test each out and compare:


MySQL SQL to recreate:

CREATE TABLE area (
    id int(11) NOT NULL AUTO_INCREMENT,
    PRIMARY KEY (id),
    short_name varchar(50) NOT NULL,
    UNIQUE(short_name),
    north decimal(10,0),
);
CREATE TABLE file (
    id int(11) NOT NULL AUTO_INCREMENT,
    PRIMARY KEY (id),

    area_id int(11) NOT NULL,
    INDEX (area_id),
    FOREIGN KEY (area_id) REFERENCES area(id),

    date_time datetime NOT NULL,

    CONSTRAINT product_constr UNIQUE(area_id,date_time)
);

TableSchema JSON:

"name": "imars_product_metadata",
"resources": [
    {
        "name": "area",
        "schema": {
            "fields": [
                {
                    "name": "id",
                    "type": "integer",
                    "constraints": {
                        "required": true,
                        "unique": true
                    }
                },
                {
                    "name": "short_name",
                    "type": "string",
                    "constraints": {
                        "required": true,
                        "unique": true
                    }
                },
                {
                    "//": "NOTE: could use geojson instead",
                    "name": "north",
                    "type": "number"
                }
            ],
            "primaryKey": "id"
        }
    },
    {
        "name": "file",
        "//": "not able to do `CONSTRAINT product_constr UNIQUE(area_id,date_time)`???",
        "schema": {
            "fields": [
                {
                    "name": "id",
                    "type": "integer",
                    "constraints": {
                        "required": true,
                        "unique": true
                    }
                },
                {
                    "name": "area_id",
                    "type": "integer"
                },
                {
                    "name": "date_time",
                    "type": "datetime"
                }
            ],
            "primaryKey": "id",
            "foreignKeys": [
                "fields": "area_id",
                "reference": {
                    "resource": "area",
                    "fields": "id"
                }
            ]
        }
    }
]

SQLAlchemy python

from sqlalchemy import *
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relation, sessionmaker

Base = declarative_base()

class AreaObj(Base):
    __tablename__ = 'area'

    id = Column(Integer, primary_key=True, nullable=False)
    short_name = Column(String(50), nullable=False, unique=True)
    north = Column(Decimal(10,0))

    def __init__(self, short_name=None, id=None):
        self.short_name = short_name
        self.id = id

    def __repr__(self):
        return "AreaObj(%r)" % (self.short_name)

class FileObj(Base):
    __tablename__ = 'file'

    id = Column(Integer, primary_key=True, nullable=False)
    short_name = Column(String(50), nullable=False, unique=True)

    area_id = Column(Integer, ForeignKey('area.id'), nullable=False)
    area = relation("AreaObj", backref='file', lazy=False)

    short_name = Column(DateTime, nullable=False)

    def __init__(self, short_name=None):
        self.name = short_name

    def __repr__(self):
        return "Director(%r)" % (self.short_name)
7yl4r commented 6 years ago

Note that I was not able to figure out CONSTRAINT product_constr UNIQUE(area_id,date_time) in either representation. :thinking:

7yl4r commented 6 years ago

I don't see a tool to go from MySQL to TableSchema automatically, but it is possible. One exists for PostregeSQL.

7yl4r commented 5 years ago

airflow dbapi hooks (implemented in #16) contain the get_sqlalchemy_engine method, which should make this even easier! :+1:

7yl4r commented 5 years ago

Potentially useful resources: