zaneselvans commented 2 years ago

Description

Change the PUDL data processing pipeline to write many of its outputs directly to a database, rather than a bundle of tabular data packages made up of CSV and JSON files. For cost-effective storage of data at rest and easy bulk distribution and archiving, the final output we are targeting is a SQLite database.

Motivation

At a high level, this epic is about making it easier for us to do frequent, comprehensive data releases, and making it easier for our users to access the data that we publish.

Simplify and speed up the data processing pipeline by reducing the number of steps it takes to go from raw inputs to usable relational data.
Ensure that our outputs are easy to archive and distribute in bulk, while preserving the full richness of their internal relationships, constraints, and data types.
Simplify our codebase and make it more maintainable. A lot of the datapackage code is fragile and not reusable.
Provide a primary output that is easier to load into other relational database management systems.
Enable the use of the efficient, built-in data validation and constraint checking tools that are part of most relational databases, rather than needing to separately validate the data before it is loaded (currently done using goodtables-pandas-py).
Enable the future integration of denormalized database views directly into the data products we distribute, avoiding the need to provide that functionality via a software layer, and making the data easier to use with a variety of tools.
Enable future concurrent output from distributed ETL processes into a database like PostgreSQL or BigQuery, which is subsequently exported to SQLite for storage at rest and bulk distribution.

In Scope

How do we know when we are done?

The ETL pipeline outputs to SQLite directly (and Parquet, for EPA CEMS)
Primary and foreign key relationships are being validated by the database during or immediately after output.
Data types are being validated by the database during or immediately after output.
Value constraints (min/max values, allowable categorical values, etc.) are being validated by the database during or immediately after output.
All the code related to data package output and validation has been removed from the codebase (mostly this is in pudl.load.csv and pudl.load.metadata)
We no longer need the datapkg_to_sqlite script.
The unit, integration, and data validation tests all pass again.
The pudl-examples notebooks work again.
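As a rough illustration of the database-side checks listed above, here is a minimal sketch using only the standard library `sqlite3` module. The `plants` and `generators` tables and their columns are hypothetical and are not taken from the PUDL schema. Note that SQLite only enforces foreign keys when `PRAGMA foreign_keys` is turned on for the connection, and that it does not strictly enforce column types by default, so type validation would need something extra (for example `CHECK (typeof(...))` clauses or `STRICT` tables in newer SQLite versions).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite leaves foreign key enforcement off unless it is enabled per connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE plants (
        plant_id INTEGER PRIMARY KEY,
        fuel_type TEXT CHECK (fuel_type IN ('coal', 'gas', 'wind'))
    );
    CREATE TABLE generators (
        plant_id INTEGER NOT NULL REFERENCES plants (plant_id),
        generator_id TEXT NOT NULL,
        capacity_mw REAL CHECK (capacity_mw >= 0),
        PRIMARY KEY (plant_id, generator_id)
    );
    """
)
conn.execute("INSERT INTO plants VALUES (1, 'coal')")
conn.execute("INSERT INTO generators VALUES (1, 'G1', 50.0)")


def rejected(sql):
    """Return True if the database refuses the statement."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False


violations = {
    "duplicate primary key": rejected("INSERT INTO plants VALUES (1, 'gas')"),
    "unknown foreign key": rejected("INSERT INTO generators VALUES (99, 'G1', 10.0)"),
    "value outside enum": rejected("INSERT INTO plants VALUES (2, 'lignite')"),
    "value below minimum": rejected("INSERT INTO generators VALUES (1, 'G2', -5.0)"),
}
```

Each entry in `violations` records whether the database itself refused the bad row, which is the behavior the checklist relies on.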

Out of Scope

Integration of additional data validation and constraint checks beyond what we are currently doing.
Using databases for storage of interim outputs within the ETL.
Making the ETL work on distributed systems (writing to a non-local database resource)
Integrating denormalized tables or analytical outputs into the database as views or new tables

Notes

Tasks / Progress

These are things that I did before migrating the set of issues/tasks into this epic container:

[x] FERC 1 and EIA data load, but foreign key constraints are failing. Without foreign key constraint checking enabled, all of the data for ferc1, eia860, eia860m, and eia923 can be loaded directly into SQLite. Once the dataframes have been generated it only takes about 3 minutes to load them! Clearly there have been some major improvements to df.to_sql() in the last 4 years. Memory usage peaks at ~6GB (far less than other parts of the ETL, like the ferc1 plant ID assignments). If this gets to be too large with future datasets, or when we move toward a distributed ETL process (see #1190), we can use @rousik's serialized dataframe collections.
[x] Direct SQLite loading has been integrated into pudl.etl and pudl.load.sqlite modules. The top-level ETL module now does all the SQLite data processing first, and then moves on to the Parquet data, and has the SQLite DB available to pull data out of.
[x] Update the pudl_etl script to use the new arrangement appropriately.
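One way to find out which rows break the foreign key constraints after a bulk load with checking disabled is SQLite's `PRAGMA foreign_key_check`. The sketch below uses the standard library `sqlite3` module with plain `executemany()` calls standing in for `df.to_sql()`; the table names and rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE plants (plant_id INTEGER PRIMARY KEY);
    CREATE TABLE generators (
        generator_id TEXT PRIMARY KEY,
        plant_id INTEGER REFERENCES plants (plant_id)
    );
    """
)

# Foreign key enforcement is off by default, so the child table can be
# loaded before its parent and orphaned rows are accepted silently.
conn.executemany(
    "INSERT INTO generators VALUES (?, ?)",
    [("G1", 1), ("G2", 2), ("G3", 99)],
)
conn.executemany("INSERT INTO plants VALUES (?)", [(1,), (2,)])
conn.commit()

# Each result row is (child table, rowid, parent table, constraint index).
orphans = conn.execute("PRAGMA foreign_key_check").fetchall()

# Look up the offending records by rowid so they can be fixed upstream.
bad_generators = [
    conn.execute(
        "SELECT generator_id FROM generators WHERE rowid = ?", (rowid,)
    ).fetchone()[0]
    for _table, rowid, _parent, _fk_index in orphans
]
```

Running the check after the load keeps the fast unchecked inserts while still reporting every violating row in one pass.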

Other Questions / Related Issues

#1194 In SQLAlchemy 1.4, it looks like there's a different behavior for Integer primary keys, where they aren't automatically set to `nullable=False`, which is causing one of the doctests in `pudl.metadata.classes.Resource` to fail.
#851 and #852 The new metadata reflects the true natural primary keys, but we haven't fixed the transform functions to ensure that those keys are unique. I commented out the primaryKey definitions for these tables in pudl.metadata.resources for now to get an automatic pseudo key, which is what we've been doing.
#1210 A fair number of columns with ENUM constraints contain NA/None/NaN values. What's the right way to specify them in the enumeration? At least in the string ENUMs, adding the empty string "" seems to work (and I did this to several of them), but that's not how it's supposed to work. We need to make sure we have the right NA value in these columns where data is missing so that the ENUM constraint doesn't fail.
There's something I'm not understanding about how to import / reload the pudl.metadata module. If I change parts of the RESOURCE_METADATA or FIELD_METADATA constructs, the normal %autoreload 2 magic in my notebook doesn't pick up the changes, so I'm having to reload the modules every time manually with importlib.reload()

ezwelty commented 2 years ago

There's something I'm not understanding about how to import / reload the pudl.metadata module. If I change parts of the RESOURCE_METADATA or FIELD_METADATA constructs, the normal %autoreload 2 magic in my notebook doesn't pick up the changes, so I'm having to reload the modules every time manually with importlib.reload()

Well, %autoreload is witchcraft and comes with many caveats. I wouldn't rely on it, nor design code around it. The documentation states:

"Functions and classes imported via ‘from xxx import foo’ are upgraded to new versions when ‘xxx’ is reloaded."

But RESOURCE_METADATA is just a dict. You can see the limitations play out in this example:

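# module.py
x = 1

def f():
    return x

class Class:
    def __init__(self, x):
        self.x = x

c = Class(x)

# In a notebook
%load_ext autoreload
%autoreload 2
from module import x, f, Class, c
# Now edit module.py so that x = 2, then evaluate:
x  # 1 (the imported name still points at the old value)
c.x  # 1 (the existing instance keeps its old attribute)
f()  # 2 (the function is upgraded and reads the reloaded module's x)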
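For comparison, here is a self-contained sketch of the manual `importlib.reload()` route outside of a notebook. It writes a throwaway module named `demo_metadata` (a hypothetical stand-in, not part of pudl) to a temporary directory, edits it on disk, and reloads it. A dict imported with `from ... import` keeps pointing at the old object, while attribute access through the module sees the new one.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Skip .pyc files so the quick rewrite below is always read from source.
sys.dont_write_bytecode = True

tmpdir = tempfile.mkdtemp()
module_path = Path(tmpdir) / "demo_metadata.py"
module_path.write_text("RESOURCE_METADATA = {'plants': 1}\n")
sys.path.insert(0, tmpdir)
importlib.invalidate_caches()

import demo_metadata
from demo_metadata import RESOURCE_METADATA as imported_copy

# Simulate editing the metadata in the module's source file.
module_path.write_text("RESOURCE_METADATA = {'plants': 2, 'generators': 3}\n")
importlib.reload(demo_metadata)

stale = imported_copy  # still the dict object created by the first import
fresh = demo_metadata.RESOURCE_METADATA  # the dict created by the reload
```

Accessing the dict as `demo_metadata.RESOURCE_METADATA` after each reload, rather than holding on to a `from ... import` copy, avoids working with stale metadata.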

A fair number of columns with ENUM constraints contain NA/None/NaN values. What's the right way to specify them in the enumeration? At least in the string ENUMs adding the empty string "" seems to work (and I did this to several of them) but I don't know if that's how we're supposed to do it. Would making the columns with ENUM constraints explicitly nullable do the same thing?

Ack, please remove "" from enums! All columns are nullable (aka required=False) by default. So field.constraints.enum=[list, of, non-missing, values] is all you need. What do you mean by "work" in this context?
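To make the nullable enum behavior concrete at the database level, here is a small sketch with the standard library `sqlite3` module. The `boilers` table is hypothetical, and this shows SQLite's own `CHECK` semantics rather than anything specific to the PUDL metadata classes. A `CHECK` constraint only fails when its expression is false, and `NULL IN (...)` evaluates to NULL, so a real NULL passes while the empty string is rejected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE boilers (fuel_type TEXT CHECK (fuel_type IN ('coal', 'gas')))"
)

# A real NULL is accepted: the CHECK expression evaluates to NULL, not false.
conn.execute("INSERT INTO boilers VALUES (?)", (None,))

# The empty string is an ordinary value that is not in the enumeration.
try:
    conn.execute("INSERT INTO boilers VALUES (?)", ("",))
    empty_string_rejected = False
except sqlite3.IntegrityError:
    empty_string_rejected = True

null_rows = conn.execute(
    "SELECT COUNT(*) FROM boilers WHERE fuel_type IS NULL"
).fetchone()[0]
```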

zaneselvans commented 2 years ago

I was getting constraint violation errors on categorical columns which apparently contained the empty string, alongside the enumerated values. So the way this is supposed to work is that these fields would contain "real" NA values by the time they're being inserted into the DB? Which I guess means pd.NA in this case since the enumerations are string values? Maybe it was just the wrong kind of Null. Like None rather than pd.NA or something. I enumerated all the columns that currently depend on ENUMs that have "" in them in #1210 so we can hunt them down and fix the data processing to give the right output.

On the %autoreload -- it's just for working in notebooks in development conveniently and being able to bounce back and forth between the module and testing stuff interactively. It's definitely not in the modules anywhere.

ezwelty commented 2 years ago

So the way this is supposed to work is that these fields would contain "real" NA values by the time they're being inserted into the DB?

In my opinion, missing values should be cast to null as early as possible. There shouldn't be any "" (or other value) standing in for null by harvest, and certainly not by export. I commented in #1210 regarding a few other placeholder values that should be replaced with null.

p.s. pd.CategoricalDtype uses np.nan, not pd.NA (see https://github.com/pandas-dev/pandas/issues/36586). If using pd.NA throughout pudl is desired, then you could use pd.StringDtype instead.

import pandas as pd
s = pd.Series(['a', pd.NA, 'c']).astype('category')
s[1]
# nan
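A minimal sketch of casting placeholders to null early and landing on pd.StringDtype might look like the following. The series contents are made up, and the empty string is assumed to be the only placeholder; other stand-in values would need the same treatment.

```python
import pandas as pd

# Hypothetical raw column with both "" and None standing in for missing data.
raw = pd.Series(["coal", "", None, "gas"])

# Replace the empty-string placeholder with a real missing value, then cast
# to the nullable string dtype so every missing entry is represented the same way.
cleaned = raw.mask(raw == "", pd.NA).astype("string")

missing = cleaned.isna().tolist()
```

After this step both the former empty string and the former None count as missing, so an enum constraint only ever sees the real categories.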

catalyst-cooperative / pudl

Direct SQLite & Parquet Outputs #1176
