catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
456 stars 106 forks source link

Direct SQLite & Parquet Outputs #1176

Closed zaneselvans closed 2 years ago

zaneselvans commented 2 years ago

Description

Change the PUDL data processing pipeline to write many of its outputs directly to a database, rather than a bundle of tabular data packages made up of CSV and JSON files. For cost-effective storage of data at rest and easy bulk distribution and archiving, the final output we are targeting is a SQLite database.

Motivation

At a high level, this epic is about making it easier for us to do frequent, comprehensive data releases, and making it easier for our users to access the data that we publish.

In Scope

How do we know when we are done?

Out of Scope

Notes

Tasks / Progress

These are things that I did before migrating the set of issues/tasks into this epic container:

Other Questions / Related Issues

ezwelty commented 2 years ago

There's something I'm not understanding about how to import / reload the pudl.metadata module. If I change parts of the RESOURCE_METADATA or FIELD_METADATA constructs, the normal %autoreload 2 magic in my notebook doesn't pick up the changes, so I'm having to reload the modules every time manually with importlib.reload()

Well,%autoreload is witchcraft and comes with many caveats. I wouldn't rely on it, nor design code around it. The documentation states:

Functions and classes imported via ‘from xxx import foo’ are upgraded to new versions when ‘xxx’ is reloaded."

But RESOURCE_METADATA is just a dict. You can see the limitations play out in this example:

x = 1
def f():
  return x
class Class:
  def __init__(self, x):
    self.x = x
c = Class(x)
%load_ext autoreload
%autoreload 2
from module import x, f, class
# Change x = 2
x  # 1
c.x  # 1
f()  # 2

A fair number of columns with ENUM constraints contain NA/None/NaN values. What's the right way to specify them in the enumeration? At least in the string ENUMs adding the empty string "" seems to work (and I did this to several of them) but I don't know if that's how we're supposed to do it. Would making the columns with ENUM constraints explicitly nullable do the same thing?

Ack, please remove "" from enums! All columns are nullable (aka required=False) by default. So field.constraints.enum=[list, of, non-missing, values] is all you need. What do you mean by "work" in this context?

zaneselvans commented 2 years ago

I was getting constraint violation errors on categorical columns which apparently contained the empty string, alongside the enumerated values. So the way this is supposed to work is that these fields would contain "real" NA values by the time they're being inserted into the DB? Which I guess means pd.NA in this case since the enumerations are string values? Maybe it was just the wrong kind of Null. Like None rather than pd.NA or something. I enumerated all the columns that currently depend on ENUMs that have "" in them in #1210 so we can hunt them down and fix the data processing to give the right output.

On the %autoreload -- it's just for working in notebooks in development conveniently and being able to bounce back and forth between the module and testing stuff interactively. It's definitely not in the modules anywhere.

ezwelty commented 2 years ago

So the way this is supposed to work is that these fields would contain "real" NA values by the time they're being inserted into the DB?

In my opinion, missing values should be cast to null as early as possible. There shouldn't be any "" (or other value) standing in for null by harvest, and certainly not by export. I commented in #1210 regarding a few other placeholder values that should be replaced with null.

p.s. pd.CategoricalDtype uses pd.nan, not pd.NA (see https://github.com/pandas-dev/pandas/issues/36586). If using pd.NA throughout pudl is desired, then you could use pd.StringDtype instead.

import pandas as pd
s = pd.Series(['a', pd.NA, 'c']).astype('category')
s[1]
# nan