Closed zaneselvans closed 2 years ago
There's something I'm not understanding about how to import / reload the pudl.metadata module. If I change parts of the RESOURCE_METADATA or FIELD_METADATA constructs, the normal %autoreload 2 magic in my notebook doesn't pick up the changes, so I'm having to reload the modules every time manually with importlib.reload()
Well, %autoreload is witchcraft and comes with many caveats. I wouldn't rely on it, nor design code around it. The documentation states:
"Functions and classes imported via ‘from xxx import foo’ are upgraded to new versions when ‘xxx’ is reloaded."
But RESOURCE_METADATA is just a dict. You can see the limitations play out in this example:
# module.py
x = 1

def f():
    return x

class Class:
    def __init__(self, x):
        self.x = x

c = Class(x)

# Notebook session
%load_ext autoreload
%autoreload 2
from module import x, f, c
# Now edit module.py so that x = 2, then evaluate:
x    # 1
c.x  # 1
f()  # 2
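A minimal sketch of the manual importlib.reload() workaround, assuming a throwaway module named demo_metadata (hypothetical, standing in for pudl.metadata) written to a temporary directory. It suggests that a name bound with "from ... import" keeps pointing at the old dict after a reload, while attribute access through the module object sees the new one.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Skip .pyc files so an edit made within the same second is not masked
# by a cached bytecode file of identical size and timestamp.
sys.dont_write_bytecode = True

# Hypothetical stand-in for pudl.metadata, written to a temp directory.
tmpdir = tempfile.mkdtemp()
module_path = Path(tmpdir) / "demo_metadata.py"
module_path.write_text("FIELD_METADATA = {'plant_id': {'type': 'integer'}}\n")
sys.path.insert(0, tmpdir)
importlib.invalidate_caches()

import demo_metadata
from demo_metadata import FIELD_METADATA as stale_copy

# Simulate editing the module on disk, then reload it by hand.
module_path.write_text("FIELD_METADATA = {'plant_id': {'type': 'string'}}\n")
importlib.reload(demo_metadata)

# The name imported earlier still refers to the old dict object.
print(stale_copy["plant_id"]["type"])
# Attribute access through the module picks up the reloaded dict.
print(demo_metadata.FIELD_METADATA["plant_id"]["type"])
```

In a notebook, this points toward importing the module itself and reading module.FIELD_METADATA after each reload, rather than importing the dict by name.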
A fair number of columns with ENUM constraints contain NA/None/NaN values. What's the right way to specify them in the enumeration? At least in the string ENUMs adding the empty string "" seems to work (and I did this to several of them) but I don't know if that's how we're supposed to do it. Would making the columns with ENUM constraints explicitly nullable do the same thing?
Ack, please remove "" from enums! All columns are nullable (aka required=False) by default. So field.constraints.enum=[list, of, non-missing, values] is all you need. What do you mean by "work" in this context?
I was getting constraint violation errors on categorical columns which apparently contained the empty string, alongside the enumerated values. So the way this is supposed to work is that these fields would contain "real" NA values by the time they're being inserted into the DB? Which I guess means pd.NA in this case since the enumerations are string values? Maybe it was just the wrong kind of Null. Like None rather than pd.NA or something. I enumerated all the columns that currently depend on ENUMs that have "" in them in #1210 so we can hunt them down and fix the data processing to give the right output.
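To illustrate why "" trips the constraint while a real null does not, here is a small sketch using SQLite directly. The table name, column name, and the CHECK ... IN (...) form are assumptions for illustration, not necessarily how PUDL emits its ENUM constraints.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: a nullable text column restricted to two values.
conn.execute(
    "CREATE TABLE plants (fuel_type TEXT CHECK (fuel_type IN ('coal', 'gas')))"
)

conn.execute("INSERT INTO plants VALUES (?)", ("coal",))
# None is stored as NULL; the CHECK expression evaluates to NULL,
# which SQLite does not treat as a violation.
conn.execute("INSERT INTO plants VALUES (?)", (None,))

# The empty string is an ordinary value that is not in the list.
try:
    conn.execute("INSERT INTO plants VALUES (?)", ("",))
    empty_string_rejected = False
except sqlite3.IntegrityError:
    empty_string_rejected = True

rows = conn.execute("SELECT fuel_type FROM plants ORDER BY rowid").fetchall()
print(empty_string_rejected, rows)
```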
On the %autoreload -- it's just for working conveniently in notebooks during development, and being able to bounce back and forth between the module and testing stuff interactively. It's definitely not in the modules anywhere.
So the way this is supposed to work is that these fields would contain "real" NA values by the time they're being inserted into the DB?
In my opinion, missing values should be cast to null as early as possible. There shouldn't be any "" (or other value) standing in for null by harvest, and certainly not by export. I commented in #1210 regarding a few other placeholder values that should be replaced with null.
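One way this early cast might look in pandas, as a sketch only: the sample values are made up, and the choice of the "string" dtype is an assumption rather than an established PUDL convention.

```python
import pandas as pd

# Made-up raw column in which "" stands in for a missing value.
raw = pd.Series(["coal", "", "gas", None])

# Replace the placeholder with a real null, then cast to the nullable
# string dtype so every missing entry is represented by pd.NA.
cleaned = raw.mask(raw == "", pd.NA).astype("string")

print(cleaned.isna().tolist())
```

Doing this in the transform step means nothing downstream has to know which placeholder each source used.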
p.s. pd.CategoricalDtype uses np.nan, not pd.NA (see https://github.com/pandas-dev/pandas/issues/36586). If using pd.NA throughout pudl is desired, then you could use pd.StringDtype instead.
import pandas as pd
s = pd.Series(['a', pd.NA, 'c']).astype('category')
s[1]
# nan
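For comparison, a short sketch of the same values stored with the nullable string dtype suggested above. The variable names are illustrative only.

```python
import pandas as pd

values = ["a", pd.NA, "c"]

# Categorical storage: the missing entry comes back as a float NaN.
as_category = pd.Series(values).astype("category")
# Nullable string storage: the missing entry stays pd.NA.
as_string = pd.Series(values, dtype="string")

print(repr(as_category[1]))
print(repr(as_string[1]))
print(as_string[1] is pd.NA)
```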
Description
Change the PUDL data processing pipeline to write many of its outputs directly to a database, rather than a bundle of tabular data packages made up of CSV and JSON files. For cost-effective storage of data at rest and easy bulk distribution and archiving, the final output we are targeting is a SQLite database.
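As a rough sketch of what writing a table straight to SQLite can look like with df.to_sql(): the table and column names below are made up, and an in-memory database stands in for the real output file.

```python
import sqlite3

import pandas as pd

# Made-up stand-in for a transformed PUDL table.
plants = pd.DataFrame({"plant_id": [1, 2], "fuel_type": ["coal", None]})

conn = sqlite3.connect(":memory:")
# Write the dataframe as a table; missing values are stored as NULL.
plants.to_sql("plants", conn, index=False, if_exists="replace")

n_rows = conn.execute("SELECT COUNT(*) FROM plants").fetchone()[0]
n_null = conn.execute(
    "SELECT COUNT(*) FROM plants WHERE fuel_type IS NULL"
).fetchone()[0]

# Later steps can pull the table back out of the database.
round_trip = pd.read_sql("SELECT * FROM plants", conn)
print(n_rows, n_null, list(round_trip.columns))
```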
Motivation
At a high level, this epic is about making it easier for us to do frequent, comprehensive data releases, and making it easier for our users to access the data that we publish.
goodtables-pandas-py.
In Scope
How do we know when we are done?
pudl.load.csv and pudl.load.metadata)
datapkg_to_sqlite script.
pudl-examples notebooks work again.
Out of Scope
Notes
Tasks / Progress
These are things that I did before migrating the set of issues/tasks into this epic container:
df.to_sql() in the last 4 years. Memory usage is ~6GB peak (far less than other parts of the ETL like the ferc1 plant ID assignments). If this gets to be too large with future datasets, or when we go toward a distributed ETL process (see #1190) we can use @rousik's serialized dataframe collections.
pudl.etl and pudl.load.sqlite modules. The top-level ETL module now does all the SQLite data processing first, and then moves on to the Parquet data, and has the SQLite DB available to pull data out of.
pudl_etl script to use the new arrangement appropriately.
Other Questions / Related Issues
#1194 In SQLAlchemy 1.4, it looks like there's a different behavior for Integer primary keys, where they aren't automatically set to nullable=False, which is causing one of the doctests in pudl.metadata.classes.Resource to fail.
#851 and #852 The new metadata reflected the true natural primary keys, but we haven't fixed the transform functions to ensure that those keys are unique. I commented out the primaryKey definitions for these tables in pudl.metadata.resources for now to get an automatic pseudo key, which is what we've been doing.
#1210 A fair number of columns with ENUM constraints contain NA/None/NaN values. What's the right way to specify them in the enumeration? At least in the string ENUMs adding the empty string "" seems to work (and I did this to several of them) but that's not how it's supposed to work. We need to make sure we have the right NA value in these columns where data is missing so that the ENUM constraint doesn't fail.
There's something I'm not understanding about how to import / reload the pudl.metadata module. If I change parts of the RESOURCE_METADATA or FIELD_METADATA constructs, the normal %autoreload 2 magic in my notebook doesn't pick up the changes, so I'm having to reload the modules every time manually with importlib.reload()
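On the primary key question in #851 and #852 above, a check along these lines could flag tables whose natural keys are not yet unique before the primaryKey definitions are switched back on. This is only a sketch: find_duplicate_keys is a hypothetical helper, not part of pudl, and the table is made up.

```python
import pandas as pd


def find_duplicate_keys(df, primary_key):
    """Return every row whose primary key values occur more than once."""
    return df[df.duplicated(subset=primary_key, keep=False)]


# Made-up table in which (plant_id, report_year) should be unique but is not.
plants = pd.DataFrame(
    {
        "plant_id": [1, 1, 2],
        "report_year": [2019, 2019, 2019],
        "capacity_mw": [100.0, 105.0, 50.0],
    }
)

dupes = find_duplicate_keys(plants, ["plant_id", "report_year"])
print(len(dupes))
```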