catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
456 stars 106 forks source link

Some tables specified in ETL settings are not being processed #1203

Closed zaneselvans closed 2 years ago

zaneselvans commented 2 years ago

PUDL DB tables which lack a natural primary key, and thus have no primary key declared in the new metadata structures, end up not having an implicit primary key either when they are created in SQLite. SQLite's normal behavior for a table with no declared primary key is to create a column called rowid which is an auto-incrementing integer pseudokey. For some reason that's not happening. I notice that in the table metadata for tables with no declared primary key there's still an explicit PrimaryKey() (Null args):

from pudl.metadata.classes import Resource, Package
from pudl.metadata.resources import RESOURCE_METADATA

check_table = "boiler_fuel_eia923"
pudl_meta = Package(name="pudl", resources=[Resource.from_id(r) for r in RESOURCE_METADATA]).to_sql()
check_meta = pudl_meta.tables[check_table]
[x for x in check_meta.constraints if isinstance(x, sa.sql.schema.PrimaryKeyConstraint)]
# [PrimaryKeyConstraint()]

When data is inserted into them it seems to vanish, so several tables end up empty after the ETL. There are several different observed behaviors, including:

Has a real PrimaryKeyConstraint, and the table is full (as expected 😀):

Empty PrimaryKeyConstraint(), but the table is full (weird! 👽):

Empty PrimaryKeyConstraint(), and no data (expected but... 😭):

To see whether a table has a rowid I've just been doing:

SELECT rowid FROM table_name LIMIT 10;

So it seems like we need to make all of the tables that lack an explicit primary key behave like boiler_fuel_eia923. Not seeing what's different about it immediately though. In particular it seems identical to generation_fuel_eia923 in that I commented out their explicit natural primary keys in the pudl.metadata.resources module because we need to update the transform step to make them unique (See: #851 and #852) so I assume the issue is not in the RESOURCE_METADATA

Looking at the definition of Resource.to_sql() I'm also not seeing what would be different between these two resources. @ezwelty is there anything that's obvious to you here?

zaneselvans commented 2 years ago

Looking at the schemata for generation_fuel_eia923 and boiler_fuel_eia923 using

sqlite-utils schema pudl.sqlite <table>

I really don't see a difference with respect to how their primary keys are set up... i.e. there are none, so they should both be getting rowid columns, but only boiler_fuel_eia923 is based on my SELECT statements.

CREATE TABLE generation_fuel_eia923 (
    plant_id_eia INTEGER, 
    report_date DATE, 
    nuclear_unit_id INTEGER, 
    fuel_type TEXT, 
    fuel_type_code_pudl TEXT, 
    fuel_type_code_aer TEXT, 
    prime_mover_code TEXT, 
    fuel_consumed_units FLOAT, 
    fuel_consumed_for_electricity_units FLOAT, 
    fuel_mmbtu_per_unit FLOAT, 
    fuel_consumed_mmbtu FLOAT, 
    fuel_consumed_for_electricity_mmbtu FLOAT, 
    net_generation_mwh FLOAT, 
    FOREIGN KEY(fuel_type_code_aer) REFERENCES fuel_type_aer_eia923 (abbr), 
    FOREIGN KEY(fuel_type) REFERENCES fuel_type_eia923 (abbr), 
    FOREIGN KEY(plant_id_eia, report_date) REFERENCES plants_eia860 (plant_id_eia, report_date), 
    FOREIGN KEY(prime_mover_code) REFERENCES prime_movers_eia923 (abbr)
)
CREATE TABLE boiler_fuel_eia923 (
    plant_id_eia INTEGER, 
    boiler_id TEXT, 
    fuel_type_code TEXT, 
    fuel_type_code_pudl TEXT, 
    report_date DATE, 
    fuel_consumed_units FLOAT, 
    fuel_mmbtu_per_unit FLOAT, 
    sulfur_content_pct FLOAT, 
    ash_content_pct FLOAT, 
    FOREIGN KEY(plant_id_eia, boiler_id) REFERENCES boilers_entity_eia (plant_id_eia, boiler_id), 
    FOREIGN KEY(fuel_type_code) REFERENCES fuel_type_eia923 (abbr), 
    FOREIGN KEY(plant_id_eia, report_date) REFERENCES plants_eia860 (plant_id_eia, report_date)
)
zaneselvans commented 2 years ago

Doing something similar but pulling from the SQLAlchemy MetaData object that was used to create the database, boy they sure do look the same. Why is only one of them getting a rowid?

from pudl.metadata.classes import Resource, Package
from pudl.metadata.resources import RESOURCE_METADATA

check_table = "generation_fuel_eia923"
pudl_meta = Package(name="pudl", resources=[Resource.from_id(r) for r in RESOURCE_METADATA]).to_sql()
check_meta = pudl_meta.tables[check_table]
print(sa.schema.CreateTable(pudl_meta.tables["generation_fuel_eia923"]).compile(pudl_engine))

CREATE TABLE generation_fuel_eia923 (
    plant_id_eia INTEGER, 
    report_date DATE, 
    nuclear_unit_id INTEGER, 
    fuel_type TEXT, 
    fuel_type_code_pudl TEXT, 
    fuel_type_code_aer TEXT, 
    prime_mover_code TEXT, 
    fuel_consumed_units FLOAT, 
    fuel_consumed_for_electricity_units FLOAT, 
    fuel_mmbtu_per_unit FLOAT, 
    fuel_consumed_mmbtu FLOAT, 
    fuel_consumed_for_electricity_mmbtu FLOAT, 
    net_generation_mwh FLOAT, 
    FOREIGN KEY(fuel_type_code_aer) REFERENCES fuel_type_aer_eia923 (abbr), 
    FOREIGN KEY(fuel_type) REFERENCES fuel_type_eia923 (abbr), 
    FOREIGN KEY(plant_id_eia, report_date) REFERENCES plants_eia860 (plant_id_eia, report_date), 
    FOREIGN KEY(prime_mover_code) REFERENCES prime_movers_eia923 (abbr)
)
print(sa.schema.CreateTable(pudl_meta.tables["boiler_fuel_eia923"]).compile(pudl_engine))
CREATE TABLE boiler_fuel_eia923 (
    plant_id_eia INTEGER, 
    boiler_id TEXT, 
    fuel_type_code TEXT, 
    fuel_type_code_pudl TEXT, 
    report_date DATE, 
    fuel_consumed_units FLOAT, 
    fuel_mmbtu_per_unit FLOAT, 
    sulfur_content_pct FLOAT, 
    ash_content_pct FLOAT, 
    FOREIGN KEY(plant_id_eia, boiler_id) REFERENCES boilers_entity_eia (plant_id_eia, boiler_id), 
    FOREIGN KEY(fuel_type_code) REFERENCES fuel_type_eia923 (abbr), 
    FOREIGN KEY(plant_id_eia, report_date) REFERENCES plants_eia860 (plant_id_eia, report_date)
)
zaneselvans commented 2 years ago

Maybe I am missing something but looking at the WITHOUT ROWID docs from SQLite it sounds like it's not supposed to be possible to create a table that has no rowid and also has no PRIMARY KEY sooooo... still very confused.

I also note that the new DB that is having this problem is exactly the same size as the previous DB, which didn't have this issue so the data is all in there somewhere, lost, and impossible to query apparently!

-rw-r--r-- 1 zane zane 616669184 Aug 14 20:04 pudl-2021-08-14.sqlite
-rw-r--r-- 1 zane zane 616669184 Sep  7 00:34 pudl.sqlite
zaneselvans commented 2 years ago

Now if I cut-and-paste the CREATE statements into SQLite, I get empty tables from which I can't select any rowid values but maybe that's just because they're empty. Creating these tables using the above schemas and inserting a row into each of them interactively works fine. They both end up with an automatically generated rowid column which serves as a pseudo key for the table.

ezwelty commented 2 years ago

The empty PrimaryKeyConstraint seems to be the sqlalchemy default:

import sqlalchemy as sa

t = sa.Table('table', sa.MetaData(), sa.Column('x'))
t.constraints
# {PrimaryKeyConstraint()}

As for this bizarre and different behavior between different tables lacking primary keys, I'm really not sure. You say that creating and loading data into SQLite is working interactively. So perhaps the problem is in the connection between Python and the database. Can you provide example code that reproduces the diverging behaviors you are describing? (i.e. "When data is inserted into them it seems to vanish")

zaneselvans commented 2 years ago

It turns out this was happening because no data was being written to the tables at all. I messed up the parsing of the ETL parameters and was accidentally excluding the empty tables from ever being written to. And the rowid field is only created once you've actually inserted rows into the table so... that's why it wasn't showing up. I've fixed the parameter parsing and now it seems to be acting normally again!

The "vanishing data" thing still seems odd to me though, like why was this DB exactly the same size as a DB I had output 3 weeks ago which worked fine? Seems very strange. I must be misunderstanding something.