catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
471 stars 108 forks source link

SQLite not checking integrity of foreign key constraints? #787

Closed zaneselvans closed 3 years ago

zaneselvans commented 3 years ago

I just attempted to run the ETL on ferc1, eia860 and eia923 including the 2019 data, even though the EIA-to-FERC glue (assigning plants and utilities their PUDL IDs by hand) hasn't been done yet. I expected the datapackages to get generated just fine, but for the loading of those datapackages into SQLite to fail. However, that's not what happened.

So, are the foreign key constraints ever getting tested? We've been relying on the database ingest to validate this aspect of the database structure.

zaneselvans commented 3 years ago

Holycrap, SQLite does not enforce foreign key constraints by default :facepalm:

However you can turn enforcement on within any given SQLite connection Example here. To make this work in our case we need to use hooks that automatically emit the

PRAGMA foreign_keys=ON;

every time an SQLite3 connection is made, since that's happening down inside of the Datapackage.save() method using an engine that we pass in.

I suspect we are going to have a bunch of bad foreign key relationships to chase down now...

zaneselvans commented 3 years ago

The datapkg_to_sqlite script now has a --fkey option which tells it to enforce these relationships and it seems to be functioning well, so I think this issue has been addressed.