catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Create a FERC Form 1 Super DB Schema #288

Closed zaneselvans closed 5 years ago

zaneselvans commented 5 years ago

Currently, the multi-annual FERC Form 1 DB that we build is based on a schema that's an awkward hybrid of two things: the mappings between unreadable DBF filenames and actual table names that we found online (which pertain to the 2015 data), and our parsing of the strings in the DBC file for whatever reference year we're trying to build a database for -- typically the most recent data release year.

This means that there are some kinds of data that we could be missing -- if tables and/or columns existed in the past but don't exist now, we'll probably miss them when reading in earlier years of data. This doesn't appear to be common, but it could trip someone up.

The "right way" to fix this seems like it's probably to create a super-schema that contains all of the tables that have ever existed and all of the columns that have ever existed, by parsing the DBC files for all of the data years... and crossing our fingers that the data types didn't change over the years, or there's one data type that can be imposed on each column which all the data can be forced into. We would also need to figure out the historical mappings between database table names and F1_whatevs.DBF filenames.

zaneselvans commented 5 years ago

I have verified that the 2015 & 2017 FERC Form 1 database schemas are identical, and that they include all the tables and fields which exist in all previous years of data going back to 1994, so the super-schema would be superfluous. An ETL test has been integrated (test/etl_test.py::test_ferc1_lost_data()) which verifies that this situation continues to be true (and checks whether any new tables have been added in future years, relative to our 2015 mapping of DBF files to database table names, so we can update that mapping if need be).
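
The gist of that check is a superset comparison along these lines (the real test lives in test/etl_test.py; `get_ferc1_schema()` here is a hypothetical helper returning `{table: set(columns)}` for a given year):

```python
def test_ferc1_lost_data():
    """Check that the reference year's schema contains every table and column
    that appears in any earlier year of FERC Form 1 data."""
    reference = get_ferc1_schema(2017)  # hypothetical helper
    for year in range(1994, 2017):
        old = get_ferc1_schema(year)
        for table, columns in old.items():
            assert table in reference, f"{table} ({year}) missing from reference schema"
            missing = columns - reference[table]
            assert not missing, f"{table} ({year}) lost columns: {missing}"
```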