Refactor `fuel_ferc1` transform for XBRL + DBF inputs

cmgosnell commented 2 years ago

Adapt the fuel_ferc1 transformation process to use the new abstractions developed in #1739, and to accommodate raw inputs from both the old DBF and new XBRL data.

DBF specific transforms

[x] rename columns (inherited)
[x] drop footnote columns (inherited)
[x] assign source record IDs (DBF) (inherited)
[x] drop remaining unused DBF columns
[x] normalize_strings (formerly simplify strings)
[x] categorize_strings (since it has to happen independently for XBRL)
[x] convert NA categorized strings to pd.NA values
[x] convert units (they're different for XBRL & DBF)

XBRL specific transforms

[x] rename columns (inherited)
[x] merge instant + duration tables
[x] categorize strings (required for record ID generation)
[x] aggregate duplicate fuel records
[x] assign source record IDs (XBRL)
[x] convert units (they're different for XBRL & DBF)
[x] normalize_strings (formerly simplify strings)
[x] Assign utility_id_ferc1 (stand-in)

Table-specific post-concatenation transforms

[x] nullify_outliers (formerly oob_to_nan)
[x] Drop rows missing all data columns
[x] multiplicative data entry errors / unit corrections
- [x] parameterize the corrections for transparency
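
A minimal sketch of what parameterized multiplicative corrections followed by outlier nullification could look like in pandas. The function names, the shape of the parameter dictionaries, and the numeric ranges are all hypothetical and made up for illustration; they are not the PUDL implementation or its real bounds.

```python
import pandas as pd


def apply_corrections(col: pd.Series, corrections: list[dict]) -> pd.Series:
    """Scale values that fall inside each [lower, upper] window (hypothetical helper)."""
    out = col.copy()
    for c in corrections:
        in_range = out.between(c["lower"], c["upper"])
        # Keep values outside the window, rescale the ones inside it.
        out = out.where(~in_range, out * c["multiplier"])
    return out


def nullify_outliers(col: pd.Series, lower: float, upper: float) -> pd.Series:
    """Replace values outside [lower, upper] with NA (hypothetical helper)."""
    return col.where(col.between(lower, upper))


# Made-up heat content values; the second one looks like it was reported
# in Btu rather than MMBtu, and the third is not plausible in either unit.
heat = pd.Series([24.0, 24_000_000.0, 500.0])

# Illustrative parameters only: keeping them as data makes each correction visible.
corrections = [{"lower": 1e7, "upper": 1e8, "multiplier": 1e-6}]

fixed = apply_corrections(heat, corrections)
cleaned = nullify_outliers(fixed, lower=10.0, upper=30.0)
```

Keeping the windows and multipliers in a plain data structure means the assumptions behind each correction can be reviewed without reading the transform code.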

Generic FERC Form 1 final transformations

[x] Drop any remaining columns that aren't part of the target DB table
[x] Enforce PUDL data types on the remaining columns

Must Fix

[x] Unit conversion issues:
- [x] Review and clarify old DBF renaming for fuel table to ensure column names always reflect column contents.
- [x] Use the categorized fuel_units column to do a first round of unit standardization before attempting to correct units.
- [x] Update the fuel_units column to reflect the results of our error corrections and initial units assumptions.
- [x] Update allowed heat content ranges to reflect the values in energy_sources_eia table.

Other Loose Ends

[x] Require string categories to include the string they map to so they are idempotent
[x] Warn when there are uncategorized strings in the column being categorized
[x] Verify that string categories are disjoint sets
[x] Figure out why some strings that are categorized are showing up as uncategorized...
[x] Separate normalize_strings functionality from categorize_strings
[x] Make the na category in categorize_strings more unique.
[x] Replace source_ferc1: Literal["dbf", "xbrl"] with a Pydantic model to simplify error checking everywhere.
[x] Require that Ferc1AbstractTableTransformer.table_id be a valid ferc1 database table name.
[x] Make sure every transform step has some logging (at least at DEBUG level). Make sure multi-column transforms are created with an appropriate name (they shouldn't all be called the same thing...)
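
The string categorization requirements above (idempotent categories, disjoint sets, a warning on uncategorized strings) can be sketched roughly as follows. This is an illustration under assumed names: `check_categories`, the function signature, and the example category mapping are hypothetical, not the actual PUDL `categorize_strings` code.

```python
import warnings

import pandas as pd


def check_categories(categories: dict[str, set[str]]) -> None:
    """Validate a {category: raw strings} mapping (hypothetical helper)."""
    seen: dict[str, str] = {}
    for category, strings in categories.items():
        # Idempotence: each category must contain the string it maps to.
        if category not in strings:
            raise ValueError(f"{category!r} must be a member of its own category")
        # Disjointness: no raw string may belong to two categories.
        for s in strings:
            if s in seen:
                raise ValueError(f"{s!r} is in both {seen[s]!r} and {category!r}")
            seen[s] = category


def categorize_strings(col: pd.Series, categories: dict[str, set[str]]) -> pd.Series:
    """Map raw strings to categories, warning about anything left unmapped."""
    check_categories(categories)
    lookup = {s: cat for cat, strings in categories.items() for s in strings}
    out = col.map(lookup)
    uncategorized = set(col[out.isna() & col.notna()])
    if uncategorized:
        warnings.warn(f"Uncategorized strings: {sorted(uncategorized)}")
    return out


# Example mapping, invented for illustration.
fuel_categories = {
    "coal": {"coal", "bituminous"},
    "gas": {"gas", "natural gas"},
}
raw = pd.Series(["bituminous", "natural gas", "coal"])
once = categorize_strings(raw, fuel_categories)
# Because each category contains itself, a second pass changes nothing.
twice = categorize_strings(once, fuel_categories)
```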

Issues resulting from or related to this issue

Allow NA values in fuel_type_code_pudl. See issue #1344
Replace cleanstrings with categorize_strings + normalize_strings elsewhere in the codebase. See #1770
Replace simplify_strings with normalize_strings elsewhere in the codebase. See #1771
#1875
#1876
#1877
#1878

FERC 1 specific questions:

Question: The XBRL columns (filing_name, index) uniquely identify a lot of data. Should we really be dropping them? Would they not be appropriate for the record_id values on the XBRL side?
- Answer: Because the same entity can submit multiple updated filings for the same plant in the same year, this UUID is necessary in the XBRL / RSS feed to identify unique filings. However we only extract the most recent filing by each responding entity when we convert to the SQLite DB, so we don't need to retain these UUIDs to ensure unique filings at this stage in the process.

Note: the fuel_ferc1 transform has to be done before plants_steam_ferc1 because the algorithmic assignment of plant_id_ferc1 values depends on fuel information, so this issue is blocking #1707

cmgosnell commented 2 years ago

There are 126 records where a plant has two records with the same fuel_type_code_pudl. The original FuelTypeAxis values are unique, but the cleaned codes are not.

Because of this, it breaks the new convention that we can use the XBRL Axis columns as primary keys when creating the record_id.

We could use a hash or something other than a composite key, but in an attempt to preserve the primary key, the idea right now is to condense/aggregate these duplicate records.

condense them

there are a slew of record pairs in which one record is effectively empty except for one data point.

for these, i built a little function called condense_sets_of_records_with_one_datapoint_in_second_record (omigosh please halp me rename this)
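
A rough sketch of how that kind of condensing could work in pandas: collapse rows that share a key only when their data columns never conflict, and leave every other group untouched. The function name and column names here are hypothetical, chosen for illustration, and are not the function mentioned above.

```python
import pandas as pd


def condense_sparse_duplicates(
    df: pd.DataFrame, key_cols: list[str], data_cols: list[str]
) -> pd.DataFrame:
    """Merge duplicate-key rows whose data columns do not overlap (hypothetical)."""
    # Per-row count of non-null values in each data column within the row's group.
    counts = df.groupby(key_cols)[data_cols].transform("count")
    # A row is condensable if no data column has more than one value in its group.
    is_condensable = (counts <= 1).all(axis="columns")
    # GroupBy.first() returns the first non-null value per column in each group.
    condensed = (
        df[is_condensable].groupby(key_cols)[data_cols].first().reset_index()
    )
    untouched = df.loc[~is_condensable, key_cols + data_cols]
    return pd.concat([condensed, untouched], ignore_index=True)


# Made-up records: plant A's second row only fills in one missing data point,
# while plant B's two rows both report fuel_consumed_units and so conflict.
fuel = pd.DataFrame(
    {
        "plant_name": ["A", "A", "B", "B"],
        "fuel_type_code_pudl": ["gas", "gas", "coal", "coal"],
        "fuel_consumed_units": [100.0, None, 5.0, 7.0],
        "fuel_cost_per_mmbtu": [None, 3.5, 2.0, None],
    }
)
result = condense_sparse_duplicates(
    fuel,
    key_cols=["plant_name", "fuel_type_code_pudl"],
    data_cols=["fuel_consumed_units", "fuel_cost_per_mmbtu"],
)
```

Groups with conflicting values are deliberately passed through unchanged so that they can be handled by aggregation or dropped in a separate step.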

aggregate them

i'll probably use pudl.helpers.sum_and_weighted_average_agg()
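
For reference, the general idea of summing some columns while taking a weighted average of others can be sketched in plain pandas as below. This does not reproduce the signature or behavior of pudl.helpers.sum_and_weighted_average_agg(); the function, its parameters, and the column names are assumptions made for illustration.

```python
import pandas as pd


def sum_and_weighted_average(
    df: pd.DataFrame,
    by: list[str],
    sum_cols: list[str],
    wtavg_cols: dict[str, str],
) -> pd.DataFrame:
    """Sum sum_cols and weight-average wtavg_cols within each group (hypothetical).

    wtavg_cols maps each column to average onto its weight column, and every
    weight column is assumed to also appear in sum_cols.
    """
    work = df[by + sum_cols].copy()
    for col, weight in wtavg_cols.items():
        # Temporary value-times-weight column, summed alongside everything else.
        work[f"_{col}_x_wt"] = df[col] * df[weight]
    sums = work.groupby(by).sum()
    out = sums[sum_cols].copy()
    for col, weight in wtavg_cols.items():
        out[col] = sums[f"_{col}_x_wt"] / sums[weight]
    return out.reset_index()


# Two made-up gas records for the same plant that need to become one.
dupes = pd.DataFrame(
    {
        "plant_name": ["A", "A"],
        "fuel_type_code_pudl": ["gas", "gas"],
        "fuel_consumed_units": [100.0, 300.0],
        "fuel_cost_per_unit": [2.0, 4.0],
    }
)
agg = sum_and_weighted_average(
    dupes,
    by=["plant_name", "fuel_type_code_pudl"],
    sum_cols=["fuel_consumed_units"],
    wtavg_cols={"fuel_cost_per_unit": "fuel_consumed_units"},
)
```

Weighting the per-unit cost by units consumed keeps the implied total cost the same before and after aggregation: (100 × 2.0 + 300 × 4.0) / 400 = 3.5.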

delete/null them?

these nuclear records seem.... unsavable. idk what to do with them

zaneselvans commented 2 years ago

I'm calling this issue ready for review, and have created several separate smaller issues pertaining to testing & transform parameter validation, which I'll move on to now @cmgosnell:

#1878
#1877
#1876

zaneselvans commented 2 years ago

With #1903 and #1900 getting merged into xbrl_steam, this issue is done.
