catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
477 stars 110 forks source link

Compile row maps for key FERC Form 1 tables #465

Closed zaneselvans closed 1 year ago

zaneselvans commented 5 years ago

Many of the FERC Form 1 tables of interest report data as a function of both a numbered row, and a named (database) column. Unfortunately, while the database columns have remained almost completely unchanged over the years for which the digital data is available, the row numbers associated with individual variables changes over time, as new rows are inserted or (possibly?) as existing rows are deleted. This means the ETL can't reliably map a given row-column combination from the raw FERC 1 database to a particular output column in the PUDL database.

To get around this problem, we need to create a year-by-year mapping of FERC 1 row number to PUDL DB column name, and for each of those rows-become-columns, there will then also be N-columns in the PUDL output table, corresponding to each of the columns that the row had a number in in the original FERC 1 DB (often these are things like income, expense, and the net value of a financial quantity).

This process is similar to what we've done for the EIA 860 and EIA 923 in the tab maps, which keep track of which spreadsheet column names map to what PUDL database column names on a year-by-year basis.

Initially the FERC 1 tables of interest include those that we've already integrated... to ensure that we have all the rows coming in correctly for all the years, given that we're publishing that data now:

For each one of these tables, and for each year, we need to map the reported row numbers to newly named PUDL DB columns. The contents of a given row number in a given FERC 1 table for a given year can be looked up in the f1_row_lit_tbl FERC 1 database table. Those row names need to be used to generate column names, and for each of those columns, the appropriate row number within the table needs to be recorded in the row maps. If a column does not exist in a given year, it is given a "row number" of -1 to indicate it is missing.

This process is tedious, but not super difficult. The most challenging part is probably ensuring that we use naming conventions which are as consistent as possible, both within the individual tables and between them.

zaneselvans commented 5 years ago

Useful quickie helper function for grabbing the info required to make these maps...

def get_row_literals(table_name, report_year, ferc1_engine):
    row_literals = (
        pd.read_sql("f1_row_lit_tbl", ferc1_engine)
        .query(f"sched_table_name=='{table_name}'")
        .query(f"report_year=={report_year}")
        .sort_values("row_number")
    )
    return row_literals
zaneselvans commented 1 year ago

This has been subsumed by our XBRL work. See #1574