create standard ferc1 (or sqlite?) extractor

cmgosnell commented 2 years ago

Our ferc1 extract step is... a LOT of copying and pasting, mostly just a standard SQL query with a table name and maybe one join. Or a specific select that should really be in the transform step.

Each output PUDL table will be derived from 1 or 2 input XBRL tables:

timestamp (instantaneous) columns
period (duration) columns

If the conditional select-and-merge of these tables can be done programmatically, using only what can be found in the XBRL-derived SQL DB plus a mapping between PUDL and XBRL table names, then it can be kept in the extract step. If necessary, store information about which tables have instant, duration, or both kinds of data in the extract module / step.

The hope is that this can be done entirely programmatically while storing only a 1-to-1 mapping of XBRL table to PUDL table names.

[x] Define a mapping of PUDL table names to XBRL root table names for the tables we are going to transform.
[x] Create a function that, given the root name of an XBRL table, identifies whether it has duration, instant, or both kinds of data in the XBRL DB.
[ ] Create a function that, depending on what combo of those tables appears in the XBRL DB, can merge them together (using SQL or pandas) appropriately into a single dataframe to be returned for transformation (leaving this for transform step).
[x] Create an XBRL extract wrapper that, given a Ferc1Settings object, can look up the roots of the corresponding XBRL table names, iterate over them extracting the appropriate years of data as single dataframes, and return a dictionary of dataframes.
[x] Create a higher level extract wrapper that, given a Ferc1Settings object, compiles a dictionary of both XBRL and DBF derived dataframes.
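
As a rough illustration of the second task above (detecting which kinds of tables exist for a given XBRL root name), here is a minimal sketch. It assumes the XBRL-derived SQLite DB names its tables `<root>_instant` and `<root>_duration`; that naming convention, the function name `xbrl_table_kinds`, and the toy table names are assumptions for illustration, not something stated in this issue.

```python
import sqlite3


def xbrl_table_kinds(conn: sqlite3.Connection, root: str) -> set[str]:
    """Return which of {"instant", "duration"} tables exist for an XBRL root name."""
    # Assumption: tables are named "<root>_instant" and "<root>_duration".
    candidates = {f"{root}_{kind}": kind for kind in ("instant", "duration")}
    placeholders = ", ".join("?" for _ in candidates)
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        f"WHERE type = 'table' AND name IN ({placeholders})",
        list(candidates),
    ).fetchall()
    return {candidates[name] for (name,) in rows}


# Toy in-memory database standing in for the XBRL-derived SQLite DB.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE steam_plants_instant (entity_id TEXT, date TEXT)")
conn.execute(
    "CREATE TABLE steam_plants_duration (entity_id TEXT, start_date TEXT, end_date TEXT)"
)
conn.execute(
    "CREATE TABLE fuel_duration (entity_id TEXT, start_date TEXT, end_date TEXT)"
)

kinds_both = xbrl_table_kinds(conn, "steam_plants")  # both tables exist
kinds_one = xbrl_table_kinds(conn, "fuel")           # only the duration table exists
kinds_none = xbrl_table_kinds(conn, "missing")       # neither table exists
```

Checking `sqlite_master` keeps the only stored information down to the table-name mapping, which is the hope expressed above.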

Other required changes

[x] Update the Ferc1Settings objects to store the same PUDL output tables for both XBRL and DBF inputs. Since XBRL and DBF years are disjoint sets, we could also just dispense with having different settings objects, and have it do the right thing internally based on which year of data is being processed. This logic should probably be in the FERC1 extract module.
[x] Tweak transform input expectations to take a single dataframe for each input table (assuming the duration/instant merge happens uniformly in the extract step) and move the existing merge code that Christina has written over to the extract module.
[x] (Potentially) also switch to a much simpler DBF extract step which uses a 1-to-1 PUDL table to DBF table name mapping, and simply selects the appropriate years of data from the FERC 1 DBF derived SQLite DB, moving the current dropping of some rows etc. into the existing transform functions.

zaneselvans commented 2 years ago

I think what I would want to do here is probably go to a fairly pure SQL solution. Each table just has an SQL query that extracts it from the source DB (currently SQLite) in its entirety, and hands it off as a dataframe. If there's a join that needs to happen to incorporate other required information that's only available in the DB, that would happen in the extraction too. But no aggregation, or dropping of records or columns etc. -- leave all of that for the transform step.
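
A minimal sketch of what that could look like, using `sqlite3` and `pandas.read_sql_query`. The function name `extract_table`, the toy table `f1_fuel`, and the `report_year` filter column are all assumptions made for illustration; the real schema may differ.

```python
import sqlite3

import pandas as pd


def extract_table(conn: sqlite3.Connection, table: str, years: list[int]) -> pd.DataFrame:
    """Select every column of one table for the requested years, with no other filtering."""
    # Table names cannot be bound as SQL parameters, so check the name against
    # the tables that actually exist before interpolating it into the query.
    known = {
        name
        for (name,) in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    }
    if table not in known:
        raise ValueError(f"Unknown table: {table}")
    placeholders = ", ".join("?" for _ in years)
    # Assumption: each table carries a report_year column to select years on.
    query = f'SELECT * FROM "{table}" WHERE report_year IN ({placeholders})'
    return pd.read_sql_query(query, conn, params=list(years))


# Toy in-memory database standing in for the source SQLite DB.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE f1_fuel (respondent_id INTEGER, report_year INTEGER, fuel_cost REAL)")
conn.executemany(
    "INSERT INTO f1_fuel VALUES (?, ?, ?)",
    [(1, 2019, 10.0), (1, 2020, 11.5), (2, 2021, 9.25)],
)

fuel_df = extract_table(conn, "f1_fuel", [2020, 2021])
```

Nothing is aggregated or dropped beyond the year selection, so any cleanup stays in the transform step.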

zschira commented 2 years ago

I have created a generic extractor for both the DBF and XBRL data that uses raw SQL queries to select the desired tables/years specified in the Ferc1Settings object. It also consolidates the settings so there is only one settings object for both sets of data. The implementation returns a dictionary of dataframes for each dataset. The DBF dictionary will look unchanged, while the XBRL dictionary will have a nested dictionary that maps each table name to its instant- and duration-specific tables:

{
    "table_name": {
        "instant": extracted_instant_dataframe,
        "duration": extracted_duration_dataframe,
    },
    ...
}

This will leave the actual joining of the two tables to the transform step. It will also leave the filtering that was being done in extract to the transform step. I've not made the changes to the transform step yet to avoid conflicts.
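
For anyone picking up the transform side, here is a rough sketch of how a nested dictionary of that shape could be built. This is not the actual implementation: the function name `extract_xbrl`, the `<root>_instant` / `<root>_duration` table naming, and the choice to return an empty dataframe when a table is missing are all assumptions for illustration.

```python
import sqlite3

import pandas as pd


def extract_xbrl(conn: sqlite3.Connection, roots: list[str]) -> dict[str, dict[str, pd.DataFrame]]:
    """Build {root: {"instant": df, "duration": df}} without joining or filtering."""
    existing = {
        name
        for (name,) in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    }
    raw: dict[str, dict[str, pd.DataFrame]] = {}
    for root in roots:
        raw[root] = {}
        for kind in ("instant", "duration"):
            # Assumption: tables are named "<root>_instant" / "<root>_duration".
            table = f"{root}_{kind}"
            if table in existing:
                raw[root][kind] = pd.read_sql_query(f'SELECT * FROM "{table}"', conn)
            else:
                # Illustrative choice: an empty dataframe marks a missing table.
                raw[root][kind] = pd.DataFrame()
    return raw


# Toy in-memory database standing in for the XBRL-derived SQLite DB.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE steam_plants_instant (entity_id TEXT, date TEXT)")
conn.execute("CREATE TABLE steam_plants_duration (entity_id TEXT, start_date TEXT, end_date TEXT)")
conn.execute("CREATE TABLE fuel_duration (entity_id TEXT, start_date TEXT, end_date TEXT)")
conn.execute("INSERT INTO steam_plants_instant VALUES ('C000001', '2021-12-31')")
conn.execute("INSERT INTO fuel_duration VALUES ('C000001', '2021-01-01', '2021-12-31')")

raw_xbrl = extract_xbrl(conn, ["steam_plants", "fuel"])
```

The transform step can then decide per table how to combine `raw_xbrl[root]["instant"]` and `raw_xbrl[root]["duration"]`.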

zschira commented 2 years ago

Closing, as I've handed off the updated extract functions and merged them into the steam transform branch.

catalyst-cooperative / pudl

create standard ferc1 (or sqlite?) extractor #1738
