catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
468 stars 107 forks source link

create standard ferc1 (or sqlite?) extractor #1738

Closed cmgosnell closed 2 years ago

cmgosnell commented 2 years ago

Our ferc1 extract step is... ALOT of copy and pasting w/ mostly just a standard sql query with a table name and mayyybe one join. Or a specific select that should really be in the transform step.

Each output PUDL table will be derived from 1 or 2 input XBRL tables:

If the conditional select-and-merge of these tables can be done programmatically without any information beyond what can be found in the XBRL derived SQL DB, and a mapping between PUDL and XBRL table names, then it can be kept in the extract step. If necessary, store information about which tables have instant, vs. duration, vs both in the extract module / step.

The hope is that this can be done entirely programmatically while storing only a 1-to-1 mapping of XBRL table to PUDL table names.

Other required changes

zaneselvans commented 2 years ago

I think what I would want to do here is probably go to a fairly pure SQL solution. Each table just has an SQL query that extracts it from the source DB (currently SQLite) in its entirety, and hands it off as a dataframe. If there's a join that needs to happen to incorporate other required information that's only available in the DB, that would happen in the extraction too. But no aggregation, or dropping of records or columns etc. -- leave all of that for the transform step.

zschira commented 2 years ago

I have created a generic extractor for both the DBF and XBRL data that uses raw SQL queries to select the desired tables/years specified in Ferc1Settings object. It also consolidates the settings so there is only one settings object for both sets of data. The implementation I've created will return a dictionary of dataframes for each dataset. The DBF dictionary will look unchanged, while the XBRL dictionary will have a nested dictionary which maps the instant/duration specific tables:

{
    "table_name": {
        "instant": extracted_instant_dataframe,
        "duration": extracted_duration_dataframe,
    },
    ...
}

This will leave the actual joining of the two tables to the transform step. It will also leave the filtering that was being done in extract to the transform step. I've not made the changes to the transform step yet to avoid conflicts.

zschira commented 2 years ago

Closing as I've handed off the updated extract functions and merged to steam transform branch.