catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
468 stars 107 forks source link

Outline process for manual/automated integration of XBRL data #1579

Closed zschira closed 2 years ago

zschira commented 2 years ago

The process for extracting/transforming FERC data to be compatible with historic FERC data is difficult to impossible to fully automate. In some cases, like the f1_fuel table, this is fairly easy. However, in others, like f1_plant_in_srvce the structure of the data differs significantly, and it's difficult to link XBRL data with historical data. This Visual FoxPro version of this table has the columns:

begin_yr_bal, addition, retirements, adjustments, transfers, yr_end_bal

While in the equivalent XBRL table, these 6 columns are split out into well over 100 columns, where each of those columns corresponds to a specific row_number in the historical data, and the begin_yr_bal and yr_end_bal have to be deduced using dates.

FERC's XBRL taxonomy includes References (more info here), which provide more info on an XBRL fact's physical location in the form, which is sometimes useful for dealing with problems like the one highlighted above.

The persistent problem is deciding when/what information to use to attempt automatically link data, and when manual intervention is required. This issue will track the development of a unified process to be applied to the initial set of tables of interest during the integration process.

zschira commented 2 years ago

List of tables used by RMI

Respondent ids and names

Balance sheet (assets) breakdown

Additional table handed to transmission team:

zschira commented 2 years ago

Strategy

After some discussion on the status of XBRL integration, we've decided to move away from trying to recreate the historical database. The structure of the data extracted from XBRL filings is an improvement over the historical data, and it probably doesn't make sense to try and go back to the worse structure. For the sake of PUDL, we will develop and maintain two separate ETL's for old/new FERC data. We will, however, provide estimates of the difficulty to create a mapping between new tables and old tables for the specific tables RMI is currently using (#1594).