catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
469 stars 108 forks source link

Create infrastructure to extract data from FERC XBRL filings #1568

Closed zschira closed 2 years ago

zschira commented 2 years ago

Background

Before we can begin integrating XBRL data into existing pipelines (like PUDL and RMI's pipeline) we must develop tools for working with XBRL data. This epic tracks the ongoing work to develop a tool for extracting data from XBRL filings, as well other important infrastructure, like archiving filings distributed with FERC's new RSS feed.

Known irregularities

Many of the tables in the XBRL data do not have any equivalent to row_num and spplmnt_num from the historical data. These fields are used to uniquely identify records, so this seems like a problem, but I believe an equivalent to these fields is (hopefully) always included when records can not be uniquely identified by other fields. For example, the table 410 - Schedule - Generating Plant Statistics (which is equivalent to f1_gnrt_plant) has a column GeneratingPlantStatisticsAxis, which contains values in the form {spplmnt_num}-{row_num}.

The easiest way to automate the transformation from extracted XBRL data to be compatible with historical data is to use the order of columns in tables. This is because the column names are different enough to make it difficult to match columns by name. This is explored in this notebook. Unfortunately, the columns are not always in the same order, however. For example, in the table f1_steam the column asset_retire_cost is near the end of the table among footnote columns, while it is towards the middle in the equivalent XBRL table. The rest of the columns are in the same order, but this one column needs to be accounted for in some way.

While most tables contain essentially the same structure with different column names, there are some columns with different structures. For example, f1_plant_in_srvce contains the columns begin_yr_bal and yr_end_bal. In the XBRL data, however, these values are reported in the same column, but with different dates to identify them. This is not a particularly difficult situation to deal with, but irregularities like this may prove to be difficult to identify in an automated way.

List of tables used by RMI

Respondent ids and names

Balance sheet (assets) breakdown

Additional table handed to transmission team:

bendnorman commented 2 years ago

Can this epic be closed @zschira ?

zschira commented 2 years ago

Yeah this should be closed