zschira commented 2 years ago

Background

Before we can begin integrating XBRL data into existing pipelines (like PUDL and RMI's pipeline), we must develop tools for working with XBRL data. This epic tracks the ongoing work to develop a tool for extracting data from XBRL filings, as well as other important infrastructure, like archiving the filings distributed through FERC's new RSS feed.

Known irregularities

Many of the tables in the XBRL data do not have any equivalent to row_num and spplmnt_num from the historical data. These fields are used to uniquely identify records, so this seems like a problem, but I believe an equivalent to these fields is (hopefully) always included when records cannot be uniquely identified by other fields. For example, the table 410 - Schedule - Generating Plant Statistics (which is equivalent to f1_gnrt_plant) has a column GeneratingPlantStatisticsAxis, which contains values of the form {spplmnt_num}-{row_num}.
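As a rough illustration of how such an axis value could be split back into the two historical key fields, here is a minimal sketch. It assumes every value really does follow the {spplmnt_num}-{row_num} pattern with integer parts; the helper name `split_axis_value`, the `plant_name` field, and the sample values are made up for the example and do not come from real filings.

```python
def split_axis_value(axis_value: str) -> tuple[int, int]:
    """Split a '{spplmnt_num}-{row_num}' string into two integers.

    Raises ValueError if the value does not match the assumed pattern.
    """
    spplmnt_num, row_num = axis_value.split("-", maxsplit=1)
    return int(spplmnt_num), int(row_num)


# Hypothetical extracted records; only the axis column name is from the issue.
records = [
    {"GeneratingPlantStatisticsAxis": "0-1", "plant_name": "Plant A"},
    {"GeneratingPlantStatisticsAxis": "0-2", "plant_name": "Plant B"},
    {"GeneratingPlantStatisticsAxis": "1-1", "plant_name": "Plant C"},
]

# Add the historical key fields to each record.
for record in records:
    spplmnt_num, row_num = split_axis_value(record["GeneratingPlantStatisticsAxis"])
    record["spplmnt_num"] = spplmnt_num
    record["row_num"] = row_num
```

Failing loudly on values that do not match the pattern seems preferable here, since silently dropping them would hide exactly the kind of irregularity this section is trying to catalogue.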

The easiest way to automate the transformation of extracted XBRL data into a form compatible with the historical data is to rely on the order of columns in each table, because the column names differ enough that matching columns by name is difficult. This is explored in this notebook. Unfortunately, the columns are not always in the same order. For example, in the table f1_steam the column asset_retire_cost is near the end of the table among the footnote columns, while it is towards the middle in the equivalent XBRL table. The rest of the columns are in the same order, but this one column needs to be accounted for in some way.
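One way to account for a column like asset_retire_cost is to pair columns by position but let a small set of explicit overrides take precedence. The sketch below is only an illustration of that idea, not the approach taken in the notebook: the function name `map_columns_by_position` and every column name other than asset_retire_cost are placeholders.

```python
def map_columns_by_position(xbrl_columns, historical_columns, moved=None):
    """Return {xbrl_name: historical_name}, pairing columns by order.

    `moved` holds explicit {xbrl_name: historical_name} pairs for columns
    whose position differs between the two tables. Those columns are taken
    out of both lists before the remaining columns are paired positionally.
    """
    moved = moved or {}
    xbrl_rest = [col for col in xbrl_columns if col not in moved]
    hist_rest = [col for col in historical_columns if col not in moved.values()]
    if len(xbrl_rest) != len(hist_rest):
        raise ValueError("Column counts differ after removing moved columns")
    mapping = dict(zip(xbrl_rest, hist_rest))
    mapping.update(moved)
    return mapping


# Placeholder column lists: one column sits mid-table in XBRL but last historically.
xbrl_columns = ["ColA", "AssetRetirementCost", "ColB", "ColC"]
historical_columns = ["col_a", "col_b", "col_c", "asset_retire_cost"]

column_map = map_columns_by_position(
    xbrl_columns,
    historical_columns,
    moved={"AssetRetirementCost": "asset_retire_cost"},
)
```

The length check only catches mismatched column counts. A table where two columns swapped places would still pair up silently, so the resulting mapping would need a manual review.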

While most tables have essentially the same structure with different column names, some are structured differently. For example, f1_plant_in_srvce contains the columns begin_yr_bal and yr_end_bal. In the XBRL data, however, these values are reported in the same column, with different dates to distinguish them. This is not a particularly difficult situation to deal with, but irregularities like this may prove difficult to identify in an automated way.
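To make the begin_yr_bal / yr_end_bal case concrete, the sketch below reshapes date-keyed values into the two historical columns. It assumes each account has exactly two dated values and that the earlier date is the beginning-of-year balance; the function name `pivot_balances`, the field names `account`, `date`, and `value`, and the sample figures are all hypothetical.

```python
from collections import defaultdict
from datetime import date


def pivot_balances(rows):
    """Turn one-value-per-date rows into begin_yr_bal / yr_end_bal columns."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["account"]].append((row["date"], row["value"]))

    wide = []
    for account, dated_values in grouped.items():
        if len(dated_values) != 2:
            raise ValueError(f"Expected two dated values for {account!r}")
        # Sort by date only: the earlier value is treated as the opening balance.
        dated_values.sort(key=lambda pair: pair[0])
        (_, begin), (_, end) = dated_values
        wide.append({"account": account, "begin_yr_bal": begin, "yr_end_bal": end})
    return wide


# Made-up long-format rows, with one value per account per date.
long_rows = [
    {"account": "land", "date": date(2021, 12, 31), "value": 120.0},
    {"account": "land", "date": date(2020, 12, 31), "value": 100.0},
    {"account": "structures", "date": date(2020, 12, 31), "value": 50.0},
    {"account": "structures", "date": date(2021, 12, 31), "value": 55.0},
]

wide_rows = pivot_balances(long_rows)
```

Raising on anything other than two values per account is a deliberate choice for the sketch, since a third date would be a sign of an irregularity that needs a human decision.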

List of tables used by RMI

Respondent ids and names

f1_respondent_id

Balance sheet (assets) breakdown

f1_comp_balance_db
f1_utltyplnt_smmry
f1_plant_in_srvce
f1_accumdepr_prvsn

Balance sheet (liabilities) breakdown

f1_bal_sheet_cr
f1_retained_erng

Income statement breakdown

f1_income_stmnt
f1_incm_stmnt_2
f1_elctrc_oper_rev
f1_dacs_epda
f1_elc_op_mnt_expn

Plant tables, for both balance sheet and income statement breakdown, to be overwritten by ferc-eia-deprish in the future

f1_steam
f1_hydro
f1_pumped_storage
f1_gnrt_plant

For purchased power breakout:

f1_elctrc_erg_acct
f1_purchased_power

Additional table handed off to optimus:

f1_cash_flow

Additional table handed to transmission team:

f1_xmssn_line

Have looked at, might use in the near future:

f1_sales_by_sched
f1_othr_reg_assets
f1_othr_reg_liab

Tasks (to be turned into issues)
[x] #1579
[x] #1594
[x] #1595
[x] #1593
[x] #1629
[x] #1630

bendnorman commented 2 years ago

Can this epic be closed, @zschira?

zschira commented 2 years ago

Yeah this should be closed

catalyst-cooperative / pudl

Create infrastructure to extract data from FERC XBRL filings #1568
