catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Integrate FERC 2021-2022 data with new extractor #2811

Closed jdangerx closed 8 months ago

jdangerx commented 10 months ago

Follow-up to #2810 :

Now that we've gotten the new, more comprehensive extractor working for 2021 data, we should try to get it working for both 2021 and 2022 data.

Note: we're starting work on this before #2810 is complete - there are still some test updates necessary there but we're working on those in parallel.

## Scope
- [x] `ferc_to_sqlite` tasks complete successfully, targeting 2021-2022 data;
- [x] `norm_ferc1` task group completes successfully with new archives of 2021-2022 data
- [x] data validation tests either pass or are updated
- [ ] https://github.com/catalyst-cooperative/ferc-xbrl-extractor/pull/151

Individual error classes below.

If you fix these errors but the assets still don't build, that's fine - we'll merge the fixes into this branch and catalog the new errors.

## Error messages
- [ ] https://github.com/catalyst-cooperative/pudl/issues/2883
- [ ] https://github.com/catalyst-cooperative/pudl/issues/2875
- [ ] https://github.com/catalyst-cooperative/pudl/issues/2876
- [ ] https://github.com/catalyst-cooperative/pudl/issues/2877
- [ ] https://github.com/catalyst-cooperative/pudl/issues/2878
- [ ] https://github.com/catalyst-cooperative/pudl/issues/2879
- [ ] https://github.com/catalyst-cooperative/pudl/issues/2897
- [ ] https://github.com/catalyst-cooperative/pudl/issues/2899
- [ ] Non-unique values in index for FERC1 assets with 2021-2022 XBRL data
- [ ] Ambiguous start/end values per year while converting instants to durations for FERC1 assets with 2021-2022 XBRL data
- [ ] Duplicate primary keys for `electric_energy_dispositions_ferc1` with 2021-2022 XBRL data
```
  File "/Users/dazhong-catalyst/work/pudl/src/pudl/transform/ferc1.py", line 2158, in merge_instant_and_duration_tables_xbrl
    out_df = pd.concat(
             ^^^^^^^^^^
pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects
```

```
  File "/Users/dazhong-catalyst/work/pudl/src/pudl/transform/ferc1.py", line 609, in unstack_balances_to_report_year_instant_xbrl
    raise AssertionError(
AssertionError: Looks like there are multiple entries per year--not sure which to use for the start/end balance. params=UnstackBalancesToReportYearInstantXbrl(unstack_balances_to_report_year=True) primary_key_cols=['report_year', 'entity_id', 'electric_plant_classification_axis', 'utility_type_axis']
```

```
  File "/Users/dazhong-catalyst/work/pudl/src/pudl/metadata/classes.py", line 1506, in enforce_schema
    raise ValueError(
ValueError: electric_energy_dispositions_ferc1 Duplicate primary keys when enforcing schema.
```
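All three of these boil down to non-unique primary keys in the 2021-2022 XBRL data. For anyone reproducing this, the core check is easy to sketch with pandas (toy data and a hypothetical frame below; this is not PUDL's actual `enforce_schema` implementation):

```python
import pandas as pd

# Toy frame with a duplicated (report_year, entity_id) primary key,
# mimicking what the 2022 XBRL data did to electric_energy_dispositions_ferc1.
df = pd.DataFrame({
    "report_year": [2022, 2022, 2021],
    "entity_id": ["C000001", "C000001", "C000001"],
    "value": [1.0, 2.0, 3.0],
})
pk_cols = ["report_year", "entity_id"]

# keep=False marks every row that shares a primary key with another row.
dupes = df[df.duplicated(subset=pk_cols, keep=False)]
if not dupes.empty:
    # PUDL raises a ValueError here; this sketch just reports the count.
    print(f"{len(dupes)} rows violate the primary key {pk_cols}")
```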
jdangerx commented 9 months ago

The good news is that xbrl2sqlite works with the 2022 data; the other news is that there are a handful of different errors to crank through:


```
  File "/Users/dazhong-catalyst/work/pudl/src/pudl/transform/ferc1.py", line 609, in unstack_balances_to_report_year_instant_xbrl
    df.loc[df.report_year == (df.year + 1), "balance_type"] = "starting_balance"
    ~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: cannot set a frame with no defined index and a scalar
```

```
AssertionError: Unexpected number of columns dropped in wide_to_tidy:
```

```
  File "/Users/dazhong-catalyst/work/pudl/src/pudl/transform/ferc1.py", line 1236, in transform_end
    raise ValueError(
ValueError: plants_hydro_ferc1: Column capex_per_mw is entirely NULL!
```

(or, you know, other null columns)

```
  File "/Users/dazhong-catalyst/work/pudl/src/pudl/transform/ferc1.py", line 287, in wide_to_tidy
    df_out = df.set_index(params.idx_cols).filter(regex=pat)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: "None of ['plant_function'] are in the columns"
```

(or other columns not in index)

```
  File "/Users/dazhong-catalyst/work/pudl/src/pudl/transform/ferc1.py", line 2899, in aggregate_duplicate_fuel_types_xbrl
    agg_row_fraction = (len(fuel_pk_dupes) + len(fuel_multi_unit)) / len(fuel_xbrl)
                       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~
ZeroDivisionError: division by zero
```
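That last one fires because `fuel_xbrl` can be empty for some table/year combinations. A guard along these lines would avoid the crash (an illustrative sketch with a hypothetical helper name, not the fix that actually landed):

```python
def safe_agg_row_fraction(n_pk_dupes: int, n_multi_unit: int, n_total: int) -> float:
    """Fraction of rows needing aggregation; 0.0 when the table is empty."""
    if n_total == 0:
        # An empty table has nothing to aggregate, so report 0.0 rather
        # than dividing by zero.
        return 0.0
    return (n_pk_dupes + n_multi_unit) / n_total
```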
jdangerx commented 9 months ago

It looks like we have some remapping to do, to get valid 2022 data into the pipeline.

For starters, we have the following 4 entity_ids in the XBRL data that don't appear in our mapping yet:

```
    entity_id                    respondent_legal_name
176   C004679                 Transource Oklahoma, LLC
217   C011644          Pioneer Power and Light Company
218   C011745                     PJM Settlement, Inc.
219   C011785  Mountrail-Williams Electric Cooperative
```

I ran some simple `difflib.get_close_matches()` between these and the respondent names in the DBF data to get this list of potential matches. I used the `identification_001_duration` and `f1_respondent_id` tables in the XBRL and DBF data, respectively.

```
  entity_id                                xbrl_name                                dbf_name  respondent_id
0   C004679                 Transource Oklahoma, LLC                Transource Maryland, LLC            521
1   C004679                 Transource Oklahoma, LLC            Transource Pennsylvania, LLC            520
2   C004679                 Transource Oklahoma, LLC                Transource Missouri, LLC            453
3   C011644          Pioneer Power and Light Company         Pioneer Power and Light Company            140
4   C011644          Pioneer Power and Light Company       Wisconsin Power and Light Company            194
5   C011644          Pioneer Power and Light Company      Interstate Power and Light Company            281
6   C011745                     PJM Settlement, Inc.                    PJM Settlement, Inc.            447
7   C011785  Mountrail-Williams Electric Cooperative         Lyon Rural Electric Cooperative            412
8   C011785  Mountrail-Williams Electric Cooperative  Southern Maryland Electric Cooperative            296
9   C011785  Mountrail-Williams Electric Cooperative             Oconto Electric Cooperative            205
```
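For reference, the matching above is plain stdlib fuzzy matching. A minimal sketch with a toy subset of the names (the real run used the full `identification_001_duration` and `f1_respondent_id` tables):

```python
import difflib

# A few of the unmapped XBRL respondents from the table above.
xbrl_names = {
    "C011644": "Pioneer Power and Light Company",
    "C004679": "Transource Oklahoma, LLC",
}
# Toy subset of DBF respondent names, keyed by respondent_id.
dbf_names = {
    140: "Pioneer Power and Light Company",
    194: "Wisconsin Power and Light Company",
    453: "Transource Missouri, LLC",
}

# n=3 candidates per name; cutoff=0.6 is difflib's default similarity floor.
candidates = {
    entity_id: difflib.get_close_matches(
        name, list(dbf_names.values()), n=3, cutoff=0.6
    )
    for entity_id, name in xbrl_names.items()
}
```

`get_close_matches` returns candidates sorted best-first, so an exact name match like Pioneer's always lands at the top of its list.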


My questions to those who are more experienced in the mapping arena (@aesharpe ? @zaneselvans ? @katie-lamb ? ) are:

  1. Is it cool to remap DBF 140 to XBRL C011644 instead of XBRL R001140 for Pioneer?
  2. Is it cool to map C000030 to 255 and C011745 to 447 so PJM Interconnection and PJM Settlement are both mapped to themselves across the XBRL/DBF barrier?
  3. Is there anywhere else I should be looking for DBF IDs for Transource Oklahoma and Mountrail-Williams?
jdangerx commented 9 months ago

Well the good news is, I put in my draft remapping of XBRL -> DBF utility IDs and now more stuff builds...

```diff
diff --git a/src/pudl/package_data/glue/utility_id_ferc1.csv b/src/pudl/package_data/glue/utility_id_ferc1.csv
index cde4d3f71..a0623757c 100644
--- a/src/pudl/package_data/glue/utility_id_ferc1.csv
+++ b/src/pudl/package_data/glue/utility_id_ferc1.csv
@@ -151,7 +151,7 @@ utility_id_ferc1,utility_id_ferc1_xbrl,utility_id_ferc1_dbf
 150,,1001
 151,,1003
 152,C000029,231
-153,C000030,447
+153,C000030,255
 154,C000038,250
 155,C000041,161
 156,C000045,294
@@ -402,7 +402,7 @@ utility_id_ferc1,utility_id_ferc1_xbrl,utility_id_ferc1_dbf
 408,R001052,52
 409,R001072,72
 411,R001087,87
-414,R001140,140
+414,C011644,140
 419,R001268,268
 420,R001298,298
 421,R001301,301
@@ -412,10 +412,12 @@ utility_id_ferc1,utility_id_ferc1_xbrl,utility_id_ferc1_dbf
 428,R001419,419
 429,R001422,422
 436,R001445,445
-437,,255
 439,R001515,515
 440,R001520,520
 441,R001521,521
 442,,278
 443,C010845,
 444,C011304,
+445,C011745,447
+446,C004679,
+447,C011785,
\ No newline at end of file
```
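As a sanity check on hand edits like this, it's easy to verify that no XBRL or DBF ID ends up mapped to two different PUDL utility IDs (toy rows below, standing in for the real `utility_id_ferc1.csv`):

```python
import csv
import io
from collections import Counter

# Toy stand-in for src/pudl/package_data/glue/utility_id_ferc1.csv after the
# remapping above; the real file has many more rows.
glue_csv = """utility_id_ferc1,utility_id_ferc1_xbrl,utility_id_ferc1_dbf
153,C000030,255
414,C011644,140
445,C011745,447
446,C004679,
447,C011785,
"""

rows = list(csv.DictReader(io.StringIO(glue_csv)))
# Count only non-empty IDs; blanks mean "no mapping on that side".
xbrl_ids = Counter(r["utility_id_ferc1_xbrl"] for r in rows if r["utility_id_ferc1_xbrl"])
dbf_ids = Counter(r["utility_id_ferc1_dbf"] for r in rows if r["utility_id_ferc1_dbf"])

# Each XBRL and DBF ID should map to exactly one PUDL FERC1 utility ID.
assert max(xbrl_ids.values()) == 1
assert max(dbf_ids.values()) == 1
```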

Now I only get two failures in the core asset group:

`fuel_ferc1`:

```
ValueError: fuel_ferc1 (xbrl): Found null primary key values.
report_year              False
utility_id_ferc1_xbrl    False
fuel_type_code_pudl       True
plant_name_ferc1         False
dtype: bool
```

And `retained_earnings_ferc1`:

```
  ...
  File "/Users/dazhong-catalyst/work/pudl/src/pudl/transform/ferc1.py", line 4680, in add_previous_year_factoid
    [missing_year] = [
    ^^^^^^^^^^^^^^
ValueError: not enough values to unpack (expected 1, got 0)
```
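The unpack failure is reproducible in isolation: `[missing_year] = [...]` assumes the list comprehension yields exactly one year, and raises when it finds none. A sketch with hypothetical year sets:

```python
# With complete 2021-2022 data, nothing is "missing", so the comprehension
# comes back empty and the destructuring assignment blows up.
facts_years = {2021, 2022}
expected_years = {2021, 2022}
missing_years = [y for y in expected_years if y not in facts_years]

try:
    [missing_year] = missing_years  # raises for 0 (or 2+) matches
except ValueError as err:
    message = str(err)
```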
e-belfer commented 9 months ago

@jdangerx see my PR re: the fuel_type_code null value failure, it's a simple fix: https://github.com/catalyst-cooperative/pudl/pull/2881

jdangerx commented 8 months ago

OK, some notes on the validation tests:

**Current status**

### fbp_ferc1
Found 26112 rows, expected 25406. Off by 2.779%, allowed margin of 0.000%. Table name: `denorm_fuel_by_plant_ferc1`.
* Total 706 new rows:
  * 689 new rows in 2022, 17 new rows from 2021 - looks like Duke Energy Carolinas + Entergy New Orleans had some updates
* 2 rows updated: 1 from Duke Energy Carolinas + 1 from MS Power Company
* most have fuel_cost and fuel_mmbtu; primary fraction, gas fraction
* some have oil/coal fractions
* few have nuclear and waste fractions
* the gas_fraction_mmbtu distributions look similar between new data & all old data

### fuel_ferc1
Found 49932 rows, expected 48815. Off by 2.288%, allowed margin of 0.000%. Table name: `denorm_fuel_ferc1`.
* 1117 new rows, nothing dropped from old tables, no updates
* report year all 2022
* looks like fuel cost per mmbtu has similar shapes to *all* years, but generally things are more expensive, which seems reasonable

### plant_in_service_ferc1
Found 335365 rows, expected 315112. Off by 6.427%, allowed margin of 0.000%. Table name: `denorm_plant_in_service_ferc1`.
* 20,937 new rows
  * almost all from 2022
* 8,649 rows that changed
  * all from 2021
  * these are rows where some 2021 values became null when reported with the 2022 data
  * there is at least one instance where Vermont Electric *Power* Company reported a value for 2021 in the 2021 filing that then shows up in VETCO's 2021 value in the 2022 filing - these seem like corrections
* 684 rows dropped
  * mostly (~80/year) from 1998-2003, but a small amount each other year
  * this is due to some utility ID remapping to differentiate PJM Settlement, Inc. from PJM Interconnection, LLC

### plants_all_ferc1
Found 56312 rows, expected 54415. Off by 3.486%, allowed margin of 0.000%. Table name: `denorm_plants_all_ferc1`.
* 2028 new rows from 2022
* 131 rows dropped
* 21456 rows updated
  * about 750 from every year, and 900 from 2021
  * seems like most of them are getting values that were NA before... but from where?

### plants_hydro_ferc1
Found 6979 rows, expected 6798. Off by 2.663%, allowed margin of 0.000%. Table name: `denorm_plants_hydro_ferc1`.
* 183 new rows, all from 2022, with lots of non-null values
* 2 dropped rows from 2021 - but they look like weird fake plants:
  * `Massachusetts Electric Company/"n/a"` where all values are NaN or None except pudl plant ID
  * `ALABAMA POWER COMPANY/"rain gauge"` where all values are NaN or None except pudl plant ID, capex_per_mw = 0.0, opex_per_mwh = 0.0, project_num = 0.0

### plants_pumped_storage_ferc1
Found 562 rows, expected 544. Off by 3.309%, allowed margin of 0.000%. Table name: `denorm_plants_pumped_storage_ferc1`.
* 18 new entries which are mostly non-NA, from 2022
* no dropped/updated entries

### plants_small_ferc1
Found 16988 rows, expected 16248. Off by 4.554%, allowed margin of 0.000%. Table name: `denorm_plants_small_ferc1`.
* 754 new entries - 752 from 2022, 2 from 2021
* non-null counts seem reasonable; not a lot of `license_id_ferc1` values, but that seems par for the course
* 14 dropped entries
  * seem like values that were erased in the 2021-2022 version of the 2021 data
* 141 changed entries
  * there is one value here that seems like it was updated in the DBF data?
  * the rest look like updates from 2021 data that is being reinterpreted via the new dedupe structure

### plants_steam_ferc1
Found 31783 rows, expected 30825. Off by 3.108%, allowed margin of 0.000%. Table name: `denorm_plants_steam_ferc1`.
* 1073 new rows - 1070 from 2022, 3 from 2021
* 115 dropped rows, all from 2021
  * looks like maybe the `ReportYear` fact isn't reported for all 2021 data in the 2022 filings? e.g. for Southern California Edison, `cr-00603` is a context that refers to 2021 data, but it doesn't have a `ReportYear` fact - only `cr-00604` has a `ReportYear` fact (which is 2022, of course)
* 21315 changed rows
  * follows the same pattern as plants_all_ferc1 - ~750/year and about 900 for 2021. I guess this is the source of that discrepancy.

### purchased_power_ferc1
Found 204720 rows, expected 197947. Off by 3.422%, allowed margin of 0.000%. Table name: `denorm_purchased_power_ferc1`.
* 7122 new rows, of which 7052 are from 2022 and 70 are from 2021
* 340 dropped rows, all from 2021
* 0 updated rows

**Next steps**

These are the two main things that I'm seeing:
jdangerx commented 8 months ago

There are 26,113 individual facts which changed in the plants_steam table - of which 21,065 are changes where the plant_id_ferc1 changed number between the dev build and the build on this branch.

Of the remaining changes, we see a lot of "DBF data for a specific plant went from non-NA to NA," and a lot of "XBRL data for 2021 went from NA to not-NA." So it seems like we got more valid data for 2021, which is suspicious due to the aforementioned report_year bug. And we lost a bunch of data from before 2021! Which is bad!

I have a few questions:

  1. Why did the plant_id_ferc1s change so much?
  2. Why did we lose so much data from the DBF years? Is it related to the plant ID changes? If not, what else did I change that could have messed up this process?

I think the next steps are still:

  1. fix the report_year bug, so we can better understand how the 2021 data has changed without that confounding factor
  2. continue investigating the issues with the pre-2021 years - how does plant_id_ferc1 get assigned? why would it change? would that impact downstream data somehow?
zaneselvans commented 8 months ago

plant_id_ferc1 is algorithmically assigned based on the cosine similarity between records, and the numbers themselves aren't expected to be stable if the data changes, so this by itself isn't surprising or concerning. The addition of a new year of data would be expected to change the clustering of records that end up being associated with each other and given the same ID.

If there are changes to the data, that could also be affecting the IDs that get assigned, if any of the columns used to calculate the cosine similarity are changing. This is the matching problem that @zschira is about to swap in a much faster algorithm for (though I think it should produce the same results, since it's just a faster way to calculate only the top-K cosine similarities rather than the entire gigantic matrix).
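For intuition, here is a toy cosine similarity over character trigrams (not PUDL's actual `classify_plants_ferc1` featurization, just an illustration of why near-identical records cluster together):

```python
import math
from collections import Counter

def char_ngrams(s: str, n: int = 3) -> Counter:
    """Bag of character trigrams, a common featurization for name matching."""
    s = s.lower()
    return Counter(s[i : i + n] for i in range(len(s) - n + 1))

def cosine_similarity(a: str, b: str) -> float:
    va, vb = char_ngrams(a), char_ngrams(b)
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Records whose similarity clears some threshold get clustered together and
# share an ID; small changes in the inputs can reshuffle the clusters.
score = cosine_similarity("Comanche Peak Unit 1", "Comanche Peak Unit 2")
```

Because the IDs fall out of the clustering rather than the data itself, adding a year of records can legitimately renumber clusters even when nothing else changed.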

Is there any pattern to the DBF data that's getting nullified?

zaneselvans commented 8 months ago

The code that does the FERC 1 plant ID assignment is in `pudl.analysis.classify_plants_ferc1`.

jdangerx commented 8 months ago

Woohoo! Turns out that if you ignore the plant_id_ferc1, the big data discrepancies in the plants_steam/all tables just go away!

There's still some shuffling around in the 2021 data but I think that is related to #2947 - I'll fix up the PR and re-run ETL to see if that fixes things.

jdangerx commented 8 months ago

After applying the fix from #2947 , all the ferc1 tables have only the following sorts of changes:

So I updated the minmax row counts in validate/ferc1_test.py.

There is still a validation test failure in validate/plants_steam_ferc1_test.py:

```
FAILED test/validate/plants_steam_ferc1_test.py::test_vs_bounds[ferc1_annual-capacity_ratios] - ValueError: 5% quantile (0.0) is below lower bound (0.5) in validation entitled Capability R...
```

Which indicates that too many plants have a 0 capability ratio.

I think this is because the XBRL plants inherently have a much higher % of plants with a 0 capability ratio - see the table below. These 0 values are traceable directly to the raw XBRL files we're getting, so it's not a quirk of our processing.

*[Table of capability-ratio distribution plots comparing the old and new builds for 2021, 2022, all years, XBRL-only, and DBF-only data; images not preserved.]*

My pitch is to update the validation test so that it filters out the 0-values from 2021 and beyond before asserting stuff about the distribution. WDYT @zaneselvans ?

zaneselvans commented 8 months ago

@jdangerx Conceptually that sounds right, but I don't know if that's an easy thing to do within the structure of the validation tests.

jdangerx commented 8 months ago

I just updated the "query" - it was easy!

```diff
diff --git a/src/pudl/validate.py b/src/pudl/validate.py
index 3fe248f68..7e3d7f79d 100644
--- a/src/pudl/validate.py
+++ b/src/pudl/validate.py
@@ -832,9 +832,9 @@ plants_steam_ferc1_capacity_ratios = [
         "data_col": "capability_ratio",
         "weight_col": "",
     },
-    {
+    {  # XBRL data (post-2021) reports 0 capability for ~22% of plants, so we exclude.
         "title": "Capability Ratio (tails)",
-        "query": "",
+        "query": "report_year < 2021 | plant_capability_mw > 0",
         "low_q": 0.05,
         "low_bound": 0.5,
         "hi_q": 0.95,
```
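To illustrate what that one-line query change does (toy rows below; the real test runs over the full `plants_steam_ferc1` output): rows from before 2021 pass through regardless, while later rows must report nonzero capability to be included in the quantile check. The data itself is untouched.

```python
import pandas as pd

df = pd.DataFrame({
    "report_year": [2019, 2021, 2022, 2022],
    "plant_capability_mw": [0.0, 0.0, 0.0, 100.0],
    "capability_ratio": [0.0, 0.0, 0.0, 0.9],
})
# Same query string as in the diff: pre-2021 zero-capability rows survive,
# post-2020 zeros are excluded from the validation sample.
filtered = df.query("report_year < 2021 | plant_capability_mw > 0")
```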
zaneselvans commented 8 months ago

Ah okay great!

jrea-rmi commented 8 months ago

RMI would very much like to have the zero plant_capability_mw plant/generator/units included rather than dropped.

Reviewing the raw steam_electric_generating_plant_statistics_large_plants_402_duration, there are lots of zero cases that have valuable capital and operating cost data that we'd like to use.

So I suggest increasing the validation limit rather than dropping data.

e-belfer commented 8 months ago

I believe the change is just to filter what's getting validated to ignore 0 records, rather than dropping actual data in the database @jrea-rmi. Here's the relevant commit.

@zaneselvans Let me know if I'm misrepresenting this, though!

jrea-rmi commented 8 months ago

Ooh, that'd be an important difference. Thanks @e-belfer

zaneselvans commented 8 months ago

Yes, that's my understanding as well @e-belfer.