catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Integrate FERC 2021-2022 data with new extractor #2811

Closed jdangerx closed 8 months ago

jdangerx commented 10 months ago

Follow-up to #2810 :

Now that we've gotten the new, more comprehensive extractor working for 2021 data, we should try to get it working for both 2021 and 2022 data.

Note: we're starting work on this before #2810 is complete - there are still some test updates necessary there but we're working on those in parallel.

## Scope
- [x] `ferc_to_sqlite` tasks complete successfully, targeting 2021-2022 data;
- [x] `norm_ferc1` task group completes successfully with new archives of 2021-2022 data
- [x] data validation tests either pass or are updated
- [ ] https://github.com/catalyst-cooperative/ferc-xbrl-extractor/pull/151

Individual error classes below.

If you fix these errors but the assets still don't build, that's fine - we'll merge the fixes into this branch and catalog the new errors.

## Error messages
- [ ] https://github.com/catalyst-cooperative/pudl/issues/2883
- [ ] https://github.com/catalyst-cooperative/pudl/issues/2875
- [ ] https://github.com/catalyst-cooperative/pudl/issues/2876
- [ ] https://github.com/catalyst-cooperative/pudl/issues/2877
- [ ] https://github.com/catalyst-cooperative/pudl/issues/2878
- [ ] https://github.com/catalyst-cooperative/pudl/issues/2879
- [ ] https://github.com/catalyst-cooperative/pudl/issues/2897
- [ ] https://github.com/catalyst-cooperative/pudl/issues/2899
- [ ] Non-unique values in index for FERC1 assets with 2021-2022 XBRL data
- [ ] Ambiguous start/end values per year while converting instants to durations for FERC1 assets with 2021-2022 XBRL data
- [ ] Duplicate primary keys for `electric_energy_dispositions_ferc1` with 2021-2022 XBRL data
```
  File "/Users/dazhong-catalyst/work/pudl/src/pudl/transform/ferc1.py", line 2158, in merge_instant_and_duration_tables_xbrl
    out_df = pd.concat(
             ^^^^^^^^^^
pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects
```

```
  File "/Users/dazhong-catalyst/work/pudl/src/pudl/transform/ferc1.py", line 609, in unstack_balances_to_report_year_instant_xbrl
    raise AssertionError(
AssertionError: Looks like there are multiple entries per year--not sure which to use for the start/end balance. params=UnstackBalancesToReportYearInstantXbrl(unstack_balances_to_report_year=True) primary_key_cols=['report_year', 'entity_id', 'electric_plant_classification_axis', 'utility_type_axis']
```

```
  File "/Users/dazhong-catalyst/work/pudl/src/pudl/metadata/classes.py", line 1506, in enforce_schema
    raise ValueError(
ValueError: electric_energy_dispositions_ferc1 Duplicate primary keys when enforcing schema.
```
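All three of these boil down to non-unique primary keys in the 2021-2022 XBRL data. For anyone reproducing this, the core check is easy to sketch with pandas (toy data and a hypothetical frame below; this is not PUDL's actual `enforce_schema` implementation):

```python
import pandas as pd

# Toy frame with a duplicated (report_year, entity_id) primary key,
# mimicking what the 2022 XBRL data did to electric_energy_dispositions_ferc1.
df = pd.DataFrame({
    "report_year": [2022, 2022, 2021],
    "entity_id": ["C000001", "C000001", "C000001"],
    "value": [1.0, 2.0, 3.0],
})
pk_cols = ["report_year", "entity_id"]

# keep=False marks every row that shares a primary key with another row.
dupes = df[df.duplicated(subset=pk_cols, keep=False)]
if not dupes.empty:
    # PUDL raises a ValueError here; this sketch just reports the count.
    print(f"{len(dupes)} rows violate the primary key {pk_cols}")
```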
jdangerx commented 9 months ago

The good news is that xbrl2sqlite works with the 2022 data; the other news is that there are a handful of different errors to crank through:


```
  File "/Users/dazhong-catalyst/work/pudl/src/pudl/transform/ferc1.py", line 609, in unstack_balances_to_report_year_instant_xbrl
    df.loc[df.report_year == (df.year + 1), "balance_type"] = "starting_balance"
    ~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: cannot set a frame with no defined index and a scalar
```

```
AssertionError: Unexpected number of columns dropped in wide_to_tidy:
```

```
  File "/Users/dazhong-catalyst/work/pudl/src/pudl/transform/ferc1.py", line 1236, in transform_end
    raise ValueError(
ValueError: plants_hydro_ferc1: Column capex_per_mw is entirely NULL!
```

(or, you know, other null columns)

```
  File "/Users/dazhong-catalyst/work/pudl/src/pudl/transform/ferc1.py", line 287, in wide_to_tidy
    df_out = df.set_index(params.idx_cols).filter(regex=pat)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: "None of ['plant_function'] are in the columns"
```

(or other columns not in index)

```
  File "/Users/dazhong-catalyst/work/pudl/src/pudl/transform/ferc1.py", line 2899, in aggregate_duplicate_fuel_types_xbrl
    agg_row_fraction = (len(fuel_pk_dupes) + len(fuel_multi_unit)) / len(fuel_xbrl)
                       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~
ZeroDivisionError: division by zero
```
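That last one fires because `fuel_xbrl` can be empty for some table/year combinations. A guard along these lines would avoid the crash (an illustrative sketch with a hypothetical helper name, not the fix that actually landed):

```python
def safe_agg_row_fraction(n_pk_dupes: int, n_multi_unit: int, n_total: int) -> float:
    """Fraction of rows needing aggregation; 0.0 when the table is empty."""
    if n_total == 0:
        # An empty table has nothing to aggregate, so report 0.0 rather
        # than dividing by zero.
        return 0.0
    return (n_pk_dupes + n_multi_unit) / n_total
```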
jdangerx commented 9 months ago

It looks like we have some remapping to do, to get valid 2022 data into the pipeline.

For starters, we have the following 4 entity_ids in the XBRL data that don't appear in our mapping yet:

```
    entity_id                    respondent_legal_name
176   C004679                 Transource Oklahoma, LLC
217   C011644          Pioneer Power and Light Company
218   C011745                     PJM Settlement, Inc.
219   C011785  Mountrail-Williams Electric Cooperative
```

I ran some simple `difflib.get_close_matches()` between these and the respondent names in the DBF data to get this list of potential matches. I used the `identification_001_duration` and `f1_respondent_id` tables in the XBRL and DBF data, respectively.

```
  entity_id                                xbrl_name                                dbf_name  respondent_id
0   C004679                 Transource Oklahoma, LLC                Transource Maryland, LLC            521
1   C004679                 Transource Oklahoma, LLC            Transource Pennsylvania, LLC            520
2   C004679                 Transource Oklahoma, LLC                Transource Missouri, LLC            453
3   C011644          Pioneer Power and Light Company         Pioneer Power and Light Company            140
4   C011644          Pioneer Power and Light Company       Wisconsin Power and Light Company            194
5   C011644          Pioneer Power and Light Company      Interstate Power and Light Company            281
6   C011745                     PJM Settlement, Inc.                    PJM Settlement, Inc.            447
7   C011785  Mountrail-Williams Electric Cooperative         Lyon Rural Electric Cooperative            412
8   C011785  Mountrail-Williams Electric Cooperative  Southern Maryland Electric Cooperative            296
9   C011785  Mountrail-Williams Electric Cooperative             Oconto Electric Cooperative            205
```
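For reference, the matching above is plain stdlib fuzzy matching. A minimal sketch with a toy subset of the names (the real run used the full `identification_001_duration` and `f1_respondent_id` tables):

```python
import difflib

# A few of the unmapped XBRL respondents from the table above.
xbrl_names = {
    "C011644": "Pioneer Power and Light Company",
    "C004679": "Transource Oklahoma, LLC",
}
# Toy subset of DBF respondent names, keyed by respondent_id.
dbf_names = {
    140: "Pioneer Power and Light Company",
    194: "Wisconsin Power and Light Company",
    453: "Transource Missouri, LLC",
}

# n=3 candidates per name; cutoff=0.6 is difflib's default similarity floor.
candidates = {
    entity_id: difflib.get_close_matches(
        name, list(dbf_names.values()), n=3, cutoff=0.6
    )
    for entity_id, name in xbrl_names.items()
}
```

`get_close_matches` returns candidates sorted best-first, so an exact name match like Pioneer's always lands at the top of its list.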


My questions to those who are more experienced in the mapping arena (@aesharpe ? @zaneselvans ? @katie-lamb ? ) are:

  1. Is it cool to remap DBF 140 to XBRL C011644 instead of XBRL R001140 for Pioneer?
  2. Is it cool to map C000030 to 255 and C011745 to 447 so PJM Interconnection and PJM Settlement are both mapped to themselves across the XBRL/DBF barrier?
  3. Is there anywhere else I should be looking for DBF IDs for Transource Oklahoma and Mountrail-Williams?
jdangerx commented 9 months ago

Well the good news is, I put in my draft remapping of XBRL -> DBF utility IDs and now more stuff builds...

```diff
diff --git a/src/pudl/package_data/glue/utility_id_ferc1.csv b/src/pudl/package_data/glue/utility_id_ferc1.csv
index cde4d3f71..a0623757c 100644
--- a/src/pudl/package_data/glue/utility_id_ferc1.csv
+++ b/src/pudl/package_data/glue/utility_id_ferc1.csv
@@ -151,7 +151,7 @@ utility_id_ferc1,utility_id_ferc1_xbrl,utility_id_ferc1_dbf
 150,,1001
 151,,1003
 152,C000029,231
-153,C000030,447
+153,C000030,255
 154,C000038,250
 155,C000041,161
 156,C000045,294
@@ -402,7 +402,7 @@ utility_id_ferc1,utility_id_ferc1_xbrl,utility_id_ferc1_dbf
 408,R001052,52
 409,R001072,72
 411,R001087,87
-414,R001140,140
+414,C011644,140
 419,R001268,268
 420,R001298,298
 421,R001301,301
@@ -412,10 +412,12 @@ utility_id_ferc1,utility_id_ferc1_xbrl,utility_id_ferc1_dbf
 428,R001419,419
 429,R001422,422
 436,R001445,445
-437,,255
 439,R001515,515
 440,R001520,520
 441,R001521,521
 442,,278
 443,C010845,
 444,C011304,
+445,C011745,447
+446,C004679,
+447,C011785,
\ No newline at end of file
```
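As a sanity check on hand edits like this, it's easy to verify that no XBRL or DBF ID ends up mapped to two different PUDL utility IDs (toy rows below, standing in for the real `utility_id_ferc1.csv`):

```python
import csv
import io
from collections import Counter

# Toy stand-in for src/pudl/package_data/glue/utility_id_ferc1.csv after the
# remapping above; the real file has many more rows.
glue_csv = """utility_id_ferc1,utility_id_ferc1_xbrl,utility_id_ferc1_dbf
153,C000030,255
414,C011644,140
445,C011745,447
446,C004679,
447,C011785,
"""

rows = list(csv.DictReader(io.StringIO(glue_csv)))
# Count only non-empty IDs; blanks mean "no mapping on that side".
xbrl_ids = Counter(r["utility_id_ferc1_xbrl"] for r in rows if r["utility_id_ferc1_xbrl"])
dbf_ids = Counter(r["utility_id_ferc1_dbf"] for r in rows if r["utility_id_ferc1_dbf"])

# Each XBRL and DBF ID should map to exactly one PUDL FERC1 utility ID.
assert max(xbrl_ids.values()) == 1
assert max(dbf_ids.values()) == 1
```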

Now I only get two failures in the core asset group:

`fuel_ferc1`:

```
ValueError: fuel_ferc1 (xbrl): Found null primary key values.
report_year              False
utility_id_ferc1_xbrl    False
fuel_type_code_pudl       True
plant_name_ferc1         False
dtype: bool
```

And `retained_earnings_ferc1`:

```
  ...
  File "/Users/dazhong-catalyst/work/pudl/src/pudl/transform/ferc1.py", line 4680, in add_previous_year_factoid
    [missing_year] = [
    ^^^^^^^^^^^^^^
ValueError: not enough values to unpack (expected 1, got 0)
```
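The unpack failure is reproducible in isolation: `[missing_year] = [...]` assumes the list comprehension yields exactly one year, and raises when it finds none. A sketch with hypothetical year sets:

```python
# With complete 2021-2022 data, nothing is "missing", so the comprehension
# comes back empty and the destructuring assignment blows up.
facts_years = {2021, 2022}
expected_years = {2021, 2022}
missing_years = [y for y in expected_years if y not in facts_years]

try:
    [missing_year] = missing_years  # raises for 0 (or 2+) matches
except ValueError as err:
    message = str(err)
```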
e-belfer commented 9 months ago

@jdangerx see my PR re: the fuel_type_code null value failure, it's a simple fix: https://github.com/catalyst-cooperative/pudl/pull/2881

jdangerx commented 8 months ago

OK, some notes on the validation tests:

**Current status**

### fbp_ferc1
Found 26112 rows, expected 25406. Off by 2.779%, allowed margin of 0.000%. Table name: `denorm_fuel_by_plant_ferc1`.
* Total 706 new rows:
  * 689 new rows in 2022, 17 new rows from 2021 - looks like Duke Energy Carolinas + Entergy New Orleans had some updates
* 2 rows updated: 1 from Duke Energy Carolinas + 1 from MS Power Company
* most have fuel_cost and fuel_mmbtu; primary fraction, gas fraction
* some have oil/coal fractions
* few have nuclear and waste fractions
* the gas_fraction_mmbtu distributions look similar between new data & all old data

### fuel_ferc1
Found 49932 rows, expected 48815. Off by 2.288%, allowed margin of 0.000%. Table name: `denorm_fuel_ferc1`.
* 1117 new rows, nothing dropped from old tables, no updates
* report year all 2022
* looks like fuel cost per mmbtu has similar shapes to *all* years, but generally things are more expensive, which seems reasonable

### plant_in_service_ferc1
Found 335365 rows, expected 315112. Off by 6.427%, allowed margin of 0.000%. Table name: `denorm_plant_in_service_ferc1`.
* 20,937 new rows
  * almost all from 2022
* 8,649 rows that changed
  * all from 2021
  * these are rows where some 2021 values became null when reported with the 2022 data
  * there is at least one instance where Vermont Electric *Power* Company reported a value for 2021 in the 2021 filing that then shows up in VETCO's 2021 value in the 2022 filing - these seem like corrections
* 684 rows dropped
  * mostly (~80/year) from 1998-2003, but a small amount each other year
  * this is due to some utility ID remapping to differentiate PJM Settlement, Inc. from PJM Interconnection, LLC

### plants_all_ferc1
Found 56312 rows, expected 54415. Off by 3.486%, allowed margin of 0.000%. Table name: `denorm_plants_all_ferc1`.
* 2028 new rows from 2022
* 131 rows dropped
* 21456 rows updated
  * about 750 from every year, and 900 from 2021
  * seems like most of them are getting values that were NA before... but from where?

### plants_hydro_ferc1
Found 6979 rows, expected 6798. Off by 2.663%, allowed margin of 0.000%. Table name: `denorm_plants_hydro_ferc1`.
* 183 new rows, all from 2022, with lots of non-null values
* 2 dropped rows from 2021 - but they look like weird fake plants:
  * `Massachusetts Electric Company/"n/a"` where all values are NaN or None except pudl plant ID
  * `ALABAMA POWER COMPANY/"rain gauge"` where all values are NaN or None except pudl plant ID, capex_per_mw = 0.0, opex_per_mwh = 0.0, project_num = 0.0

### plants_pumped_storage_ferc1
Found 562 rows, expected 544. Off by 3.309%, allowed margin of 0.000%. Table name: `denorm_plants_pumped_storage_ferc1`.
* 18 new entries which are mostly non-NA, from 2022
* no dropped/updated entries

### plants_small_ferc1
Found 16988 rows, expected 16248. Off by 4.554%, allowed margin of 0.000%. Table name: `denorm_plants_small_ferc1`.
* 754 new entries - 752 from 2022, 2 from 2021
* non-null counts seem reasonable; not a lot of `license_id_ferc1` values, but that seems par for the course
* 14 dropped entries
  * seem like values that were erased in the 2021-2022 version of the 2021 data
* 141 changed entries
  * there is one value here that seems like it was updated in the DBF data?
  * the rest look like updates from 2021 data that is being reinterpreted via the new dedupe structure

### plants_steam_ferc1
Found 31783 rows, expected 30825. Off by 3.108%, allowed margin of 0.000%. Table name: `denorm_plants_steam_ferc1`.
* 1073 new rows - 1070 from 2022, 3 from 2021
* 115 dropped rows, all from 2021
  * looks like maybe the `ReportYear` fact isn't reported for all 2021 data in the 2022 filings? e.g. for Southern California Edison, `cr-00603` is a context that refers to 2021 data, but it doesn't have a `ReportYear` fact - only `cr-00604` has a `ReportYear` fact (which is 2022, of course)
* 21315 changed rows
  * follows the same pattern as plants_all_ferc1 - ~750/year and about 900 for 2021. I guess this is the source of that discrepancy.

### purchased_power_ferc1
Found 204720 rows, expected 197947. Off by 3.422%, allowed margin of 0.000%. Table name: `denorm_purchased_power_ferc1`.
* 7122 new rows, of which 7052 are from 2022 and 70 are from 2021
* 340 dropped rows, all from 2021
* 0 updated rows

**Next steps**

These are the two main things that I'm seeing:
jdangerx commented 8 months ago

There are 26,113 individual facts which changed in the plants_steam table - of which 21,065 are changes where the plant_id_ferc1 changed number between the dev build and the build on this branch.

Of the remaining changes, we see a lot of "DBF data for a specific plant went from non-NA to NA," and a lot of "XBRL data for 2021 went from NA to not-NA." So it seems like we got more valid data for 2021, which is suspicious due to the aforementioned report_year bug. And we lost a bunch of data from before 2021! Which is bad!

I have a few questions:

  1. Why did the plant_id_ferc1s change so much?
  2. Why did we lose so much data from the DBF years? Is it related to the plant ID changes? If not, what else did I change that could have messed up this process?

I think the next steps are still:

  1. fix the report_year bug, so we can better understand how the 2021 data has changed without that confounding factor
  2. continue investigating the issues with the pre-2021 years - how does plant_id_ferc1 get assigned? why would it change? would that impact downstream data somehow?
zaneselvans commented 8 months ago

plant_id_ferc1 is algorithmically assigned based on the cosine similarity between records, and the numbers themselves aren't expected to be stable if the data changes, so this by itself isn't surprising or concerning. The addition of a new year of data would be expected to change the clustering of records that end up being associated with each other and given the same ID.

If there are changes to the data, that could also be affecting the IDs that get assigned, if any of the columns used to calculate the cosine similarity are changing. This is the matching problem that @zschira is about to swap in a much faster algorithm for (though I think it should produce the same results, since it's just a faster way to calculate only the top-K cosine similarities rather than the entire gigantic matrix).
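For intuition, here is a toy cosine similarity over character trigrams (not PUDL's actual `classify_plants_ferc1` featurization, just an illustration of why near-identical records cluster together):

```python
import math
from collections import Counter

def char_ngrams(s: str, n: int = 3) -> Counter:
    """Bag of character trigrams, a common featurization for name matching."""
    s = s.lower()
    return Counter(s[i : i + n] for i in range(len(s) - n + 1))

def cosine_similarity(a: str, b: str) -> float:
    va, vb = char_ngrams(a), char_ngrams(b)
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Records whose similarity clears some threshold get clustered together and
# share an ID; small changes in the inputs can reshuffle the clusters.
score = cosine_similarity("Comanche Peak Unit 1", "Comanche Peak Unit 2")
```

Because the IDs fall out of the clustering rather than the data itself, adding a year of records can legitimately renumber clusters even when nothing else changed.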

Is there any pattern to the DBF data that's getting nullified?

zaneselvans commented 8 months ago

The code that does the FERC 1 plant ID assignment is in `pudl.analysis.classify_plants_ferc1`.

jdangerx commented 8 months ago

Woohoo! Turns out that if you ignore the plant_id_ferc1, the big data discrepancies in the plants_steam/all tables just go away!

There's still some shuffling around in the 2021 data but I think that is related to #2947 - I'll fix up the PR and re-run ETL to see if that fixes things.

jdangerx commented 8 months ago

After applying the fix from #2947 , all the ferc1 tables have only the following sorts of changes:

So I updated the minmax row counts in validate/ferc1_test.py.

There is still a validation test failure in validate/plants_steam_ferc1_test.py:

```
FAILED test/validate/plants_steam_ferc1_test.py::test_vs_bounds[ferc1_annual-capacity_ratios] - ValueError: 5% quantile (0.0) is below lower bound (0.5) in validation entitled Capability R...
```

Which indicates that too many plants have a 0 capability ratio.

I think this is because the XBRL plants inherently have a much higher % of plants with a 0 capability ratio - see the table below. These 0 values are traceable directly to the raw XBRL files we're getting, so it's not a quirk of our processing.

*[Table of capability-ratio distribution plots comparing the old and new builds for 2021, 2022, all years, XBRL-only, and DBF-only data; images not preserved.]*

My pitch is to update the validation test so that it filters out the 0-values from 2021 and beyond before asserting stuff about the distribution. WDYT @zaneselvans ?

zaneselvans commented 8 months ago

@jdangerx Conceptually that sounds right, but I don't know if that's an easy thing to do within the structure of the validation tests.

jdangerx commented 8 months ago

I just updated the "query" - it was easy!

```diff
diff --git a/src/pudl/validate.py b/src/pudl/validate.py
index 3fe248f68..7e3d7f79d 100644
--- a/src/pudl/validate.py
+++ b/src/pudl/validate.py
@@ -832,9 +832,9 @@ plants_steam_ferc1_capacity_ratios = [
         "data_col": "capability_ratio",
         "weight_col": "",
     },
-    {
+    {  # XBRL data (post-2021) reports 0 capability for ~22% of plants, so we exclude.
         "title": "Capability Ratio (tails)",
-        "query": "",
+        "query": "report_year < 2021 | plant_capability_mw > 0",
         "low_q": 0.05,
         "low_bound": 0.5,
         "hi_q": 0.95,
```
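To illustrate what that one-line query change does (toy rows below; the real test runs over the full `plants_steam_ferc1` output): rows from before 2021 pass through regardless, while later rows must report nonzero capability to be included in the quantile check. The data itself is untouched.

```python
import pandas as pd

df = pd.DataFrame({
    "report_year": [2019, 2021, 2022, 2022],
    "plant_capability_mw": [0.0, 0.0, 0.0, 100.0],
    "capability_ratio": [0.0, 0.0, 0.0, 0.9],
})
# Same query string as in the diff: pre-2021 zero-capability rows survive,
# post-2020 zeros are excluded from the validation sample.
filtered = df.query("report_year < 2021 | plant_capability_mw > 0")
```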
zaneselvans commented 8 months ago

Ah okay great!

jrea-rmi commented 8 months ago

RMI would very much like to have the zero plant_capability_mw plant/generator/units included rather than dropped.

Reviewing the raw steam_electric_generating_plant_statistics_large_plants_402_duration, there are lots of zero cases that have valuable capital and operating cost data that we'd like to use.

So I suggest increasing the validation limit rather than dropping data.

e-belfer commented 8 months ago

I believe the change is just to filter what's getting validated to ignore 0 records, rather than dropping actual data in the database @jrea-rmi. Here's the relevant commit.

@zaneselvans Let me know if I'm misrepresenting this, though!

jrea-rmi commented 8 months ago

Ooh, that'd be an important difference. Thanks @e-belfer

zaneselvans commented 8 months ago

Yes, that's my understanding as well @e-belfer.