Closed by jdangerx 8 months ago.
The good news is that `xbrl2sqlite` works with the 2022 data; the other news is that there are a handful of different errors to crank through:
```
  File "/Users/dazhong-catalyst/work/pudl/src/pudl/transform/ferc1.py", line 609, in unstack_balances_to_report_year_instant_xbrl
    df.loc[df.report_year == (df.year + 1), "balance_type"] = "starting_balance"
    ~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: cannot set a frame with no defined index and a scalar
```
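One way to sidestep that `ValueError` is to guard the `.loc` assignment against a completely empty frame, since pandas refuses to assign a scalar through `.loc` when the frame has no defined index. This is only a hypothetical sketch; I haven't verified that an empty frame is the actual trigger in `ferc1.py`, and `label_starting_balances` is a made-up name:

```python
import pandas as pd

def label_starting_balances(df: pd.DataFrame) -> pd.DataFrame:
    """Mark rows whose 'year' is one less than 'report_year' as starting balances.

    The empty-frame guard avoids pandas' "cannot set a frame with no defined
    index and a scalar" ValueError on the .loc assignment below.
    """
    if df.empty:
        return df
    df = df.copy()
    df.loc[df.report_year == (df.year + 1), "balance_type"] = "starting_balance"
    return df

populated = pd.DataFrame({"report_year": [2022, 2022], "year": [2021, 2022]})
empty = pd.DataFrame(columns=["report_year", "year"])
print(label_starting_balances(populated))
print(len(label_starting_balances(empty)))  # 0
```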
An `AssertionError` in `wide_to_tidy`:

```
AssertionError: Unexpected number of columns dropped
```
```
  File "/Users/dazhong-catalyst/work/pudl/src/pudl/transform/ferc1.py", line 1236, in transform_end
    raise ValueError(
ValueError: plants_hydro_ferc1: Column capex_per_mw is entirely NULL!
```
(or, you know, other null columns)
```
  File "/Users/dazhong-catalyst/work/pudl/src/pudl/transform/ferc1.py", line 287, in wide_to_tidy
    df_out = df.set_index(params.idx_cols).filter(regex=pat)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: "None of ['plant_function'] are in the columns"
```
(or other columns not in index)
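For illustration, here is a hypothetical minimal repro of that `KeyError`, plus a defensive check that surfaces a clearer message. The column names and `idx_cols` list are stand-ins, not the real `params.idx_cols`:

```python
import pandas as pd

# Hypothetical repro: set_index raises a KeyError when any index column
# (here "plant_function") is absent from the frame.
df = pd.DataFrame({"utility_id": [1, 2], "capex_total": [10.0, 20.0]})
idx_cols = ["utility_id", "plant_function"]  # stand-in for params.idx_cols

missing = [c for c in idx_cols if c not in df.columns]
if missing:
    print(f"cannot set index, missing columns: {missing}")
else:
    tidy = df.set_index(idx_cols).filter(regex="capex")
```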
```
  File "/Users/dazhong-catalyst/work/pudl/src/pudl/transform/ferc1.py", line 2899, in aggregate_duplicate_fuel_types_xbrl
    agg_row_fraction = (len(fuel_pk_dupes) + len(fuel_multi_unit)) / len(fuel_xbrl)
                       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~
ZeroDivisionError: division by zero
```
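That last one comes from `len(fuel_xbrl)` being zero. A sketch of the obvious guard (the function name and the choice of returning 0.0 for an empty table are my assumptions, not the actual fix):

```python
def agg_row_fraction(n_pk_dupes: int, n_multi_unit: int, n_total: int) -> float:
    """Fraction of fuel rows needing aggregation; 0.0 for an empty table.

    Guards the len(fuel_xbrl) == 0 case that raised the ZeroDivisionError.
    """
    if n_total == 0:
        return 0.0
    return (n_pk_dupes + n_multi_unit) / n_total

print(agg_row_fraction(5, 5, 100))  # 0.1
print(agg_row_fraction(0, 0, 0))    # 0.0
```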
It looks like we have some remapping to do to get valid 2022 data into the pipeline. For starters, we have the following 4 `entity_id`s in the XBRL data that don't appear in our mapping yet:
|   | entity_id | respondent_legal_name |
|---|---|---|
| 176 | C004679 | Transource Oklahoma, LLC |
| 217 | C011644 | Pioneer Power and Light Company |
| 218 | C011745 | PJM Settlement, Inc. |
| 219 | C011785 | Mountrail-Williams Electric Cooperative |
I ran some simple `difflib.get_close_matches()` between these and the `respondent_name`s in the DBF data to get this list of potential matches. I used the `identification_001_duration` and `f1_respondent_id` tables in the XBRL and DBF data, respectively.
|   | entity_id | xbrl_name | dbf_name | respondent_id |
|---|---|---|---|---|
| 0 | C004679 | Transource Oklahoma, LLC | Transource Maryland, LLC | 521 |
| 1 | C004679 | Transource Oklahoma, LLC | Transource Pennsylvania, LLC | 520 |
| 2 | C004679 | Transource Oklahoma, LLC | Transource Missouri, LLC | 453 |
| 3 | C011644 | Pioneer Power and Light Company | Pioneer Power and Light Company | 140 |
| 4 | C011644 | Pioneer Power and Light Company | Wisconsin Power and Light Company | 194 |
| 5 | C011644 | Pioneer Power and Light Company | Interstate Power and Light Company | 281 |
| 6 | C011745 | PJM Settlement, Inc. | PJM Settlement, Inc. | 447 |
| 7 | C011785 | Mountrail-Williams Electric Cooperative | Lyon Rural Electric Cooperative | 412 |
| 8 | C011785 | Mountrail-Williams Electric Cooperative | Southern Maryland Electric Cooperative | 296 |
| 9 | C011785 | Mountrail-Williams Electric Cooperative | Oconto Electric Cooperative | 205 |
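For reference, the matching above can be sketched like this; the candidate lists are abbreviated from the tables, and `n=3` plus the default 0.6 similarity cutoff are the standard `difflib.get_close_matches` parameters:

```python
import difflib

# Abbreviated name lists from the XBRL and DBF respondent tables above.
xbrl_names = ["Transource Oklahoma, LLC", "PJM Settlement, Inc."]
dbf_names = [
    "Transource Maryland, LLC",
    "Transource Missouri, LLC",
    "PJM Settlement, Inc.",
    "Oconto Electric Cooperative",
]

for name in xbrl_names:
    # Returns up to n candidates scoring above the 0.6 default cutoff,
    # best match first.
    print(name, "->", difflib.get_close_matches(name, dbf_names, n=3))
```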
Case-by-case:

- `C004679`: it looks like `get_close_matches` picked up on "Transource" being the most relevant token here, and there is indeed no DBF respondent that matches Transource Oklahoma. We should add this to `utility_id_ferc1.csv` with no corresponding DBF ID.
- `C011644`: there's an exact match, with DBF respondent ID 140! Hooray! However, DBF respondent ID 140 is already mapped to another ID:
|   | utility_id_ferc1 | utility_id_ferc1_xbrl | utility_id_ferc1_dbf |
|---|---|---|---|
| 403 | 414 | R001140 | 140 |
| 3 | \<NA\> | C011644 | 140 |
Weirdly, R001140 doesn't appear to show up anywhere in the XBRL `identification_001_duration` table. So we could probably just edit `utility_id_ferc1.csv` to map DBF 140 to XBRL C011644 and PUDL 414.
- `C011745`: another exact match, this time with DBF respondent ID 447! And this one is also already mapped to another ID:
|   | utility_id_ferc1 | utility_id_ferc1_xbrl | utility_id_ferc1_dbf |
|---|---|---|---|
| 152 | 153 | C000030 | 447 |
| 6 | \<NA\> | C011745 | 447 |
And unfortunately, both of these entity IDs show up in our XBRL table attached to PJM!
|   | utility_id_ferc1 | utility_id_ferc1_xbrl | utility_id_ferc1_dbf | xbrl_name | dbf_name |
|---|---|---|---|---|---|
| 0 | 153 | C000030 | 447 | PJM Interconnection, LLC | PJM Settlement, Inc. |
| 1 | \<NA\> | C011745 | 447 | PJM Settlement, Inc. | PJM Settlement, Inc. |
But both PJM Interconnection and PJM Settlement have their own DBF IDs, so I'm not sure why we mapped `C000030` to 447 in the first place...

|   | respondent_id | respondent_name |
|---|---|---|
| 245 | 255 | PJM Interconnection, LLC |
| 354 | 447 | PJM Settlement, Inc. |
- `C011785`: the "closest matches" are all electric co-ops that seem unrelated to Mountrail-Williams, so this looks like another straightforward addition to the map. I also looked at all DBF `respondent_name`s that start with "M" and didn't find anything, so I feel confident that this utility just never reported in DBF format.

My questions to those who are more experienced in the mapping arena (@aesharpe? @zaneselvans? @katie-lamb?) are:

1. Should we remap `C000030` to 255 and `C011745` to 447, so PJM Interconnection and PJM Settlement are both mapped to themselves across the XBRL/DBF barrier?
2. Should we add brand-new entries for Transource Oklahoma and Mountrail-Williams?

Well, the good news is: I put in my draft remapping of XBRL -> DBF utility IDs, and now more stuff builds...
```diff
diff --git a/src/pudl/package_data/glue/utility_id_ferc1.csv b/src/pudl/package_data/glue/utility_id_ferc1.csv
index cde4d3f71..a0623757c 100644
--- a/src/pudl/package_data/glue/utility_id_ferc1.csv
+++ b/src/pudl/package_data/glue/utility_id_ferc1.csv
@@ -151,7 +151,7 @@ utility_id_ferc1,utility_id_ferc1_xbrl,utility_id_ferc1_dbf
 150,,1001
 151,,1003
 152,C000029,231
-153,C000030,447
+153,C000030,255
 154,C000038,250
 155,C000041,161
 156,C000045,294
@@ -402,7 +402,7 @@ utility_id_ferc1,utility_id_ferc1_xbrl,utility_id_ferc1_dbf
 408,R001052,52
 409,R001072,72
 411,R001087,87
-414,R001140,140
+414,C011644,140
 419,R001268,268
 420,R001298,298
 421,R001301,301
@@ -412,10 +412,12 @@ utility_id_ferc1,utility_id_ferc1_xbrl,utility_id_ferc1_dbf
 428,R001419,419
 429,R001422,422
 436,R001445,445
-437,,255
 439,R001515,515
 440,R001520,520
 441,R001521,521
 442,,278
 443,C010845,
 444,C011304,
+445,C011745,447
+446,C004679,
+447,C011785,
\ No newline at end of file
```
Now I only get two failures in the core asset group:
`fuel_ferc1`:

```
ValueError: fuel_ferc1 (xbrl): Found null primary key values.
report_year              False
utility_id_ferc1_xbrl    False
fuel_type_code_pudl       True
plant_name_ferc1         False
dtype: bool
```
And `retained_earnings_ferc1`:

```
ValueError: not enough values to unpack (expected 1, got 0)
...
  File "/Users/dazhong-catalyst/work/pudl/src/pudl/transform/ferc1.py", line 4680, in add_previous_year_factoid
    [missing_year] = [
    ^^^^^^^^^^^^^^
```
@jdangerx see my PR re: the `fuel_type_code` null value failure; it's a simple fix: https://github.com/catalyst-cooperative/pudl/pull/2881
OK, some notes on the validation tests:
These are the two main things that I'm seeing:

- some rows from 2021 are being dropped. This is likely due to an issue where the data for 2021 isn't associated with the 2021 `report_year` because of a missing `ReportYear` fact in the XBRL report. I'll use some trickery to turn the `report_year` into the year of the
- in `plants_steam`, we're seeing ~750 plants being changed for every year, including DBF-only years. What's going on there?
|   | entity_id | filing_name | report_year | start_date | end_date | coolants_and_water |
|---|---|---|---|---|---|---|
| 300 | C000041 | Southern_California_Edison_Company_form1_Q4_16... | 2021 | 2020-01-01 | 2020-12-31 | 7649530.0 |
| 955 | C000041 | Southern_California_Edison_Company_form1_Q4_16... | 2022 | 2022-01-01 | 2022-12-31 | 9136620.0 |
| 954 | C000041 | Southern_California_Edison_Company_form1_Q4_16... | 2022 | 2021-01-01 | 2021-12-31 | 7890927.0 |
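The note about `report_year` "trickery" is cut off in the thread, but one plausible reading, given the misaligned SCE rows above, is to re-derive the year from each fact's duration. This is a hypothetical sketch; using `end_date` as the authoritative signal is my assumption, not something the thread confirms:

```python
import pandas as pd

# Toy frame mirroring the SCE rows above.
filings = pd.DataFrame({
    "entity_id": ["C000041"] * 3,
    "report_year": [2021, 2022, 2022],
    "end_date": pd.to_datetime(["2020-12-31", "2022-12-31", "2021-12-31"]),
})

# Hypothetical fix: trust the duration's end date instead of the
# (sometimes missing or misreported) ReportYear fact.
filings["report_year"] = filings["end_date"].dt.year
print(filings["report_year"].tolist())  # [2020, 2022, 2021]
```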
`plants_steam`

There are 26,113 individual facts that changed in the plants_steam table, of which 21,065 are changes where the `plant_id_ferc1` changed number between the `dev` build and the build on this branch.
Of the remaining changes, we see a lot of "DBF data for a specific plant went from non-NA to NA" and a lot of "XBRL data for 2021 went from NA to not-NA." So it seems like we got more valid data for 2021, which is suspicious given the aforementioned `report_year` bug. And we lost a bunch of data from before 2021, which is bad!
I have a few questions, and I think the next steps are still:

- fix the `report_year` bug, so we can better understand how the 2021 data has changed without that confounding factor
- figure out how `plant_id_ferc1` gets assigned: why would it change? would that impact downstream data somehow?

`plant_id_ferc1` is algorithmically assigned based on the cosine similarity between records, and the numbers themselves aren't expected to be stable if the data changes, so this by itself isn't surprising or concerning. The addition of a new year of data would be expected to change the clustering of records that end up being associated with each other and given the same ID.
If there are changes to the data, that could also be affecting the IDs that get assigned, if any of the columns used to calculate the cosine similarity are changing. This is the matching problem that @zschira is about to swap in a much faster algorithm for (though it should have the same results, I think, since it's just a faster way to calculate only the top-K cosine similarities rather than the entire gigantic matrix).
Is there any pattern to the DBF data that's getting nullified?
The code that does the FERC 1 plant ID assignment is in `pudl.analysis.classify_plants_ferc1`.
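To make the cosine-similarity idea concrete, here's a toy, from-scratch sketch of matching plant records by name. The plant names are hypothetical, and the real PUDL code is much more involved (it also weights numeric columns), so this is only an illustration of the technique:

```python
from collections import Counter
import math

def ngrams(text: str, n: int = 3) -> Counter:
    """Character trigram counts, a crude stand-in for a TF vector."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

plants_2021 = ["big creek hydro plant", "mountainview generating station"]
plants_2022 = ["big creek hydro", "mountainview generating station unit 2"]

# Link each new record to its most similar prior-year record.
for name in plants_2022:
    best = max(plants_2021, key=lambda prev: cosine(ngrams(prev), ngrams(name)))
    print(name, "->", best)
```

The point relevant to the thread: the resulting cluster IDs depend on which records end up grouped, so adding a new year of data can legitimately reshuffle `plant_id_ferc1` values without any underlying data being wrong.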
Woohoo! It turns out that if you ignore `plant_id_ferc1`, the big data discrepancies in the `plants_steam/all` tables just go away!

There's still some shuffling around in the 2021 data, but I think that is related to #2947. I'll fix up the PR and re-run the ETL to see if that fixes things.
After applying the fix from #2947, all the ferc1 tables have only the following sorts of changes:
So I updated the min/max row counts in `validate/ferc1_test.py`.
There is still a validation test failure in `validate/plants_steam_ferc1_test.py`:

```
FAILED test/validate/plants_steam_ferc1_test.py::test_vs_bounds[ferc1_annual-capacity_ratios] - ValueError: 5% quantile (0.0) is below lower bound (0.5) in validation entitled Capability R...
```
This indicates that too many plants have a 0 capability ratio. I think this is because the XBRL plants inherently include a much higher percentage of plants with a 0 capability ratio; see the table below. These 0 values are traceable directly to the raw XBRL files we're getting, so it's not a quirk of our processing.
| years | old | new |
|---|---|---|
| 2021 | | |
| 2022 | n/a | |
| all | | |
| xbrl-only | | |
| dbf-only | | |
My pitch is to update the validation test so that it filters out the 0-values from 2021 and beyond before asserting stuff about the distribution. WDYT @zaneselvans ?
@jdangerx Conceptually that sounds right, but I don't know if that's an easy thing to do within the structure of the validation tests.
I just updated the "query" - it was easy!
```diff
diff --git a/src/pudl/validate.py b/src/pudl/validate.py
index 3fe248f68..7e3d7f79d 100644
--- a/src/pudl/validate.py
+++ b/src/pudl/validate.py
@@ -832,9 +832,9 @@ plants_steam_ferc1_capacity_ratios = [
         "data_col": "capability_ratio",
         "weight_col": "",
     },
-    {
+    {  # XBRL data (post-2021) reports 0 capability for ~22% of plants, so we exclude.
         "title": "Capability Ratio (tails)",
-        "query": "",
+        "query": "report_year < 2021 | plant_capability_mw > 0",
         "low_q": 0.05,
         "low_bound": 0.5,
         "hi_q": 0.95,
```
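A toy demonstration of what a filter like that keeps; the frame here is made up, and I've added explicit parentheses since bare `|` in a `DataFrame.query` expression can run into Python operator-precedence surprises:

```python
import pandas as pd

df = pd.DataFrame({
    "report_year": [2019, 2021, 2022, 2022],
    "plant_capability_mw": [0.0, 0.0, 0.0, 500.0],
})
# Keep DBF-era rows unconditionally; drop post-2020 rows reporting 0 capability.
kept = df.query("(report_year < 2021) | (plant_capability_mw > 0)")
print(kept["report_year"].tolist())  # [2019, 2022]
```

Note this only narrows what the validation test asserts about; nothing is removed from the output tables themselves.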
Ah okay great!
RMI would very much like to have the zero-`plant_capability_mw` plants/generators/units included rather than dropped. Reviewing the raw `steam_electric_generating_plant_statistics_large_plants_402_duration` table, there are lots of zero cases with valuable capital and operating cost data that we'd like to use. So I suggest increasing the validation limit rather than dropping data.
I believe the change just filters what's getting validated to ignore 0 records, rather than dropping actual data from the database, @jrea-rmi. Here's the relevant commit.
@zaneselvans Let me know if I'm misrepresenting this, though!
Ooh, that'd be an important difference. Thanks @e-belfer!
Yes, that's my understanding as well @e-belfer.
Follow-up to #2810:

Now that we've gotten the new, more comprehensive extractor working for 2021 data, we should try to get it working for both 2021 and 2022 data.

Note: we're starting work on this before #2810 is complete; there are still some test updates necessary there, but we're working on those in parallel.
Individual error classes below.
If you fix these errors but the assets still don't build, that's fine - we'll merge the fixes into this branch and catalog the new errors.