catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
456 stars 106 forks source link

why is basically nothing in the rate base table from `core_ferc1__yearly_depreciation_by_function_sched219`in 2021? #3309

Closed cmgosnell closed 2 months ago

cmgosnell commented 5 months ago

see this comment that id-ed the problem. track it down & fix it. presumably its from the og table transform.

image

### Tasks
- [x] add standard subtotal_corrections
- [ ] add unit test for multi-dimension corrections
- [ ] rn the merge validation fails in `calculate_values_from_components`: can we just parameterize the validation? or do we need to selectively drop calc component records?
- [ ] make sure total (parent) -> null subtotal (child) corrections get made
- [ ] make sure null calculated values and/or null reported values actually get corrections
- [ ] do we need to iteratively make corrections (i.e. correct leafs, calculate parents w/ corrections, calculate grandparents /w corrections)
- [ ] do we want to make corrections on two dimensions or just choose one? (probably just choose one)
e-belfer commented 5 months ago

@cmgosnell: Out of 205 unique entities in the raw XBRL instant data, there are only 4 utilities that report data in the leafy "accumulated provision for depreciation of X" columns of the XBRL instant data in 2021. The rest report data in what is ultimately the "total" plant function of the "accumulated depreciation" factoid. For some reason, the corrections from subtotal to total in the plant_function sub-dimension are not getting propagated to the rate base table.

To see the values reported in this table:

out_ferc1__yearly_rate_base[(out_ferc1__yearly_rate_base.table_name == "core_ferc1__yearly_depreciation_by_function_sched219")&(out_ferc1__yearly_rate_base.report_year==2021)&(out_ferc1__yearly_rate_base.ending_balance.notnull())]

These are the same utilities observed in the transformed table:

from dagster import AssetKey

from pudl.etl import defs

asset_key = "core_ferc1__yearly_depreciation_by_function_sched219"
df = defs.load_asset_value(AssetKey(asset_key))

df[(df.report_year==2021)&(df.plant_function!="total")&(df.ending_balance.notnull())]

There are 0 corrections records in the rate base table for this table: df[(df.report_year==2021)].row_type_xbrl.value_counts()

row_type_xbrl
reported_value 4540
calculated_value 0
correction 0

That suggests the issue isn't pruning but rather actually happening in the generation of the correction records originally. Which makes sense because we currently aren't adding sub-total to total corrections, so we are effectively losing all "total" records in this table.

cmgosnell commented 3 months ago

@zaneselvans i think i would like your help in thinking about the error metrics bc i changed the abs_diff to include diffs of null values filled in with 0s on both the calculated and reported value sides (in this comment, really only on 1241). that way the correction records are much much more all encompassing. but this resulted in the metric checks failing because abs_diff is the main input to all of the metric checks and with this change there are way more calculated records with abs_diff values. The relative_error_magnitude is the main one that is now failing much more frequently. which makes sense because it is now including a lot of null vs some value.

We could force the metric checks to behave like they used to (i think), by somehow only checking these values when the reported and/or calculated values are the non-null. But I think it makes more sense to adjust the tolerance values because this new change actually feels more representative of the data itself.

jrea-rmi commented 3 months ago

Is this issue still outstanding?

I'm working on an update to our net_plance_balance table, which compiles values of original cost, accumulated depreciation, and the net plant balance by FERC technology then adds an estimate of renewables net plant balance as further detail of the "other" category. I'll then use that renewables breakout as further detail of your rate base table - but not expecting that you'll integrate our estimate of renewables because it's modeled rather than reported data.

I'm using core_ferc1__yearly_depreciation_by_function_sched219 for the accumulated depreciation data, but finding the same issue you have here that most 2021 data is missing.

I checked the raw xbrl table accumulated_provision_for_depreciation_of_electric_utility_plant_functional_classification_section_b_219_instant and I see data values there for accumulated depreciation by function than don't propogate into the PUDL table core_ferc1__yearly_depreciation_by_function_sched219. So I don't know what the issue is, but I do think the raw data exists rather than is missing.

zaneselvans commented 3 months ago

Hey @jrea-rmi yes I think this is still outstanding, and work on it is happening in PR #3450

jrea-rmi commented 3 months ago

@cmgosnell that work seems to be focused on corrections. But I think there's an issue in coverting the raw xbrl table into the PUDL table core_ferc1__yearly_depreciation_by_function_sched219, a separate issue from calculation corrections.

cmgosnell commented 3 months ago

hey @jrea-rmi sorry for the delay on getting back to this! This is a good find without a super straightforward solution. I'm going to lay out what I understand to be the problem and a possible solution but i will want @jdangerx to verify my assessment here because your much more familiar with the raw xbrl data than I am.

The high level problem here is that FERC does not replace records when a utility comes back and files an update, so we filter for the most recent updated record (pudl.io_managers.FercXBRLSQLiteIOManager.filter_for_freshest_data). In this table, basically all of the records got an update that only contains the total records, whereas the previous update does include more detail.

If this loss of detail is a blocker, I could imaging making an exception for this this table/year only where we grab the slightly less fresh data. but only if a few things are true:

I'm going to try to answer the first two questions. but I'd love @jdangerx 's perspective on this!!

jrea-rmi commented 3 months ago

Great. This loss of detail is definitely important for us!

If there isn't an update to any of the detailed by function values, I sure hope the totals are the same. And if they are, then it seems reasonable to use the penultimate record. Would that be easier than combining updated/previous records/filings?

jdangerx commented 3 months ago

So for each XBRL context (utility name, report year, ... - the primary key, basically) we could have multiple sets of facts from multiple XBRL filings.

Currently we treat the most recent filing as completely authoritative - if the most recent filing reports everything as nulls, we just trust that that's more accurate.

It would be very easy to say "sort all the facts for this context by publication time, and take the latest non-null reported value for each."

The new way is preferable in cases like this, where the newer filings only contain the updated values.

But the old way is preferable if a value was erroneously reported as non-null, i.e. someone put a number in the wrong field.

I sort of think that cases like this are more common, so would be happy to switch the behavior over - but this might uncover some other weird behavior. I don't think there's a good blanket decision to make here.

To capture this data nuance, we could try to dynamically apply the two different approaches:

This gives us some broadly correct behavior while then allowing us to apply spot fixes if necessary... that refactor wouldn't be too big of a time sink, I think - the code is pretty easy to test and there's a natural place to pull out the 'deduplicate a group' logic. If we want to go for that I bet we could get it done in 5-10h - @cmgosnell I'm happy to crank on it if you have too much stuff on your plate.

cmgosnell commented 3 months ago

@jdangerx if you feel like you have capacity to take this on that would be swell! I would be a little reluctant to change the default behavior, but it might be a good idea. It sounds like it cooould end up being simpler.

And maybe i'm off on this but could this kind of nulls perhaps be a result of the sql-ification of the xbrl data? as in... can the utils re-report one specific fact? and we take that fact and slurp it into a db table and it looks like it has a bunch of nulls in the same record. I'm not sure if there is a way to distinguish a null bc its really null vs a null bc the respondent only updated a few values.

I think you have more context on how simple and/or appropriate applying either approach would be so I'm happy to defer to you on this one.

jdangerx commented 3 months ago

Did some more digging & thinking, screed ahead!

Here are the possible ways to deduplicate:

There are a bunch of tables with this data loss issue - if I switch from "last snapshot" to "first snapshot" dataset-wide, I see 3 tables where we have at least 1 completely null column in 2021 due to the "last snapshot" methodology. This is a lower bound on the data loss since I didn't look for more subtle forms.

The "first snapshot" approach means we miss out on all updates whether they are snapshot updates or diff updates. It also means we miss out on all the problems we have dealing with updated data and multiple filings, but I think the value of getting updated data outweighs that benefit.

This data loss we've found is due to applying "last snapshot" everywhere, even when the latest filing was a diff. So we shouldn't blanket apply that either.

We could apply "apply diffs" everywhere, which gets us the latest / most complete data possible. But...

We could apply "best snapshot" everywhere, which would keep the 1:1 mapping from table row to XBRL filing. But...

Weighing all these, my conclusions are:

My pitch is this:

To fix this data loss issue in the rate base table, the minimal change is the first bullet. The next two bullets are useful infrastructural investments which will make future XBRL-based work easier.

My current plan, then, is to make a small PR for the first bullet and file issues/put TODOs in the code for the other two. If we think that further investments are worthwhile right now I'm happy to take those on too - but that depends on our priorities, project budget, etc. @cmgosnell @jonrea do we have specific expectations for how much more time we want to spend on this project?

jrea-rmi commented 3 months ago

wow, great thinking here.

I agree that the best way to do this is the "apply diff" method, and that it will be important to be able to differentiate between non-reported facts and reported-null facts.

In terms of RMI priorities, this is at least near the top. For the next month or two, we want to keep our overall monthly budget within or below what it's been previously. Let's review that in our next check-in on April 5, but definitely worth putting some effort in between now and then.

jdangerx commented 3 months ago

After talking a bit more with @cmgosnell - let's do a little bit more investigation to better understand the impact of changing our deduplication approach:

  1. Let's take a look at the raw XBRL data and see if we can find any actively reported null values there - that can help us understand if we'll be missing any instances where a value is changed from non-null to null.
  2. Let's also take a look at the number of non-null values per XBRL context per filing - it may be easy to guess whether a filing is a diff or a snapshot, and then act accordingly.

Then we can think about what the best option for handling this data is.

jdangerx commented 3 months ago

To find the values that are actually reported as null values, I looked for xsi:nil="true" for all the filings that might have 2021 data (e.g. 2022 and 2021 filings.)

What I saw was that these numbers tended not to change much between filings for the most part. I'm not sure if any of these are values that have changed from non-null to null. But it does seem like we should probably take a look at if there is a nice way to tell the difference between diff filings and snapshot filings, since some of these numbers seem non-trivial.

Here are the filings I saw where the number of null values increased between filings over time. The timestamps starting with something like 165... are 2021 filings and 168... are 2022 filings. ``` Alaska_Electric_Light_and_Power_Company_form1_Q4_1650331752.xbrl:4 Alaska_Electric_Light_and_Power_Company_form1_Q4_1681860538.xbrl:193 Avista_Corporation_form1_Q4_1650073347.xbrl:22 Avista_Corporation_form1_Q4_1681871695.xbrl:31 ISO_New_England_Inc._form1_Q4_1650043587.xbrl:9 ISO_New_England_Inc._form1_Q4_1681504193.xbrl:25 Lockhart_Power_Company_form1_Q4_1647971588.xbrl:11 Lockhart_Power_Company_form1_Q4_1679619233.xbrl:13 Mckenzie_Electric_Cooperative,_Inc._form1_Q4_1650337193.xbrl:22 Mckenzie_Electric_Cooperative,_Inc._form1_Q4_1681892382.xbrl:23 Midcontinent_Independent_System_Operator,_Inc._form1_Q4_1650329591.xbrl:11 Midcontinent_Independent_System_Operator,_Inc._form1_Q4_1681766565.xbrl:8 Midcontinent_Independent_System_Operator,_Inc._form1_Q4_1681844537.xbrl:10 New_York_Transco,_LLC_form1_Q4_1650679051.xbrl:2 New_York_Transco,_LLC_form1_Q4_1681874702.xbrl:4 PJM_Interconnection,_L.L.C._form1_Q4_1649716496.xbrl:12 PJM_Interconnection,_L.L.C._form1_Q4_1681512803.xbrl:13 Southern_California_Edison_Company_form1_Q4_1649310757.xbrl:184 Southern_California_Edison_Company_form1_Q4_1659404061.xbrl:185 Southern_California_Edison_Company_form1_Q4_1680848437.xbrl:153 Startrans_IO,_LLC_form1_Q4_1650319617.xbrl:14 Startrans_IO,_LLC_form1_Q4_1681877357.xbrl:17 Startrans_IO,_LLC_form1_Q4_1689807014.xbrl:17 Wolverine_Power_Supply_Cooperative,_Inc._form1_Q4_1650333900.xbrl:10 Wolverine_Power_Supply_Cooperative,_Inc._form1_Q4_1651188465.xbrl:10 Wolverine_Power_Supply_Cooperative,_Inc._form1_Q4_1652991210.xbrl:13 Wolverine_Power_Supply_Cooperative,_Inc._form1_Q4_1668555358.xbrl:13 Wolverine_Power_Supply_Cooperative,_Inc._form1_Q4_1681851597.xbrl:11 ``` ``` > ag -c 'xsi:nil="true"' | sort ALABAMA_POWER_COMPANY_form1_Q4_1650346677.xbrl:8 ALABAMA_POWER_COMPANY_form1_Q4_1681786461.xbrl:8 ALLETE,_Inc._form1_Q4_1650333678.xbrl:48 ALLETE,_Inc._form1_Q4_1681865705.xbrl:44 Alaska_Electric_Light_and_Power_Company_form1_Q4_1650331752.xbrl:4 Alaska_Electric_Light_and_Power_Company_form1_Q4_1681860538.xbrl:193 Ameren_Illinois_Company_form1_Q4_1649472192.xbrl:2 Ameren_Illinois_Company_form1_Q4_1662778537.xbrl:2 Ameren_Illinois_Company_form1_Q4_1681174555.xbrl:2 Ameren_Transmission_Company_of_Illinois_form1_Q4_1649471880.xbrl:2 Ameren_Transmission_Company_of_Illinois_form1_Q4_1662778046.xbrl:2 Ameren_Transmission_Company_of_Illinois_form1_Q4_1681269271.xbrl:2 American_Transmission_Company_LLC_form1_Q4_1650677382.xbrl:1162 American_Transmission_Company_LLC_form1_Q4_1681850636.xbrl:30 Avista_Corporation_form1_Q4_1650073347.xbrl:22 Avista_Corporation_form1_Q4_1681871695.xbrl:31 Basin_Electric_Power_Cooperative_form1_Q4_1649920868.xbrl:60 Basin_Electric_Power_Cooperative_form1_Q4_1681856053.xbrl:30 CENTRAL_HUDSON_GAS_&_ELECTRIC_CORPORATION_form1_Q4_1650324047.xbrl:1 CENTRAL_HUDSON_GAS_&_ELECTRIC_CORPORATION_form1_Q4_1681434108.xbrl:1 California_Independent_System_Operator_Corporation_form1_Q4_1650089940.xbrl:2 California_Independent_System_Operator_Corporation_form1_Q4_1681787688.xbrl:2 CenterPoint_Energy_Houston_Electric,_LLC_form1_Q4_1681507318.xbrl:2 Central_Hudson_Gas_&_Electric_form1_Q4_1650140284.xbrl:1 Citizens_Sunrise_Transmission_LLC_form1_Q4_1650054717.xbrl:2 Citizens_Sunrise_Transmission_LLC_form1_Q4_1681522045.xbrl:1 Citizens_Sycamore-Penasquitos_Transmission_LLC_form1_Q4_1650324086.xbrl:5 Citizens_Sycamore-Penasquitos_Transmission_LLC_form1_Q4_1681522122.xbrl:3 Connecticut_Yankee_Atomic_Power_Company_form1_Q4_1652466890.xbrl:1 Connecticut_Yankee_Atomic_Power_Company_form1_Q4_1683767371.xbrl:1 DATC_Path_15,_LLC_form1_Q4_1650336567.xbrl:2 DATC_Path_15,_LLC_form1_Q4_1651612870.xbrl:2 DATC_Path_15,_LLC_form1_Q4_1681509693.xbrl:2 Duke_Energy_Carolinas,_LLC_form1_Q4_1650332292.xbrl:2 Duke_Energy_Carolinas,_LLC_form1_Q4_1681509546.xbrl:2 Duke_Energy_Carolinas,_LLC_form1_Q4_1683749074.xbrl:2 Duke_Energy_Carolinas,_LLC_form1_Q4_1691775432.xbrl:2 Duke_Energy_Indiana,_LLC_form1_Q4_1650326662.xbrl:2 Duke_Energy_Indiana,_LLC_form1_Q4_1681783754.xbrl:2 Duke_Energy_Indiana,_LLC_form1_Q4_1683137115.xbrl:2 Duke_Energy_Kentucky,_Inc._form1_Q4_1650328445.xbrl:2 Duke_Energy_Kentucky,_Inc._form1_Q4_1653106736.xbrl:2 Duke_Energy_Kentucky,_Inc._form1_Q4_1681518600.xbrl:2 Duke_Energy_Kentucky,_Inc._form1_Q4_1682374583.xbrl:2 Fitchburg_Gas_and_Electric_Light_Company_form1_Q4_1648787356.xbrl:8 Fitchburg_Gas_and_Electric_Light_Company_form1_Q4_1679449968.xbrl:8 Fitchburg_Gas_and_Electric_Light_Company_form1_Q4_1679516751.xbrl:8 Georgia_Power_Company_form1_Q4_1649884766.xbrl:5 Georgia_Power_Company_form1_Q4_1680321812.xbrl:4 Georgia_Power_Company_form1_Q4_1682045206.xbrl:4 Georgia_Power_Company_form1_Q4_1688178807.xbrl:4 ISO_New_England_Inc._form1_Q4_1650043587.xbrl:9 ISO_New_England_Inc._form1_Q4_1681504193.xbrl:25 Idaho_Power_Company_form1_Q4_1650048652.xbrl:29 Idaho_Power_Company_form1_Q4_1663011551.xbrl:29 Idaho_Power_Company_form1_Q4_1681498784.xbrl:27 Lockhart_Power_Company_form1_Q4_1647971588.xbrl:11 Lockhart_Power_Company_form1_Q4_1679619233.xbrl:13 Madison_Gas_and_Electric_Company_form1_Q4_1650318769.xbrl:18 Madison_Gas_and_Electric_Company_form1_Q4_1681774992.xbrl:14 Maine_Yankee_Atomic_Power_Company_form1_Q4_1652481695.xbrl:1 Maine_Yankee_Atomic_Power_Company_form1_Q4_1652481987.xbrl:1 Mckenzie_Electric_Cooperative,_Inc._form1_Q4_1650337193.xbrl:22 Mckenzie_Electric_Cooperative,_Inc._form1_Q4_1681892382.xbrl:23 Midcontinent_Independent_System_Operator,_Inc._form1_Q4_1650329591.xbrl:11 Midcontinent_Independent_System_Operator,_Inc._form1_Q4_1681766565.xbrl:8 Midcontinent_Independent_System_Operator,_Inc._form1_Q4_1681844537.xbrl:10 Morongo_Transmission_LLC_form1_Q4_1652753079.xbrl:7 Morongo_Transmission_LLC_form1_Q4_1681866628.xbrl:3 Mountrail-Williams_Electric_Cooperative_form1_Q4_1681856070.xbrl:6 New_York_Transco,_LLC_form1_Q4_1650679051.xbrl:2 New_York_Transco,_LLC_form1_Q4_1681874702.xbrl:4 PJM_Interconnection,_L.L.C._form1_Q4_1649716496.xbrl:12 PJM_Interconnection,_L.L.C._form1_Q4_1681512803.xbrl:13 Portland_General_Electric_Company_form1_Q4_1649826303.xbrl:70 Portland_General_Electric_Company_form1_Q4_1650930041.xbrl:70 Portland_General_Electric_Company_form1_Q4_1681523942.xbrl:64 SOUTHERN_ELECTRIC_GENERATING_COMPANY_form1_Q4_1650338438.xbrl:2 SOUTHERN_ELECTRIC_GENERATING_COMPANY_form1_Q4_1650504486.xbrl:2 SOUTHERN_ELECTRIC_GENERATING_COMPANY_form1_Q4_1681815561.xbrl:1 San_Diego_Gas_&_Electric_Company_form1_Q4_1681804734.xbrl:2 Southern_California_Edison_Company_form1_Q4_1649310757.xbrl:184 Southern_California_Edison_Company_form1_Q4_1659404061.xbrl:185 Southern_California_Edison_Company_form1_Q4_1680848437.xbrl:153 Startrans_IO,_LLC_form1_Q4_1650319617.xbrl:14 Startrans_IO,_LLC_form1_Q4_1681877357.xbrl:17 Startrans_IO,_LLC_form1_Q4_1689807014.xbrl:17 Superior_Water_Light_&_Power_form1_Q4_1650333855.xbrl:10 Superior_Water_Light_&_Power_form1_Q4_1681865034.xbrl:9 UNION_ELECTRIC_COMPANY_form1_Q4_1649986899.xbrl:2 UNION_ELECTRIC_COMPANY_form1_Q4_1681357173.xbrl:2 Unitil_Energy_Systems,_Inc._form1_Q4_1648788001.xbrl:4 Unitil_Energy_Systems,_Inc._form1_Q4_1679450160.xbrl:4 Unitil_Energy_Systems,_Inc._form1_Q4_1679516714.xbrl:4 Upper_Michigan_Energy_Resources_Corporation_form1_Q4_1650071061.xbrl:36 Upper_Michigan_Energy_Resources_Corporation_form1_Q4_1681790826.xbrl:12 Upper_Missouri_G._&_T._Electric_Cooperative,_Inc._form1_Q4_1649721069.xbrl:16 Upper_Missouri_G._&_T._Electric_Cooperative,_Inc._form1_Q4_1680128870.xbrl:6 Versant_Power_form1_Q4_1650001412.xbrl:8 Versant_Power_form1_Q4_1680661938.xbrl:8 Versant_Power_form1_Q4_1685056664.xbrl:8 Versant_Power_form1_Q4_1689725909.xbrl:8 Wisconsin_Electric_Power_Company_form1_Q4_1650071326.xbrl:56 Wisconsin_Electric_Power_Company_form1_Q4_1681785089.xbrl:8 Wisconsin_Public_Service_Corporation_form1_Q4_1650072443.xbrl:14 Wisconsin_Public_Service_Corporation_form1_Q4_1681786108.xbrl:37 Wisconsin_River_Power_Company_form1_Q4_1650069359.xbrl:3 Wisconsin_River_Power_Company_form1_Q4_1681509642.xbrl:2 Wolverine_Power_Supply_Cooperative,_Inc._form1_Q4_1650333900.xbrl:10 Wolverine_Power_Supply_Cooperative,_Inc._form1_Q4_1651188465.xbrl:10 Wolverine_Power_Supply_Cooperative,_Inc._form1_Q4_1652991210.xbrl:13 Wolverine_Power_Supply_Cooperative,_Inc._form1_Q4_1668555358.xbrl:13 Wolverine_Power_Supply_Cooperative,_Inc._form1_Q4_1681851597.xbrl:11 Yankee_Atomic_Electric_Company_form1_Q4_1652480674.xbrl:8 Yankee_Atomic_Electric_Company_form1_Q4_1683824491.xbrl:1 ```

We see a few fields dominating those null values:

> ag --nofilename 'xsi:nil="true"' | ag '<ferc:' | awk '{print $1}' | sort | uniq -c | sort -rn | head
 743 <ferc:LengthForTransmissionLinesAggregatedWithOtherStructures
 408 <ferc:LengthForStandAloneTransmissionLines
 307 <ferc:BillingDemand
  73 <ferc:GeneralInformationAboutPropertyOfTheRespondentHeldByReceiverOrTrusteeDescription
  49 <ferc:DepreciablePlantBase
  46 <ferc:UtilityPlantEstimatedAverageServiceLife
  46 <ferc:OtherRegulatoryLiabilities
  44 <ferc:UtilityPlantAppliedDepreciationRate
  42 <ferc:ElectricPlantPropertyClassifiedAsHeldForFutureUseExpectedUseInServiceDate
  39 <ferc:FootnoteReferences

``

jdangerx commented 3 months ago

Unfortunately, it seems like we don't have a super clean distribution to differentiate between "definitely a diff" and "definitely a snapshot" - here's the distribution of "number of non-null data values per (context, filing) pair" in the three problem tables. Here "data values" means "values in columns that are not the primary key."

image

And here is a similar distribution, but defining "data values" as "values in columns which have any NAs at all": image

The order here is

[
    "raw_ferc1_xbrl__accumulated_provision_for_depreciation_of_electric_utility_plant_functional_classification_section_b_219_instant",
    "raw_ferc1_xbrl__summary_of_utility_plant_and_accumulated_provisions_for_depreciation_amortization_and_depletion_200_instant",
    "raw_ferc1_xbrl__electric_energy_account_401a_duration",
]

So you can sort of see that the first and third tables might have a nice cutoff for determining diff-ness vs. snapshot-ness but the second one seems pretty arbitrary.

Upshot: we should dig in and see what values become non-null when using "apply diffs" vs. "best snapshot" vs. "last snapshot" - and then do a spot check of those with the raw data to see what happened there. If the spot check doesn't turn up new/surprising issues, and the number of "values that go from null to non-null because we missed an actively reported null value" isn't super high, then we should go ahead with the "apply diffs everywhere" approach.

Depending on how many of those potentially missed null values there are, we should also prioritize the ferc-to-XBRL refactor. That will probably take a good chunk of time - something on the order of 50 hours, since it radically changes the structure of our XBRL data which the rest of our XBRL transforms rely on.

zaneselvans commented 3 months ago

It's kind of wild that there's so much complexity coming from these refilings / partial updates. @campbellpryde @austinmatherne-wk is this a normal thing for XBRL? Is there a canonical way that it's dealt with? Or is FERC somehow making this extra difficult?

jdangerx commented 3 months ago

To be fair, that’s sort of our fault for not distinguishing between non-reported values and reported null values in our SQLite conversion. Though the FERC filings might be weird in one or another way beyond that…

jdangerx commented 3 months ago

TL;DR: apply-diffs is the best short term solution. The medium-term step after that is moving the XBRL deduplication logic into our transformation framework so we can apply spot fixes. In the future we should be building better tools to help us quickly identify and investigate suspicious data because there will always be some level of manual spot fixing needed.

I compared best-snapshot with apply-diffs for 20 tables and found 55 values which went from null in the best-snapshot case to a non-null value. I also found 295 values which were updated from one non-null value to another non-null value, which I assumed were legitimate updates. So, about 90% of the diff between the two favors apply-diffs.

For the first category, I largely relied on tracing the individual numeric values through the raw XBRL files.

Most of the changes were legitimate updates. There were a few instances where a non-null value for field A was reported for similar-sounding field A' in later reports. Most of the time, field A was never updated with a null value. The only instance I found where a utility reports a null value to overwrite field A was MISO. Good job MISO!

I had one question about some of the values, specifically in the raw_ferc1_xbrl__accumulated_provision_for_depreciation_of_electric_utility_plant_functional_classification_section_b_219_instant table, that maybe @jrea-rmi could help answer.

We see a de-nullification in the accumulated_depreciation_distribution value for Silver Run Electric for 2021 - which is 5,257,734.

That value shows up as the following facts in Silver Run filings:

Silver_Run_Electric,_Llc_form1_Q4_1650068815.xbrl (2021 filing 1) AccumulatedDepreciationDistribution AccumulatedProvisionForDepreciationOfElectricUtilityPlant DepreciationUtilityPlantInService

Silver_Run_Electric,_Llc_form1_Q4_1654217180.xbrl (2021 filing 2) AccumulatedDepreciationTransmission AccumulatedProvisionForDepreciationOfElectricUtilityPlant DepreciationUtilityPlantInService

Silver_Run_Electric,_Llc_form1_Q4_1681780102.xbrl (2022 filing 1) AccumulatedProvisionForDepreciationOfElectricUtilityPlant

Silver_Run_Electric,_Llc_form1_Q4_1687981013.xbrl (2022 filing 2) AccumulatedProvisionForDepreciationOfElectricUtilityPlant

So this looks like they initially reported that value as "distribution" but later updated it to "transmission." I suppose either report could be in error there, but maybe @jonrea you have better insight into which one is correct.

I spot checked the following other tables:

['raw_ferc1_xbrl__statement_of_income_114_duration',
 'raw_ferc1_xbrl__other_regulatory_liabilities_account_254_278_instant',
 'raw_ferc1_xbrl__steam_electric_generating_plant_statistics_large_plants_402_instant',
 'raw_ferc1_xbrl__electric_operating_revenues_300_duration',
 'raw_ferc1_xbrl__electric_energy_account_401a_duration',
 'raw_ferc1_xbrl__summary_of_depreciation_and_amortization_charges_section_a_336_duration',
 'raw_ferc1_xbrl__electric_plant_in_service_204_instant',
 'raw_ferc1_xbrl__pumped_storage_generating_plant_statistics_large_plants_408_instant',
 'raw_ferc1_xbrl__retained_earnings_118_instant',
 'raw_ferc1_xbrl__statement_of_cash_flows_120_instant',
 'raw_ferc1_xbrl__comparative_balance_sheet_liabilities_and_other_credits_110_instant',
 'raw_ferc1_xbrl__accumulated_provision_for_depreciation_of_electric_utility_plant_functional_classification_section_b_219_instant',
 'raw_ferc1_xbrl__electric_operations_and_maintenance_expenses_320_duration',
 'raw_ferc1_xbrl__accumulated_provision_for_depreciation_of_electric_utility_plant_changes_section_a_219_duration',
 'raw_ferc1_xbrl__electric_plant_in_service_204_duration',
 'raw_ferc1_xbrl__generating_plant_statistics_410_instant',
 'raw_ferc1_xbrl__comparative_balance_sheet_assets_and_other_debits_110_instant',
 'raw_ferc1_xbrl__hydroelectric_generating_plant_statistics_large_plants_406_instant',
 'raw_ferc1_xbrl__summary_of_utility_plant_and_accumulated_provisions_for_depreciation_amortization_and_depletion_200_instant',
 'raw_ferc1_xbrl__accumulated_provision_for_depreciation_of_electric_utility_plant_changes_section_a_219_instant']

I think, before we tackle the FERC 2023 data in earnest, we should make some tools that identify suspicious values and help us quickly investigate them / turn them into spot fixes. But we don't need to do that now.

jrea-rmi commented 2 months ago

Silver Run Electric is a transmission company, so the 2021 filing 2 with value assigned to transmission is correct.

I agree with using apply-diffs short term, would want to go ahead with that as quickly as we can!