catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Enable aggregation of reshaped FERC DBF/XBRL data #2016

Closed zaneselvans closed 1 week ago

zaneselvans commented 1 year ago

Organize the per-row/col IDs compiled in #2015 such that common / useful aggregations of the reported financial data can be recreated programmatically.

# Phase 1
- [x] #2604 @cmgosnell
- [x] spot check xbrl metadata calculations within `income_statement_ferc1`: usable or not? @zaneselvans
- [x] spot check xbrl metadata calculations within `depreciation_amortization_summary_ferc1`: usable or not? @cmgosnell
- [x] spot check xbrl metadata calculations within `electric_operating_expenses_ferc1` @cmgosnell
- [x] Check metadata/calculations for: `balance_sheet_assets_ferc1` @e-belfer
- [x] Check metadata/calculations for: `utility_plant_summary_ferc1` @e-belfer
- [x] Check metadata/calculations for: `plant_in_service_ferc1`
- [x] https://github.com/catalyst-cooperative/pudl/issues/2579 @e-belfer
- [x] check metadata/calculations in `balance_sheet_liabilities_ferc1` @e-belfer
- [ ] https://github.com/catalyst-cooperative/pudl/issues/2581
- [ ] https://github.com/catalyst-cooperative/pudl/issues/2584
- [ ] https://github.com/catalyst-cooperative/pudl/issues/2597
- [ ] https://github.com/catalyst-cooperative/pudl/issues/2619
# Phase 2 - Draft Outputs!
- [ ] https://github.com/catalyst-cooperative/pudl/issues/2646
- [ ] https://github.com/catalyst-cooperative/pudl/issues/1811
- [ ] https://github.com/catalyst-cooperative/pudl/issues/2599
- [ ] https://github.com/catalyst-cooperative/pudl/issues/2683
- [ ] https://github.com/catalyst-cooperative/pudl/issues/2623
- [ ] https://github.com/catalyst-cooperative/pudl/issues/2625
- [ ] https://github.com/catalyst-cooperative/pudl/issues/2624
### Phase 3 - Refinements/Validations
- [ ] https://github.com/catalyst-cooperative/pudl/issues/2700
- [ ] https://github.com/catalyst-cooperative/pudl/issues/2738
- [ ] https://github.com/catalyst-cooperative/pudl/issues/2605
- [ ] https://github.com/catalyst-cooperative/pudl/issues/2736
- [ ] https://github.com/catalyst-cooperative/pudl/pull/2805
- [ ] https://github.com/catalyst-cooperative/pudl/issues/2717
- [ ] https://github.com/catalyst-cooperative/pudl/issues/2703
- [x] null starting balances for corrections
- [x] For 1:1 linkages, merge correction records
- [ ] https://github.com/catalyst-cooperative/pudl/issues/2600
- [ ] https://github.com/catalyst-cooperative/pudl/issues/3203
- [ ] https://github.com/catalyst-cooperative/pudl/issues/3072
- [ ] https://github.com/catalyst-cooperative/pudl/issues/3309
- [ ] https://github.com/catalyst-cooperative/pudl/issues/3530
- [ ] https://github.com/catalyst-cooperative/pudl/issues/3310

Possible future refinements

cmgosnell commented 1 year ago

this is just a note to self for potential future use. a decent way to find which facts in an xbrl table correspond to multiple lines in unstructured tables:

table_name = "your_table_name"  # fill in the table of interest
# return all of the facts that have "RowStart"
# in their form_location references
multi_line_facts = [
    fact
    for fact in xbrl_meta[table_name]["duration"]
    if next(
        (
            loc for loc in fact["references"]["form_location"]
            if "RowStart" in loc and loc["Form"] == "Form 1"
        ),
        False,
    )
]
zaneselvans commented 1 year ago

Some automated calculation snippets and investigation.

At least for some of the tables, it's not hard to bring in the XBRL calculations and their weights, and apply them to the data values. Looking at the income_statement_ferc1 table, I wanted to use the included calculations to reproduce the totals and subtotals that are reported in the table just to confirm that we understand how these pieces fit together. The income_statement_ferc1 table has lots of calculated rows, but no unstructured data and a simple structure without additional dimensions to consider.

import sqlalchemy as sa
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pudl
from dagster import AssetKey
from pudl.etl import defs
from typing import Any, Literal

pudl_settings = pudl.workspace.setup.get_defaults()
pudl_engine = sa.create_engine(pudl_settings["pudl_db"])

def get_calculation(
    fact_name: str,
    table_name: str,
    xbrl_meta: list[dict[str, Any]],
    period_type: Literal["instant", "duration"],
) -> pd.DataFrame:
    table_meta = xbrl_meta[table_name][period_type]
    return pd.json_normalize([x for x in table_meta if x["name"] == fact_name][0]["calculations"])

# Contains the metadata for all tables as JSON
xbrl_meta = defs.load_asset_value(AssetKey("xbrl_metadata_json"))

# Read a denormalized version of the income table:
income = pd.read_sql("denorm_income_statement_ferc1", pudl_engine)
# A list of all the rows marked as calculations in the income table:
calculated_values = list(income[income.row_type_xbrl == "calculated_value"].income_type.unique())
calculated_values
['net_utility_operating_income',
 'net_other_income_and_deductions',
 'taxes_on_other_income_and_deductions',
 'other_income_deductions',
 'other_income',
 'net_income_loss',
 'income_before_extraordinary_items',
 'net_interest_charges',
 'utility_operating_expenses',
 'extraordinary_items_after_taxes',
 'net_extraordinary_items']

Extract calculation names and weights

calc_dfs = {}
for cv in calculated_values:
    calc_dfs[cv] = get_calculation(
        fact_name=cv,
        table_name="income_statement_ferc1",
        xbrl_meta=xbrl_meta,
        period_type="duration",
    )

Perform the specified calculations

We can't necessarily just merge the weights onto the income dataframe, because we know there are instances in which the same values appear in more than one calculation, even within a single dataframe, sometimes with different weights. Instead we grab the relevant rows from the income dataframe by merging with the calculation names/weights dataframe, apply the weights, and sum up the results into a single value that can be recombined with the income dataframe afterwards:

calculated_income_dfs = []
for cv in calculated_values:
    calculated_income_dfs.append(
        pd.merge(
            income,
            calc_dfs[cv],
            left_on="income_type",
            right_on="name",
        )
        .assign(calculated_income=lambda x: x.income * x.weight)
        .groupby(["utility_id_ferc1", "report_year"])
        .calculated_income.sum()
        .to_frame()
        .assign(income_type=cv)
        .reset_index()
    )

calculated_income = pd.merge(
    income,
    pd.concat(calculated_income_dfs),
    how="left",
)

Check calculation results

Calculate some metrics of relative & absolute difference between the directly reported and calculated values, so we can see how close our calculated_income column is to the "calculated" values reported in the original income column.

There are ~70,000 calculated records across all the years from 1994-2021. ~30,000 of them match exactly and ~25,000 are off by exactly a factor of 2, so there's probably some kind of sign error or double reporting happening that we could fix. The remaining 15,000 calculated values are off by a random amount, distributed in a bell-ish curve around a factor of 1.5x.

calculated_income["abs_diff"] = abs(calculated_income.income - calculated_income.calculated_income)
calculated_income["rel_diff"] = abs(calculated_income.abs_diff / calculated_income.income)

to_plot = calculated_income[np.isfinite(calculated_income.rel_diff)]
plt.hist(to_plot.rel_diff, bins=100, range=(0, 3));

image

to_plot = calculated_income[np.isfinite(calculated_income.rel_diff)]
to_plot = to_plot[(to_plot.rel_diff != 0) & (to_plot.rel_diff != 1.0)]

plt.hist(to_plot.rel_diff, bins=100, range=(0,3))

image

Distribution of mismatches?

Are the bad calculations clustered in time? Are particular calculated values worse than others?

How many of each type of calculated value is there actually?

calculated_income[
    (calculated_income.row_type_xbrl == "calculated_value")
].income_type.value_counts().sort_index()
income_type
extraordinary_items_after_taxes 270
income_before_extraordinary_items 5610
net_extraordinary_items 265
net_income_loss 5608
net_interest_charges 5494
net_other_income_and_deductions 5642
net_utility_operating_income 19373
other_income 5556
other_income_deductions 5134
taxes_on_other_income_and_deductions 4682
utility_operating_expenses 13746

How many are actually bad?

calculated_income[
    (calculated_income.row_type_xbrl == "calculated_value")
    & (~np.isclose(calculated_income.calculated_income, calculated_income.income))
].income_type.value_counts().sort_index()
income_type
income_before_extraordinary_items 5585
net_extraordinary_items 3
net_income_loss 4
net_interest_charges 3
net_other_income_and_deductions 2
net_utility_operating_income 19204
other_income 3
other_income_deductions 1870
taxes_on_other_income_and_deductions 497
utility_operating_expenses 13578

How is the badness distributed in time?

bad_calcs = calculated_income[
    (calculated_income.row_type_xbrl == "calculated_value")
    & (~np.isclose(calculated_income.calculated_income, calculated_income.income))
].groupby("report_year").income_type.count()

all_calcs = calculated_income[
    (calculated_income.row_type_xbrl == "calculated_value")
].groupby("report_year").income_type.count()

plt.plot((bad_calcs / all_calcs))
plt.ylabel("Fraction of bad income calculations");

image

And without the off-by-2 errors?

bad_calcs = calculated_income[
    (calculated_income.row_type_xbrl == "calculated_value")
    & (~np.isclose(calculated_income.calculated_income, calculated_income.income))
    & (calculated_income.rel_diff != 1.0)
].groupby("report_year").income_type.count()

all_calcs = calculated_income[
    (calculated_income.row_type_xbrl == "calculated_value")
].groupby("report_year").income_type.count()

plt.plot((bad_calcs / all_calcs))
plt.ylabel("Fraction of bad income calculations");

image

cmgosnell commented 1 year ago

hey @zaneselvans & @jrea-rmi ! I did a similar investigation on the depreciation_amortization_summary_ferc1 table and got almost universally good results using the metadata directly.

Setup

# setup/inputs
import logging

logger = logging.getLogger(__name__)

table_name = "depreciation_amortization_summary_ferc1"
xbrl_factoid = "ferc_account_label"  # several places in params
dollar_col = "depreciation_amortization_value"  # also the value_types in wide_to_tidy
tbl_df = pd.read_sql(table_name, pudl_engine)

xbrl_meta = defs.load_asset_value(AssetKey("xbrl_metadata_json"))
tbl_trans = pudl.transform.ferc1.DepreciationAmortizationSummaryFerc1TableTransformer()
xbrl_metadata_table = tbl_trans.process_xbrl_metadata(xbrl_meta[table_name])

def rename_calculations_xbrl_meta(
    xbrl_metadata_table: pd.DataFrame, dollar_col: str
) -> dict[str, list]:
    """Rename the calculations in the xbrl metadata to reflect PUDL names.

    (I have hope we can do this across the board but this is not generic enough rn)
    """
    calced_values = set(
        xbrl_metadata_table.loc[
            xbrl_metadata_table.row_type_xbrl == "calculated_value", "xbrl_factoid"
        ]
    )
    calcs = {
        calced_value: eval(
            xbrl_metadata_table.set_index(["xbrl_factoid"]).loc[calced_value, "calculations"]
        )
        for calced_value in calced_values
    }
    calcs_renamed = {
        calced_value: [
            {
                key: str(
                    tbl_trans.params.rename_columns_ferc1.duration_xbrl.columns.get(value, value)
                ).replace(f"_{dollar_col}", "")
                for (key, value) in calc.items()
            }
            for calc in calcs
        ]
        for (calced_value, calcs) in calcs.items()
    }
    return calcs_renamed

Check for this table

calcs_renamed = rename_calculations_xbrl_meta(xbrl_metadata_table, dollar_col)
# CHECK ONE CALC
calculated_fact_name = "depreciation_amortization_total"
# this one got renamed twice.... harder to track but not impossible
fact_calc = calcs_renamed["depreciation_and_amortization"]

pks = (
    pudl.metadata.classes.Package.from_resource_ids()
    .get_resource(table_name)
    .schema.primary_key
)
pks_wo_factoid = [col for col in pks if col != xbrl_factoid]

calc_step_col_names = {step["name"] for step in fact_calc}
test = (
    tbl_df[tbl_df[xbrl_factoid].isin(calc_step_col_names)]
    .merge(
        pd.DataFrame(fact_calc),
        left_on=[xbrl_factoid],
        right_on=["name"],
        how="left"
    )
    .groupby(pks_wo_factoid, observed=True)
    .sum(min_count=1, numeric_only=True)
    .merge(
        tbl_df.loc[tbl_df[xbrl_factoid] == calculated_fact_name]
        .set_index(pks_wo_factoid)[[dollar_col]],
        right_index=True,
        left_index=True,
        how="outer",
        validate="1:1",
        suffixes=("_calced","_total")
    )
)

baddies = test[
    (test[f"{dollar_col}_calced"] != test[f"{dollar_col}_total"])
    & (test[f"{dollar_col}_calced"].notnull() | test[f"{dollar_col}_total"].notnull())
]
logger.info(f"Non-matching records: {len(baddies)/len(test):.2%}")

baddies_not_close = test[
    ~np.isclose(test[f"{dollar_col}_calced"],test[f"{dollar_col}_total"])
    & (test[f"{dollar_col}_calced"].notnull() | test[f"{dollar_col}_total"].notnull())
]
logger.info(f"Non-close records: {len(baddies_not_close)/len(test):.2%}")

Results

Non-matching records: 0.07%
Non-close records: 0.04%

There are literally 30 totals that don't match up perfectly. There are 15 that aren't np.isclose. Most of these don't actually have values in the total line:

image
jrea-rmi commented 1 year ago

@zaneselvans great writeup. income_before_extraordinary_items, net_utility_operating_income, and utility_operating_expenses look like they're the biggest sources of badness. The common 2x error makes it seem likely that it's because the weight of the subtotal line is wrong? What's the way to fix that? Or the same relative error = 1 would come up if there are missing values, right? I actually think missing values are more likely than the calculation result being 1/2 of the subtotal line.

About 1000 of the mismatches (the ones very close to 0 or 1 in your second histogram) are probably because of rounding?

I also notice in denorm_income_statement_ferc1, 2021 data only has utility_type = electric for Duke Energy Progress, but in other years they have utility_type = electric, other, and total. I also only see lines from the top half of the income statement included. Do you know why? Is that part of the missing values leading to relative errors of 1?

jrea-rmi commented 1 year ago

@cmgosnell awesome. Those calcs must all be correct then, and the only mismatches are because of bad reporting.

What do you think about filling na values in the totals with the calced value? I expect the exploded table will only use the components that get to the calced value, so the totals wouldn't be included so effectively that's what would happen anyway?

zaneselvans commented 1 year ago

Concerns from chat w/ Christina.

cmgosnell commented 1 year ago

> @cmgosnell awesome. Those calcs must all be correct then, and the only mismatches are because of bad reporting.

I know 🌟 ! I was very pleasantly surprised.

> What do you think about filling na values in the totals with the calced value? I expect the exploded table will only use the components that get to the calced value, so the totals wouldn't be included so effectively that's what would happen anyway?

I do think this is effectively what will happen in an exploded version! we could certainly fill in the total preemptively. I think if what we want in the end is the sub-components, not the total than having the total filled in doesn't really matter too much. Unless you also want the totals for more than just checking the sub-components! It would be easy to add in but still another step
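That preemptive fill-in step would be a one-liner on a toy frame (column names here are illustrative, not the real table schema):

```python
import numpy as np
import pandas as pd

# Toy slice: the reported total on the last row is missing, but we have a
# calculated value for it, so backfill the reported column.
df = pd.DataFrame({
    "dollar_amount": [100.0, 50.0, np.nan],
    "calculated_amount": [np.nan, np.nan, 150.0],
    "row_type_xbrl": ["reported_value", "reported_value", "calculated_value"],
})

# Only the null reported value gets filled; matching reported values win.
df["dollar_amount"] = df["dollar_amount"].fillna(df["calculated_amount"])
print(df["dollar_amount"].tolist())  # [100.0, 50.0, 150.0]
```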

jrea-rmi commented 1 year ago

I don't think we need the totals for more than checking the sub-components...unless we would somehow use the totals to fill in missing subcomponents. But I'd rather think about that kind of data fix later, so good for now!

zaneselvans commented 1 year ago

@jrea-rmi I suspect that net_utility_operating_income and utility_operating_expenses are bad because they're composite calculations -- totals of subtotals -- so any errors in any of their components will create errors at the higher level of aggregation. This is something that'll happen across the board too so if we have lots of layers of calculations, I think we would expect the top layer to be the most error prone as it'll be affected by bad data from all the lower levels.

I'm suspicious about the bell-shaped curve in the errors, and am wondering if it might just be due to compositions of several simple underlying errors -- like if you get an off by 2x error mixed with a perfect match in an aggregation, maybe you end up with an off by 1.5x error in that aggregation.

We could get rid of a few more bad calculations with more generous interpretations of what's equal to 0 and 1, but I tried with generous tolerances on np.isclose() too and it didn't make a huge difference. I'm hopeful that we can get rid of more of the errors "for real" with some closer investigation of the individual calculations.
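For reference, the tolerance knobs in question look like this on toy numbers: a generous relative tolerance forgives small rounding errors but still (correctly) flags an off-by-2x record.

```python
import numpy as np

reported = np.array([1_000_000.0, 1_005_000.0, 2_000_000.0])
calculated = np.array([1_000_000.0, 1_000_000.0, 1_000_000.0])

# Defaults are rtol=1e-5, atol=1e-8 -- the 0.5% rounding error fails.
default = np.isclose(calculated, reported)
# A generous 1% relative tolerance forgives the rounding, but the
# off-by-2x record still fails:
generous = np.isclose(calculated, reported, rtol=1e-2)

print(default.tolist())   # [True, False, False]
print(generous.tolist())  # [True, True, False]
```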

I'll look at the diversity of utility types in here and see if that's a problem. I'm not sure what you mean about only seeing values from the top part of the income table. Can you say more?

jrea-rmi commented 1 year ago

@zaneselvans the pdf FERC Form 1 has the income statement in two parts. For Duke Energy Progress in 2021, I only see the first part, rows 1-26 included.

jrea-rmi commented 1 year ago

@zaneselvans I think net_utility_operating_income has got to have some kind of systematic error that can be fixed with a sign change or other modification. 19204 errors out of 19373 lines is because of a bigger (but likely simpler) issue than just errors in components.

zaneselvans commented 1 year ago

@jrea-rmi Where are you looking that you're only seeing the first part?

jrea-rmi commented 1 year ago

@zaneselvans I downloaded denorm_income_statement_ferc1 from datasette, filtered to utility_id_ferc1 = 160 and year = 2021

zaneselvans commented 1 year ago

Things to explore:

99% bad income_before_extraordinary_items calculations

The inter-page carryover in the income table appears to have resulted in a bad calculation somehow. The calculation for income_before_extraordinary_items is missing net_utility_operating_income which is the value carried over from one page to the next. This is a weird problem that I think only shows up in the income table, which is split. I suspect this is why almost none of the calculations for income_before_extraordinary_items come out correct:

image

However, adding net_utility_operating_income into the calculation does not magically fix things. Need to figure out what impact it does have.

Off by 2x Errors

I suspect that the off-by-2x errors are the result of many utilities reporting data by utility type (electricity, gas, other) and also having those values show up under utility_type=="total". Somehow we need to ensure that we only get a single copy of that data, but I'm not sure if we can universally remove the total rows and still get all the data. It seems like the page-2 values are reported only for the total utility, rather than being attributed to individual utility types, in which case only some of the calculations will apply to particular utility types.

It looks like the total utility type no longer exists as of 2021. Earlier years have electric, gas, other, other1, and total but 2021 only has electric, gas, and other. Seems like we need to manage this change in the aggregation / calculation such that the data has the same structure across all of the years, or such that appropriate calculations are applied to the two different sets of years.
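One way to sketch that cross-era handling on toy data (the real tables have more key columns and more utility_type values):

```python
import pandas as pd

df = pd.DataFrame({
    "report_year": [2020, 2020, 2020, 2021, 2021],
    "utility_type": ["electric", "gas", "total", "electric", "gas"],
    "dollar_amount": [70.0, 30.0, 100.0, 80.0, 35.0],
})

# Pre-2021 rows carry an explicit total; drop it so both eras share the
# same structure, then a total can be recomputed uniformly for all years.
harmonized = df[~((df.report_year < 2021) & (df.utility_type == "total"))]
totals = harmonized.groupby("report_year")["dollar_amount"].sum()
print(totals.tolist())  # [100.0, 115.0]
```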

Missing page-2 data

Jon says that Duke data from page 2 is missing in the 2021 data. That does kind of seem to be true, but there are also examples of every different value of income_type in the 2021 data, so at least in some cases page 2 data is clearly making it into the DB. Some values of income_type are very commonly reported, and others are quite rare, and this split appears across years, not just in 2021.

cmgosnell commented 1 year ago

hm zane i think you should at least start these calculations using the pk’s of the table minus the xbrl_factoid column. I did this by:

table_name="income_statement_ferc1"
xbrl_factoid = "income_type"
pks = (
    pudl.metadata.classes.Package.from_resource_ids()
    .get_resource(table_name)
    .schema.primary_key
)
pks_wo_factoid =[col for col in pks if col != xbrl_factoid]

and then use pks_wo_factoid as the keys to groupby (this is an edit of your snippet from your previous comment - the only thing I changed is the gb's by - i know you edited this locally so idk what you've got now but you get the picture):

for cv in calculated_values:
    calculated_income_dfs.append(
        pd.merge(
            income,
            calc_dfs[cv],
            left_on="income_type",
            right_on="name",
        )
        .assign(calculated_income=lambda x: x.income * x.weight)
        .groupby(pks_wo_factoid)
        .calculated_income.sum()
        .to_frame()
        .assign(income_type=cv)
        .reset_index()
    )

i think it is highly likely that these 2x errors are just utility_type=="total" + utility_type=="electric" being added together.

Having a total utility_type or another kind of total label is a common pattern across these tables. we'll need to come up with a standard process to identify them + associate them with their intra-table sub-components + check sums + remove the total rows for the final exploded table. I think we were very consistent with naming these rows total across the tables so these honestly shouldn't be too hard to id.
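That identify/check-sums/drop process might look roughly like this on a toy table (real code would key off the standardized total labels for each table):

```python
import pandas as pd

df = pd.DataFrame({
    "utility_id_ferc1": [1, 1, 1, 1],
    "report_year": [2020] * 4,
    "utility_type": ["electric", "gas", "other", "total"],
    "dollar_amount": [70.0, 20.0, 10.0, 100.0],
})

is_total = df["utility_type"] == "total"

# Check that each reported total matches the sum of its sub-components:
component_sums = (
    df[~is_total]
    .groupby(["utility_id_ferc1", "report_year"])["dollar_amount"]
    .sum()
    .rename("component_sum")
    .reset_index()
)
check = df[is_total].merge(component_sums, on=["utility_id_ferc1", "report_year"])
assert (check["dollar_amount"] == check["component_sum"]).all()

# Then drop the total rows for the final exploded table:
exploded = df[~is_total]
```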

zaneselvans commented 1 year ago

There's a complication here in that not all of the values being reported show up under total as well as the other values of utility_type. In the DBF data the numbers are broken out by utility_type only on the first page of the income statement form (up through the net_utility_operating_income field) and then only totals are reported for the 2nd page. In the XBRL data there's no total utility_type though, so are all of the values broken down?

It does seem like this has to be where the 2x errors are coming from, especially since they only show up in the DBF data.

cmgosnell commented 1 year ago

> There's a complication here in that not all of the values being reported show up under total as well as the other values of utility_type.

I think we should first check to see that the calcs calculate within each utility_type. If they do (or we can reconcile the problems w/ this 2-page issue), then we can check the totals. If the components add up to the total then we effectively wanna drop the totals. Or as jon suggested for the 🧼 super clean deprish table, we could fill in the null totals with the sum of sub-components. But that feels secondary to the goal here.

> In the XBRL data there's no total utility_type though, so are all of the values broken down?

Yes!

zaneselvans commented 1 year ago

Stuff to look at

Generic calculation application code

For @cmgosnell. This probably isn't quite cut-and-paste as I've tried to remove references to the income table specific bits, but haven't tested it...

def get_calculation(
    fact_name: str,
    table_name: str,
    xbrl_meta: list[dict[str, Any]],
    period_type: Literal["instant", "duration"],
) -> pd.DataFrame:
    table_meta = xbrl_meta[table_name][period_type]
    # Note that the version of the calculations that is stored in our transformed metadata is not
    # technically valid JSON because it uses single-quotes. We should fix that in the metadata
    # transform with something like metadata_str.replace("'", '"') so you can use JSON tooling on it.
    return pd.json_normalize([x for x in table_meta if x["name"] == fact_name][0]["calculations"])

table_name = "income_statement_ferc1"
xbrl_factoid_name = "income_type"

table_df = pd.read_sql(table_name, pudl_engine)
calculated_values = list(table_df[table_df.row_type_xbrl == "calculated_value"][xbrl_factoid_name].unique())

calc_dfs = {}
for cv in calculated_values:
    calc_dfs[cv] = get_calculation(
        fact_name=cv,
        table_name=table_name,
        xbrl_meta=xbrl_meta,
        period_type="duration",
    )

# Primary key columns of the table, minus the factoid column:
pks = (
    pudl.metadata.classes.Package.from_resource_ids()
    .get_resource(table_name)
    .schema.primary_key
)
pks_wo_factoid = [col for col in pks if col != xbrl_factoid_name]

calculated_dfs = []
for cv in calculated_values:
    df = (
        pd.merge(
            table_df,
            calc_dfs[cv],
            left_on=xbrl_factoid_name,
            right_on="name",
        )
        .assign(calculated_dollar_amount=lambda x: x.dollar_amount * x.weight)
        .groupby(pks_wo_factoid)
        .calculated_dollar_amount.sum()
        .to_frame()
        .reset_index()
    )
    df[xbrl_factoid_name] = cv
    calculated_dfs.append(df)

calculated_df = pd.merge(
    table_df,
    pd.concat(calculated_dfs),
    on=["report_year", "utility_id_ferc1", xbrl_factoid_name],
    how="left",
)

calculated_df["abs_diff"] = abs(calculated_df.dollar_amount - calculated_df.calculated_dollar_amount)
calculated_df["rel_diff"] = abs(calculated_df.abs_diff / calculated_df.dollar_amount)
jrea-rmi commented 1 year ago

@zaneselvans yes, the second half of the income statement doesn't differentiate utility type so it's all total, and that makes it seem like the total values from the first half of the income statement are necessary to include.

You're trying to avoid redundant information when a utility only has one utility_type?

Is it possible that records are being dropped, or you think they're nonexistent in the raw xbrl data?

zaneselvans commented 1 year ago

I'm trying to understand how we need to apply the XBRL calculations such that they reproduce the reported calculated values. Adding up all the utility types on page one (including total) results in a duplication of the dollars, so the number comes out wrong. We either need to just add up the non-total values, or use the reported total value in the hand-off from page 1 to page 2. But within each of the utility types on page 1, we should still expect the calculations to add up to the reported value.

In the XBRL data (2021) it looks like all of the reported values that previously showed up on either page 1 or page 2 are differentiated between specific utility types and utility_type==total no longer exists. It also seems like the values that previously showed up only under utility_type=total on page 2 are relatively rare. They do all show up somewhere across the XBRL data, but not with anything close to the same frequency as the values that were previously on page 1. It just seems like Duke didn't report any of those values in 2021. But I'm going to try labeling all the rows as "page 1" or "page 2" just for clarity to make checking easier.

zaneselvans commented 1 year ago

Okay, after some more futzing we're in much better shape with the income_statement_ferc1 table.

PR #2565 fixes 2 issues in the transform for the table. First, the other_total utility_type was getting categorized as other when it should have been total, and second, the line on the boundary between the two pages in the 1994-2020 data was being reported twice (since it carried over from page 1 to page 2).

There are also two apparent errors in the calculations from the XBRL metadata.

We should figure out a clean and simple way to fix calculation errors in the metadata transformations (and tell FERC?). What I did was:

calc_dfs["income_before_extraordinary_items"] = pd.DataFrame([
    ("net_other_income_and_deductions", 1.0),
    ("net_interest_charges", -1.0),
    ("net_utility_operating_income", 1.0),
], columns=["name", "weight"])

calc_dfs["taxes_on_other_income_and_deductions"].loc[
    calc_dfs["taxes_on_other_income_and_deductions"]["name"] == "investment_tax_credits",
    "weight",
] = -1.0

Having made those fixes, there are about 1000 calculations that are off by more than 1%, out of 58,000 total calculations:

image

A steady ~20% of all the calculations of other_income_deductions are wrong, from 1994-2003 (constituting about 5% of all calculated values in those years). Everything else is pretty much perfect. I suspect there's one more bug to fix that will eliminate this remaining class of errors.

As with Christina's table, there are a number of records with 0.0 or NA values in the reported "calculated" spot, which we could fill in using our calculated values, and then there are also a total of 8 reporting errors that don't fit any obvious pattern.

image

image

So it still seems to me like the overall calculation infrastructure is very good, but there will probably be a few kinks to work out with each table, either because the calculation metadata is imperfect, or because we discover bugs in the transforms, or because there are real data reporting issues, but thankfully so far those seem to be very limited!

cmgosnell commented 1 year ago

i've started investigating the off-ness in the electric_operating_expenses_ferc1 table.

Almost all of the badness is coming from what looks like double counting in three values: 'power_production_expenses_steam_power', 'transmission_operation_expense', and 'power_production_expenses_hydraulic_power'.

image
jrea-rmi commented 1 year ago

@zaneselvans I looked at the pre-2003 FERC Form 1. The difference for Other Income Deductions is that in old years, FERC accounts 426.1-426.5 were all included in one line and in new years, each account is listed on its own separate line.

The Total Other Income Deduction line is a simple sum of the other lines in that category, with weights all = 1; rows 41-43 in old years and 43-49 in new years. But the XBRL calc is different than that?

jrea-rmi commented 1 year ago

@cmgosnell I don't see any form changes over time relevant to 'power_production_expenses_steam_power' or 'power_production_expenses_hydraulic_power'. The transmission section of the form added detail in 2006.

All calcs in XBRL seem to be simple sums with weight = 1 for all lines. What could cause double counting?

zaneselvans commented 1 year ago

The problem is that since those 3 FERC Accounts were lumped together into a single row in the old years (1994-2003) the XBRL metadata doesn't know what to do with them. We named that row miscellaneous_deductions and we need to inject a calculation that integrates those values into the overall calculation in those old years.
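Injecting that could be as simple as appending a name/weight row to the calculation frame used for the old years. A sketch, with made-up component names standing in for the real account labels:

```python
import pandas as pd

# Hypothetical calculation frame, shaped like the get_calculation() output:
calc = pd.DataFrame(
    [("some_deduction_a", 1.0), ("some_deduction_b", 1.0)],
    columns=["name", "weight"],
)

# 1994-2003 lumped several accounts into one row we named
# miscellaneous_deductions; add it so the old-year totals calculate:
calc_pre_2004 = pd.concat(
    [calc, pd.DataFrame([("miscellaneous_deductions", 1.0)], columns=["name", "weight"])],
    ignore_index=True,
)
print(calc_pre_2004["name"].tolist())
# ['some_deduction_a', 'some_deduction_b', 'miscellaneous_deductions']
```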

cmgosnell commented 1 year ago

a potential end state table of the income statement table:

- report_year
- xbrl_factoid
- utility_id
- plant_type (null sometimes)
- utility_type (null sometimes)
- dollar_amount

This will stand alongside the information that defines the relationships between the xbrl_factoid's. So for instance, a record with an xbrl_factoid name of residential_sales will show that it is a part of the electric_operating_revenues which is originally part of the electric_operating_revenues_ferc1 table which is then linked to the highest level income statement table.

So the provenance for this field is something like this:

- income_statement_ferc1
  - electric_operating_revenues_ferc1
    - electric_operating_revenues
      - residential_sales

Another different example:

- income_statement_ferc1
  - electric_operating_expenses_ferc1
    - generation_operations_expense
      - steam_power_generation_operations_expense

The income statement table is partially broken out by plant_type (steam, hydro, other etc). Some of the fields in the other tables have some utility_type information embedded in them but we didn't melt or stack those tables in the transform process to make them have a plant_type column. Perhaps we should in the fullness of time. For now I think we could pretty easily fill in the plant_type by doing some simple string matching (if steam shows up in the factoid's provenance).

zaneselvans commented 1 year ago

Okay there are no more apparent systematic errors in the income_statement_ferc1 table. There are a few dozen calculations that are off by literally $1, and then 8 which are truly wrong and look like they may be due to a lag in folks updating the version of the form they were using around 2003 when some additional rows were added.

image

image

jrea-rmi commented 1 year ago

amazing. Definitely as good as we can expect out of using FERC data!!

jrea-rmi commented 1 year ago

a potential end state table of the income statement table:

  • report_year
  • xbrl_factoid
  • utility_id
  • plant_type ( null sometimes )
  • utility_type ( null sometimes )
  • dollar_amount

This will stand alongside the information that defines the relationships between the xbrl_factoids. So for instance, a record with an xbrl_factoid name of residential_sales will show that it is part of electric_operating_revenues, which originally comes from the electric_operating_revenues_ferc1 table, which is in turn linked to the highest-level income statement table.

So the provenance for this field is something like this:

  • income_statement_ferc1

    • electric_operating_revenues_ferc1

    • electric_operating_revenues

      • residential_sales

Another different example:

  • income_statement_ferc1

    • electric_operating_expenses_ferc1

    • generation_operations_expense

      • steam_power_generation_operations_expense

The income statement table is partially broken out by plant_type (steam, hydro, other, etc.). Some of the fields in the other tables have some utility_type information embedded in them, but we didn't melt or stack those tables in the transform process to give them a plant_type column. Perhaps we should in the fullness of time. For now I think we could pretty easily fill in plant_type by doing some simple string matching (e.g. if "steam" shows up in the factoid's provenance).

Are you saying that end state is the list of columns in the final income_statement_ferc1?

Why would utility_type be null sometimes? You would use null rather than "total" for the lines that correspond to all utility operations?

One use case would be to consider total operation and total maintenance expense for each plant_type for a specific utility. To do that, I'd take the income_statement_ferc1, add a column for the level of "generation_operations_expense" via the xbrl_factoid relationship, then groupby year/utility id/plant type/generation_operations_expense. Does that sound right?
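
That groupby would look roughly like this, using a toy stand-in for the proposed end-state layout (column names follow the list above but are otherwise illustrative):

```python
import pandas as pd

# Toy exploded income statement: one row per (year, utility, plant type, factoid).
income = pd.DataFrame({
    "report_year": [2020, 2020, 2020, 2020],
    "utility_id": [1, 1, 1, 1],
    "plant_type": ["steam", "steam", "hydro", "hydro"],
    "expense_group": ["operations", "operations", "operations", "maintenance"],
    "dollar_amount": [100.0, 25.0, 60.0, 20.0],
})

# Total operations/maintenance expense per plant type for each utility-year.
totals = income.groupby(
    ["report_year", "utility_id", "plant_type", "expense_group"]
)["dollar_amount"].sum()
```

The `expense_group` column here plays the role of the level derived from the xbrl_factoid relationship.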

cmgosnell commented 1 year ago

hey @jrea-rmi ! we ran into a minor but still real obstacle and want to check in about it.

In the electric_operating_revenues_ferc1 table there is the new fun quirk of having components of calculations that only exist in another table (in this case the electricity_sales_by_rate_schedule_ferc1 table). Many of these calculated fields are easy because all sub-components show up in the rate schedule table (i.e. the calculated field is composed of {field_name}_billed + {field_name}_unbilled). BUT there is one fun calculated field where one sub-component shows up in another table. The calculated field is called sales_to_ultimate_consumers and the sub-component is commercial_and_industrial_sales.

I'm sure you can see that this poses a problem for the eventual exploded table because if we dropped the sales_to_ultimate_consumers the sub-components would be wrong and if we kept the sales_to_ultimate_consumers there would be duplication. Alternatively, we could drop all of the sub-components of sales_to_ultimate_consumers and have no duplication but also drop this level of detail. @e-belfer is doing some spot checking to see if we could use this rate schedule table to complete the calculation. But in general it seems like we may need to add electricity_sales_by_rate_schedule_ferc1 into the mix for at least this calculation in order to preserve the detail. Also this seems to only be a problem for 2021. It looks like commercial_and_industrial_sales was in the dbf table.

jrea-rmi commented 1 year ago

@cmgosnell I think there is an error in the XBRL calc for large_or_industrial_sales. It should be adding the unbilled + billed portions from the electricity_sales_by_rate_schedule_ferc1 table but instead includes MWh sold from the sales_to_ultimate_consumers table.

image

small_or_commercial_sales has the correct parallel calculation:

image

e-belfer commented 1 year ago

@e-belfer is doing some spot checking to see if we could use this rate schedule table to complete the calculation.

22 of the 133 (~16%) utilities with inconsistent sales_to_ultimate_consumers values have total billing values in the rate schedule table which exactly explain the difference. See below for the code I used.

import pandas as pd
import numpy as np
import sqlalchemy as sa

pudl_engine = sa.create_engine(pudl_settings["pudl_db"])
rate_sched = pd.read_sql("electricity_sales_by_rate_schedule_ferc1", pudl_engine)

# off_worst_index, calced_df, xbrl_factoid_name, worst_calc_components, and
# pks_wo_factoid are all defined earlier in the notebook this snippet came from.
sub = rate_sched.set_index(["utility_id_ferc1", "report_year"])
sub = sub.loc[off_worst_index]
comm_ind = sub.loc[
    (sub.rate_schedule_type == "commercial_and_industrial")
    & (sub.rate_schedule_description == "total"),
    "sales_revenue",
]

offenders = (
    calced_df.loc[calced_df[xbrl_factoid_name].isin(worst_calc_components)]
    .sort_values(xbrl_factoid_name)
    .set_index(pks_wo_factoid)
    .sort_index()
    .loc[off_worst_index]
)
ult_con = offenders.loc[offenders.revenue_type == "sales_to_ultimate_consumers"]

diff = ult_con.merge(comm_ind, left_index=True, right_index=True, how="left")
diff["difference"] = diff["abs_diff"] - diff["sales_revenue"]
diff.loc[np.isclose(diff.difference, 0)].reset_index(drop=False).drop_duplicates(
    ["utility_id_ferc1", "abs_diff", "report_year"]
)
jrea-rmi commented 1 year ago

Both the dbf and xbrl forms of electric_operating_revenues_ferc1 include the values of the small_or_commercial_sales and large_or_industrial_sales. Is an alternative option to use the value reported there rather than going to electricity_sales_by_rate_schedule_ferc1 ?

e-belfer commented 1 year ago

For most rows yes, this works as expected. These are also the two columns that appear in the metadata that describes the calculation. However, for all the problematic 2021 records both small_or_commercial_sales and large_or_industrial_sales appear to be NA. In many of these cases, the exact matching difference can be found in the rate schedule table (either in the commercial_and_industrial column, or in a few instances by combining the commercial and industrial columns manually).

jrea-rmi commented 1 year ago

that's annoying, the 2021 records are often NA?

then maybe the solution of fixing the xbrl calculation to use billed + unbilled fields from electricity_sales_by_rate_schedule_ferc1 instead of the incorrect mwh sales for large_or_industrial_sales is best?

Will also note that the revenue detail of the income statement is least important for us short term, relative to balance sheet breakdown as most valuable and income statement expenses breakdown as second most valuable.

e-belfer commented 1 year ago

By far the largest source of errors with this table is the sales_to_ultimate_consumers column - luckily, the other columns have calculations that are essentially correct. Of the 133 utilities that have errors with this column, all report errors in the year 2021 and all have NA values for both of the two subcomponents of the sales_to_ultimate_consumers we're discussing: large_or_industrial_sales and small_or_commercial_sales. This seems to be at least one of the primary sources of the error, and I agree that getting the billed & unbilled fields from the electricity_sales_by_rate_schedule_ferc1 and filling in missing components should at least make a substantial improvement in the correctness of this table.

@cmgosnell Let's touch base about what makes sense as a next step here?
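
In toy form, the proposed fix might look like this: when a sub-component is missing in electric_operating_revenues_ferc1, fill it from the billed + unbilled totals in electricity_sales_by_rate_schedule_ferc1. The column names here are simplified stand-ins for the real ones:

```python
import pandas as pd

# Operating revenues with a missing small_or_commercial_sales value in 2021.
op_rev = pd.DataFrame({
    "utility_id_ferc1": [1, 2],
    "report_year": [2021, 2021],
    "small_or_commercial_sales": [None, 50.0],
})

# Rate schedule totals for the same utility-year.
rate_sched = pd.DataFrame({
    "utility_id_ferc1": [1],
    "report_year": [2021],
    "billed": [30.0],
    "unbilled": [5.0],
})
rate_sched["rate_sched_total"] = rate_sched["billed"] + rate_sched["unbilled"]

# Merge on the utility-year key and only fill where the original value is NA.
filled = op_rev.merge(
    rate_sched[["utility_id_ferc1", "report_year", "rate_sched_total"]],
    on=["utility_id_ferc1", "report_year"],
    how="left",
)
filled["small_or_commercial_sales"] = filled["small_or_commercial_sales"].fillna(
    filled["rate_sched_total"]
)
```

Using `fillna` rather than overwriting keeps the values the utilities actually reported wherever they exist.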

e-belfer commented 1 year ago

Updates on utility_plant_summary_ferc1:

image

One large remaining source of error appears to have to do with wrong-signed (negative) values in the depreciation column for the field depreciation_utility_plant_in_service (see histogram below). These values often exactly sum to the difference between the calculated and reported fields. I'm hesitant to apply a blanket fix here as depreciation can technically be negative - thoughts?

image

jrea-rmi commented 1 year ago

0.3% error is really good!

Which duplicated set of components for utility_plant_net did you remove? I see FERC taxonomy has two calculations for that field, one referencing balance sheet values and one referencing utility plant summary values. Makes sense to me to keep the one that stays within the utility plant summary table.

depreciation_amortization_and_depletion_utility_plant_in_service does seem like it's the sum of 4 other fields that are also included in the calculation of accumulated_provision_for_depreciation_amortization_and_depletion_of_plant_utility. That'd be double-counting, so it makes sense to me to remove it from the calculation.

Seems like the sign of depreciation_utility_plant_in_service doesn't need to be switched for all entries. Do you have an example of a specific utility/year where the difference between calculated & reported fields is equal to depreciation_utility_plant_in_service?

e-belfer commented 1 year ago

Which duplicated set of components for utility_plant_net did you remove? I see FERC taxonomy has two calculations for that field, one referencing balance sheet values and one referencing utility plant summary values. Makes sense to me to keep the one that stays within the utility plant summary table.

The calculation within the metadata we processed for the FERC table appears as one equation with both components duplicated (A - B + A - B), so I've actually just removed the duplication and kept the values that stay within the utility plant summary table.

Seems like the sign of depreciation_utility_plant_in_service doesn't need to be switched for all entries. Do you have an example of a specific utility/year where the difference between calculated & reported fields is equal to depreciation_utility_plant_in_service ?

This isn't a particularly common occurrence. It's most visible in utility 211 (The Dayton Power and Light Company), where the calculated value is 2x off from the reported value because of sign flips. There are 29 records that match this pattern, for 4 utilities.

image
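
The pattern described above can be detected mechanically: if a component was reported with a flipped sign, the calculated total is off from the reported one by exactly twice that component's value. A toy sketch (column names illustrative):

```python
import numpy as np
import pandas as pd

# Reported totals vs. totals recalculated from components; one row has a
# depreciation value whose sign was flipped by the filer.
df = pd.DataFrame({
    "reported": [100.0, 80.0],
    "calculated": [60.0, 80.0],
    "depreciation": [-20.0, 10.0],
})

# Flipping depreciation's sign would move the calculated total by 2*|dep|,
# so the residual equals -2 * depreciation exactly when a flip explains it.
sign_flip = np.isclose(df["reported"] - df["calculated"], -2 * df["depreciation"])
df.loc[sign_flip, "depreciation"] *= -1
```

Restricting the fix to records that satisfy this exact-residual test avoids touching legitimately negative depreciation values.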

jrea-rmi commented 1 year ago

That's clear, makes a lot of sense, love it.

Is PUDL comfortable flipping the sign for those 29 records, since it's a clear error?

e-belfer commented 1 year ago

Yes, we'll definitely flip the clear-cut sign cases in the transform process.

e-belfer commented 1 year ago

Notes on the remaining tables and their quirks:

electric_operating_revenues_ferc1

plant_in_service_ferc1

(Below, calc_bal is the sum of the columns in the table, calculated_dollar_amount is the calculated value for ending_balance, and all rows with an xbrl_factoid_name of NaN refer to production_plant or other_production_plant.) image

zaneselvans commented 1 year ago

Hmm. The all-NA small/commercial and large/industrial revenues seem suspicious. Looking at the raw FERC 1 data, they seem to mostly have real values:

import pandas as pd

# ferc1_engine is a SQLAlchemy engine connected to the cloned FERC Form 1 DB.
op_rev = pd.read_sql("f1_elctrc_oper_rev", ferc1_engine)
cols = ["respondent_id", "report_year", "row_number", "rev_amt_crnt_yr", "rev_amt_prev_yr"]
op_rev[op_rev.row_number.isin([4, 5])][cols].sample(20)
respondent_id report_year row_number rev_amt_crnt_yr rev_amt_prev_yr
173428 177 2017 4 2.31287e+08 nan
84596 22 2008 5 4.79067e+07 686311
9397 31 1997 4 3.81368e+08 3.83621e+08
122797 41 2011 5 7.76402e+08 7.44984e+08
125657 99 2012 4 5.50053e+07 nan
122497 117 2011 4 1.24929e+08 nan
79276 152 2007 4 7.49067e+07 6.31548e+07
30084 98 2001 4 6.77594e+07 6.4705e+07
67057 61 2006 4 3.75824e+07 3.73708e+07
46274 57 2004 5 5.50443e+08 nan
209070 41 2020 4 1.43282e+09 1.52185e+09
175068 99 2017 4 1.37866e+08 nan
186406 522 2018 5 3.86355e+06 nan
109840 57 2010 4 2.30548e+09 2.14278e+09
192003 73 2019 5 1.3964e+08 nan
68131 32 2006 4 1.66091e+09 nan
39493 137 2003 4 9.84466e+07 9.7564e+07
60984 126 2005 5 4.45426e+08 4.31777e+08
73979 88 2007 4 5.14513e+07 nan
160316 161 2015 4 6.18223e+09 6.23493e+09

Similarly, there seem to be lots of values in the XBRL data:

import pandas as pd

# ferc1_xbrl_engine points at the SQLite DB derived from the raw FERC 1 XBRL filings.
op_rev_xbrl = pd.read_sql("electric_operating_revenues_300_duration", ferc1_xbrl_engine)
cols = [
    "start_date",
    "end_date",
    "entity_id",
    "small_or_commercial_sales_electric_operating_revenue",
    "large_or_industrial_sales_electric_operating_revenue",
]
op_rev_xbrl[cols].sample(20)
start_date end_date entity_id small_or_commercial_sales_electric_operating_revenue large_or_industrial_sales_electric_operating_revenue
404 2021-01-01 2021-12-31 C005443 1.86814e+07 1.05704e+08
136 2020-01-01 2020-12-31 C000525 nan nan
387 2020-01-01 2020-12-31 C004872 nan nan
482 2020-01-01 2020-12-31 C000500 3.51837e+08 2.18917e+08
461 2021-01-01 2021-12-31 C000691 2.55137e+08 3.87907e+08
285 2021-01-01 2021-06-30 C000029 nan nan
38 2020-01-01 2020-12-31 C007667 1.73938e+08 2.35152e+07
422 2020-01-01 2020-12-31 C000415 1.73855e+09 5.91771e+08
466 2021-01-01 2021-12-31 C000905 1.36827e+08 1.11768e+08
40 2021-01-01 2021-06-30 C007667 nan nan
408 2020-01-01 2020-12-31 C005475 nan nan
32 2020-01-01 2020-12-31 C000534 3.7136e+07 4.5495e+07
153 2020-01-01 2020-12-31 C000135 1.20732e+09 6.47181e+08
384 2021-01-01 2021-12-31 C001466 1.9328e+08 1.85288e+08
239 2021-01-01 2021-12-31 C001288 1.009e+08 1.68781e+08
274 2021-01-01 2021-12-31 C000315 1.31456e+08 6.43504e+07
411 2021-01-01 2021-12-31 C001654 nan nan
124 2020-01-01 2020-12-31 C000290 2.18603e+09 1.1357e+09
389 2021-01-01 2021-12-31 C001183 nan nan
46 2021-01-01 2021-12-31 C003184 nan nan

We should probably go look at the transform for this table and see if we're accidentally nulling out these columns somehow.
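
One quick way to localize where values get nulled out is to compare the fraction of NA values in the suspect columns before and after each transform step. A toy sketch:

```python
import pandas as pd

def na_fraction(df, cols):
    """Fraction of missing values per column."""
    return df[cols].isna().mean()

# Stand-ins for the same column at two points in the pipeline.
raw = pd.DataFrame({"sales": [1.0, 2.0, None, 4.0]})
transformed = pd.DataFrame({"sales": [None, None, None, 4.0]})
```

A jump in the NA fraction between two steps points at the step doing the nulling.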

Seems like the Plant In service issues are real data issues unrelated to the XBRL metadata.

e-belfer commented 1 year ago

@cmgosnell One consistent issue: in general, where there is more than one calculation provided in the JSON for a single field, our current code mashes them together. I've pulled out some of these duplications by removing any duplicated components, but this creates issues that aren't yet solved when the components are different.

image becomes image
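
Pulling out the exact-duplicate case can be as simple as deduplicating on (name, weight); this sketch uses an illustrative component structure, and, as noted, only handles the case where the same pair appears twice, not two genuinely different calculations:

```python
# Two taxonomy calculations mashed together, effectively A - B + A - B.
calc = [
    {"name": "utility_plant", "weight": 1.0},
    {"name": "accumulated_depreciation", "weight": -1.0},
    {"name": "utility_plant", "weight": 1.0},
    {"name": "accumulated_depreciation", "weight": -1.0},
]

# Keep the first occurrence of each (name, weight) pair, preserving order.
seen = set()
deduped = []
for component in calc:
    key = (component["name"], component["weight"])
    if key not in seen:
        seen.add(key)
        deduped.append(component)
```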

zaneselvans commented 1 year ago

Is it possible that these two calculations should really represent the same quantity, but broken down in different ways? It looks like the nuclear fuel assemblies are lumped together in Schedule 202 but broken out into in-reactor vs. out-of-reactor assemblies in Schedule 110. Maybe there's another nested calculation for nuclear fuel materials and assemblies that refers to both the in- and out-of-reactor assemblies, so if we exploded that value the two calculations would be the same?

e-belfer commented 1 year ago

balance_sheet_assets_ferc1 As noted above, I removed components from Schedule 202 calculations from nuclear_fuel_net

I also found our first violations of the "all 2021 calculations extend backwards in time" assumption. All the errors in nuclear_fuel_net occur prior to 2004, which reflects a change in the calculation used to derive this variable. Prior to 2004, nuclear_fuel_net is calculated as the difference between these two variables (as seen here in our dbf_to_xbrl.csv): image image Indeed, adding this component into the calculation fixes the issue for these older calculations (~600 records fixed).

Same with special_funds_all, which is replaced by other_special_funds and special_funds post 2004 (~1200 records fixed).

From over 70% of records failing initially, implementing these fixes brings us down to 252 records (1.43% of calculated fields) that do not calculate correctly. 73 of these have a >5% error.

balance_sheet_liabilities_ferc1 Added accumulated_deferred_income_taxes into the deferred_credits calculation (another pre-2004 component). This results in 76 (0.56%) incorrect calculations remaining.
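
Structurally, adding a DBF-era-only component directly into a calculation looks something like this; the dict layout and names here are illustrative, not PUDL's actual metadata format:

```python
# XBRL-derived calculation as it stands (illustrative structure).
calculations = {
    "deferred_credits": [
        {"name": "other_deferred_credits", "weight": 1.0},
    ],
}

# accumulated_deferred_income_taxes only exists pre-2004 in the DBF data, so
# appending it makes the older records reconcile without affecting XBRL years,
# where the factoid is simply absent.
calculations["deferred_credits"].append(
    {"name": "accumulated_deferred_income_taxes", "weight": 1.0}
)
```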

zaneselvans commented 1 year ago

Ah, I think we already had to fix a simpler version of one of these -- where the way the values are broken down changed over the years -- by creating a new xbrl_factoid to associate with the old DBF-only values and then incorporating that factoid into the calculation definition. Not sure which table it was though.

e-belfer commented 1 year ago

In this instance, since the factoid already is defined in dbf_to_xbrl.csv and only exists pre-2004, is there any harm in just adding it directly into the calculation? That's what I went ahead and did.

zaneselvans commented 1 year ago

That seems totally reasonable.

jrea-rmi commented 1 year ago

@cmgosnell highest priority phase 2 tasks from my perspective:

#2606, #2622, and #2623 seem required for #2624 (creating an exploded table), then #2625 for adding the labels that we'd then use.

then #2596 and #2600 can be done later on