(ferc 💥 metadata) create xbrl_factoid metadata containing tree-like provenance

cmgosnell commented 1 year ago

Investigate storage/translation of the tree-like nature of these nested calculations/relationships. (The simplest current solution is to store the relationships between tables/fields in an arbitrarily nested dictionary. We'd use that dict to take a table or field and convert it into a table with all of its sub-components, in three steps: identify the calculated values, recalculate them from their sub-components for validation, then replace them with their sub-components.)

for each xbrl_factoid:

is it a part of a calculated field? if so which field & source table
is that calculated field a part of a calculated field? if so which field & source table
etc etc and on down.

Design Notes

The basic unit is a "calculation" which is a list of JSON objects, each of which contains a "name" (string) and a "weight" (number) field.
Each "name" can refer either to a directly reported value or to a calculated value, in which case that (name+weight) can be expanded into a list of other name+weight fields.
This process can be done recursively until you have a list of only directly reported field names and their weights.
The same field can potentially show up with different weights in different calculations, so we need to make sure we're looking up the correct calculation when expanding.
Knowing which calculation is the "correct" calculation means knowing which table the field is being read from / provided by, so we also need the source table for the field?
The place to look up these calculation specifications is the processed tabular metadata for each of the referenced tables.
We may encounter calculated fields from tables that we haven't yet processed, which means we won't be able to look up the associated calculation -- or even if we could look it up, it might be wrong, since many of them have missing or extra elements or sign errors.
This data structure does not need to worry about fixing errors in the calculation metadata, or the application of spot fixes to the data that's being used to do the calculations. It should assume that the specified calculations are correct, and the data is correct -- those issues will be dealt with elsewhere.
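The recursive expansion described in these notes could be sketched roughly as below. This is only an illustration under stated assumptions: the `CALCS` lookup keyed by a `(source_table, name)` tuple is a hypothetical stand-in for the processed tabular metadata, and the table names, factoid names and weights are made up.

```python
# Hypothetical lookup from (source_table, name) to a calculation. A calculation
# is a list of components; directly reported values have no entry here.
CALCS = {
    ("table_a", "net_plant"): [
        {"name": "gross_plant", "weight": 1.0, "source_table": "table_a"},
        {"name": "depreciation", "weight": -1.0, "source_table": "table_a"},
    ],
    ("table_a", "gross_plant"): [
        {"name": "in_service", "weight": 1.0, "source_table": "table_b"},
        {"name": "leased", "weight": 1.0, "source_table": "table_b"},
    ],
}


def expand(source_table, name, weight=1.0, calcs=CALCS):
    """Recursively expand a component into directly reported leaves.

    Returns a list of (source_table, name, weight) tuples, where each leaf
    weight is the product of the weights along the path from the root.
    """
    components = calcs.get((source_table, name))
    if not components:  # Directly reported value, or no known calculation.
        return [(source_table, name, weight)]
    leaves = []
    for comp in components:
        leaves.extend(
            expand(comp["source_table"], comp["name"], weight * comp["weight"], calcs)
        )
    return leaves


leaves = expand("table_a", "net_plant")
```

Keying the lookup on the source table and the name together is what keeps the expansion from picking up the wrong calculation when the same name shows up with different weights in different tables.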

Questions

What class will be responsible for applying a specified calculation, given a dataframe?
What is the required structure of the input dataframe to which the calculation should be applied?
Do the calculations actually need to be applied to a dataframe? Or is the application of the calculations only used for verifying that the calculations / reported values are correct?
What are some of the methods the calculation / calculation tree needs to have?
What does it mean that source_table is a list of table names, rather than a single table name?
Is there any reason not to make all source tables explicit, so that a calculation object can stand on its own?

Example calculation (JSON):

```json
[
    {"name": "utility_plant", "weight": 1.0, "source_table": ["balance_sheet_assets_ferc1"]},
    {"name": "utility_plant_in_service_classified_and_unclassified", "weight": 1.0},
    {"name": "utility_plant_leased_to_others", "weight": 1.0},
    {"name": "utility_plant_held_for_future_use", "weight": 1.0},
    {"name": "construction_work_in_progress", "weight": 1.0},
    {"name": "utility_plant_acquisition_adjustment", "weight": 1.0},
    {"name": "utility_plant_and_construction_work_in_progress_correction", "weight": 1.0}
]
```

Classes / Methods

FercXbrlCalculation: a data class built using Pydantic.

Notes from Call w/ CG

Given the income statement table, what are all of the underlying tables / fields and how do they relate to each other?
"Give me a list of table names that you want to explode and return a nested structure that describes the relationships between the facts contained in those tables."
"For any given fact, tell me all the larger facts that it's part of, back up the tree."
Need to look at all parents for a fact, and then be able to look up tags or other metadata associated with each of those parents, and associate them with the child node.
There's also a column that says whether the referenced calculation is an inter- or intra-table calculation.
Need to know both what table the calculation is "stored" within, and what table the individual calculation components are coming from, and also what XBRL fact is being calculated.
Could be useful to take the list of related tables, grab all their metadata dataframes, concatenate them together (adding a table name) and use that as the source.
References to facts in unprocessed tables will have their ORIGINAL name, not the renamed name, because the renames are based on our transformer info.
We may need to also keep the original names around in the calculations, so that we can tell if we're referring to outside data. Could just keep "original_name" in there across the board so we always have it.
If we change the calculation definitions remember we need to add those into the calculation fixes or change how it identifies which components need to be modified by a fix.
The balance sheet tree has more inter-table references that are already present (originally reported), so start with that one.
If a single reported granular fact appears in more than one table, do we need to have information about both of the places it is reported? Or is one of them the correct one? Right now all examples of this that we have are a single value being reported in one table, and passed up to another table as a single value (kind of like a degenerate calculation with only a single value?)
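For the "back up the tree" lookup in these notes, one possible sketch is to invert the calculations into a child-to-parents map and walk it. Everything named here (`CALCS`, `TAGS`, the table and factoid names) is a hypothetical placeholder, not an existing PUDL structure.

```python
from collections import defaultdict

# Hypothetical calculations: parent node -> list of child nodes.
# Nodes are (table_name, xbrl_factoid) tuples.
CALCS = {
    ("balance", "total_assets"): [("balance", "net_plant"), ("balance", "cash")],
    ("balance", "net_plant"): [("summary", "in_service"), ("summary", "leased")],
}
# Hypothetical tags attached to some of the parent nodes.
TAGS = {("balance", "net_plant"): {"in_rate_base": True}}

# Invert parent -> children into child -> parents.
parents = defaultdict(set)
for parent, children in CALCS.items():
    for child in children:
        parents[child].add(parent)


def ancestors(node):
    """Return every larger fact that `node` feeds into, directly or indirectly."""
    found = set()
    stack = [node]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def inherited_tags(node):
    """Merge the tags of all ancestors onto a child node."""
    merged = {}
    for ancestor in ancestors(node):
        merged.update(TAGS.get(ancestor, {}))
    return merged
```

If two ancestors carried conflicting values for the same tag, this sketch would silently keep one of them, so a real version would need an explicit rule for that case.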

Tasks

[x] Add correction factoids to processed metadata
[x] Enable recursive resolution of nested calculation components
[x] Enable addition of new metadata columns based on each record's calculation tree
[x] Add name_original to calculation components in process_xbrl_metadata()
[x] Build CalculationForest based on list of facts and identify disconnected trees
[x] Recalculate weights of leaf nodes as product of weights of all parent nodes
[x] Given a CalculationForest, generate a dataframe indexed by leaf-facts containing leaf-metadata
[x] For each root node in a CalculationForest generate an XBRL style calculation for validation
[x] Inject missing special_funds_all & nuclear_fuel metadata inside of process_xbrl_metadata()
[x] Fix non-tree structure found in the balance_sheet_assets_ferc1 calculation metadata.
[x] Add table specific seed nodes to explosion assets
[x] Apply metadata tags in exploded asset factory
[x] Drop extraneous columns from exploded data output

zaneselvans commented 1 year ago

The raw material we're working with is the metadata dataframes that come out of the table transformers, especially the calculations column.

Steps to get the output that we want in the end:

Build the nested lineage data structure based on calculation columns, given a set of exploded tables.
Merge lineage based tags into the exploded metadata table.
Compile tagging data (relating XBRL factoids/lineage to other categories).
Merge the exploded metadata table w/ tags onto the full dataframe.

Start with balance_sheet_assets_ferc1 table because it has a lot of calculations, both inter and intra-table. Whether a calculation is within or between tables can be derived entirely from information contained in the calculation.

Calculation exists independent of any table or factoid -- doesn't say what factoid it's calculating or what table that calculation is within. In the exploded metadata table we'll have both source_table and xbrl_factoid in their own columns
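As a rough illustration of deriving the inter/intra-table distinction from the calculation itself, assuming the table that holds the calculation is known (e.g. from the source_table column of the exploded metadata): the helper name `is_intra_table` is hypothetical, and the two calculations below are abbreviated versions of ones that appear elsewhere in this thread.

```python
def is_intra_table(calc_table, components):
    """True if every component is read only from the calculation's own table."""
    return all(set(comp["source_tables"]) == {calc_table} for comp in components)


BSA = "balance_sheet_assets_ferc1"
UPS = "utility_plant_summary_ferc1"

# Abbreviated: all components come from the balance sheet assets table.
utility_plant_net = [
    {"name": "utility_plant_and_construction_work_in_progress", "weight": 1.0, "source_tables": [BSA]},
    {"name": "utility_plant_net_correction", "weight": 1.0, "source_tables": [BSA]},
]
# Abbreviated: one component comes from the utility plant summary table.
utility_plant_and_cwip = [
    {"name": "utility_plant", "weight": 1.0, "source_tables": [BSA]},
    {"name": "utility_plant_leased_to_others", "weight": 1.0, "source_tables": [UPS]},
]
```

Note that an empty calculation counts as intra-table under this definition, which may or may not be what we want.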

zaneselvans commented 1 year ago

What does it mean when a given calculation component has multiple source tables? After discussing with @cmgosnell it seems like there are at least two options. If they are both happening, then we need to be able to differentiate between them and treat them differently in constructing the calculation trees.

Multiple values in source_tables could:

...refer to several instances of the same value being reported in different places. In this case the values are duplicative, and we would want to verify that they're all identical, and then only choose one of them in constructing the calculation tree. In this case we could remove all but one of the values in source_tables and end up with only a single source_table that we use to look up the calculation component.
...refer to different portions of the same underlying value being reported in different tables. E.g. if capital expenditures are reported in several different plant tables, and each refers to the portion of capital expenditure attributable to those individual plants, we would need to sum up the contributions from each of the different plants. In this case all of the listed source tables would be part of the lineage of the calculated value, and we could split a calculation component with several source_tables into several individual calculation components that each have a single, distinct value of source_table.

In general it seems like when referring to a calculation component we need to include both a source table and the fact name.
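The two treatments of a multi-table component could look something like this. It is a sketch only: the function names, the factoid name and the table names are hypothetical, and deciding which option applies to a given component is left open here, just as it is above.

```python
def keep_one_source(component, preferred):
    """Option 1: the values are duplicates, so keep a single source table."""
    if preferred not in component["source_tables"]:
        raise ValueError(f"{preferred} is not among {component['source_tables']}")
    return {**component, "source_tables": [preferred]}


def split_sources(component):
    """Option 2: the values are partial, so emit one component per table."""
    return [
        {**component, "source_tables": [table]}
        for table in component["source_tables"]
    ]


component = {
    "name": "capital_expenditures",  # hypothetical factoid name
    "weight": 1.0,
    "source_tables": ["plants_steam", "plants_hydro"],  # hypothetical tables
}
deduplicated = keep_one_source(component, preferred="plants_hydro")
split = split_sources(component)
```

Both helpers return new dictionaries rather than modifying the input, so the original component stays available for checking that duplicated values really are identical.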

zaneselvans commented 1 year ago

We've added xbrl_factoid_correction elements to the calculations (always) and into the data tables themselves (when there's a correction required), but we have NOT added them into the processed metadata, which is resulting in an inconsistency.

For each calculated xbrl_factoid in the processed metadata, add another otherwise identical record whose xbrl_factoid is xbrl_factoid_correction (e.g. utility_plant_net_correction) and whose calculation is empty ("[]").
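A minimal pandas sketch of adding those correction records, assuming the calculations are stored as JSON strings and that a non-calculated factoid has the calculation "[]". The two-row `meta` dataframe is made up for illustration and has far fewer columns than the real processed metadata.

```python
import pandas as pd

# Hypothetical slice of processed metadata; calculations are JSON strings.
meta = pd.DataFrame(
    {
        "xbrl_factoid": ["utility_plant_net", "construction_work_in_progress"],
        "calculations": ['[{"name": "utility_plant", "weight": 1.0}]', "[]"],
    }
)

# Only calculated factoids (non-empty calculations) get a correction record.
calculated = meta[meta["calculations"] != "[]"]
corrections = calculated.assign(
    xbrl_factoid=calculated["xbrl_factoid"] + "_correction",
    calculations="[]",  # The correction itself is a directly reported leaf.
)
meta_with_corrections = pd.concat([meta, corrections], ignore_index=True)
```

Any other columns on the calculated record would be carried over unchanged by `assign`, which matches the "otherwise identical" requirement.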

@cmgosnell says it will be difficult to add the name_original element into the calculation components:

could happen all the way upstream when we make clean_xbrl_metadata_json -- and then change some of them later.
or it could happen in process_xbrl_metadata() right before / during the renaming of the calculation component names.
however right now the calculation fixes that we've manually compiled have to match exactly the elements they are replacing. Probably it only needs to know the name element to identify the correct calculation to change.
The replacement calculation that we're inserting obviously needs to be complete.
Overall this calculation fixing process needs to be better parameterized.

Identifying unconnected components of the tree:

Portions of the granular tables actually don't feed into the root of the table -- they're dead ends.
We need to be able to remove the portions of the exploded tables that aren't actually referenced by the root table either directly or indirectly. We may also want to be more flexible about which factoids are included in the final explosion (i.e. some of the records in the root table might also be uninteresting).
Should be doable using connected components, given a set of seed factoids that are interesting.
This will probably be another "removal" step in the ExplodeData class.
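One way the dead-end removal might work is sketched below with NetworkX. It uses reachability from the seed factoids (`nx.descendants`) rather than connected components proper, on the assumption that edges point from a calculated fact to its components. The graph and node names are hypothetical, and `prune_to_seeds` is not an existing function.

```python
import networkx as nx

# Hypothetical calculation graph: edges point from a calculated fact
# to the components that it is calculated from.
graph = nx.DiGraph()
graph.add_edges_from(
    [
        ("root", "a"),
        ("root", "b"),
        ("a", "a1"),
        ("dead_end_total", "d1"),  # Never referenced by the root.
    ]
)


def prune_to_seeds(graph, seeds):
    """Keep only the seed nodes and everything they directly or indirectly use."""
    keep = set(seeds)
    for seed in seeds:
        keep |= nx.descendants(graph, seed)
    return graph.subgraph(keep).copy()


pruned = prune_to_seeds(graph, seeds=["root"])
```

Passing a narrower list of seeds gives the extra flexibility mentioned above, since only the facts needed to reproduce those seeds survive.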

Removing duplicate facts

Hacked "leafiness" rating based on what "level" a whole table is at.
Need to come up with a better way of identifying which of a couple of duplicate xbrl factoids we should choose.
Probably there are existing tree / graph metrics that encapsulate this information, which we can just apply.

zaneselvans commented 1 year ago

I've added some draft XBRL calculation tree infrastructure in #2653. Below is an example of how to use it.

I'm doing something wrong in the recursive resolution of the calculation components. Probably it's that I'm editing the calculation in place rather than returning a new calculation in the Ferc1XbrlCalculation.resolve() method.

import json

from dagster import AssetKey

from pudl.etl import defs
from pudl.output.ferc1 import (
    Ferc1XbrlCalculation,
    Ferc1XbrlCalculationComponent,
    MetadataExploder,
)

xbrl_meta = defs.load_asset_value(AssetKey("clean_xbrl_metadata_json"))
balance_sheet_asset_tables = [
    "balance_sheet_assets_ferc1",
    "utility_plant_summary_ferc1",
    "plant_in_service_ferc1",
]
exploded_meta = (
    MetadataExploder(balance_sheet_asset_tables).boom(xbrl_meta)
)[[
    "table_name",
    "xbrl_factoid",
    "calculations",
    "intra_table_calc_flag",
    "xbrl_factoid_original",
]]

calc = Ferc1XbrlCalculation.from_exploded_meta(
    exploded_meta=exploded_meta,
    table_name="balance_sheet_assets_ferc1",
    xbrl_factoid="utility_plant_net",
)
print(json.dumps(calc.dict(), indent=4))

# Ahhhh, I think this needs to return the resolved calculation in the recursion...
calc.resolve(exploded_meta=exploded_meta)
print(json.dumps(calc.dict(), indent=4))
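To illustrate the "return a new calculation instead of editing in place" idea, here is a self-contained sketch. It uses frozen standard-library dataclasses as a stand-in for the Pydantic models, and the `Component` / `Calculation` classes, their fields and the `lookup` dictionary are all hypothetical, not the actual Ferc1XbrlCalculation API. It also assumes the calculations contain no cycles.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Component:
    name: str
    weight: float
    source_table: str
    calculation: Optional["Calculation"] = None


@dataclass(frozen=True)
class Calculation:
    source_table: str
    xbrl_factoid: str
    components: tuple = ()

    def resolve(self, lookup):
        """Return a NEW, recursively resolved calculation; self is untouched."""
        resolved = []
        for comp in self.components:
            child = lookup.get((comp.source_table, comp.name))
            if child is not None:
                # Recurse, then attach the resolved child to a copy of the component.
                comp = replace(comp, calculation=child.resolve(lookup))
            resolved.append(comp)
        return replace(self, components=tuple(resolved))


# Hypothetical lookup from (source_table, xbrl_factoid) to its calculation.
lookup = {
    ("t", "total"): Calculation(
        "t", "total", (Component("sub", 1.0, "t"), Component("leaf_c", -1.0, "t"))
    ),
    ("t", "sub"): Calculation(
        "t", "sub", (Component("leaf_a", 1.0, "t"), Component("leaf_b", 1.0, "t"))
    ),
}
resolved = lookup[("t", "total")].resolve(lookup)
```

Because the objects are frozen, any accidental in-place edit during the recursion raises an error instead of silently producing a partially resolved tree.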

Original Calculation

{
    "calculations": [
        {
            "name": "utility_plant_and_construction_work_in_progress",
            "weight": 1.0,
            "source_tables": [
                "balance_sheet_assets_ferc1"
            ],
            "calculation": null
        },
        {
            "name": "accumulated_provision_for_depreciation_amortization_and_depletion_of_plant_utility",
            "weight": -1.0,
            "source_tables": [
                "balance_sheet_assets_ferc1"
            ],
            "calculation": null
        },
        {
            "name": "utility_plant_net_correction",
            "weight": 1.0,
            "source_tables": [
                "balance_sheet_assets_ferc1"
            ],
            "calculation": null
        }
    ],
    "source_table": "balance_sheet_assets_ferc1",
    "xbrl_factoid": "utility_plant_net",
    "xbrl_factoid_original": "utility_plant_net"
}

(Partially) Resolved Calculation

big JSON blob here

```json
{
    "calculations": [
        {
            "name": "utility_plant_and_construction_work_in_progress",
            "weight": 1.0,
            "source_tables": ["balance_sheet_assets_ferc1"],
            "calculation": {
                "calculations": [
                    {"name": "utility_plant", "weight": 1.0, "source_tables": ["balance_sheet_assets_ferc1"], "calculation": null},
                    {"name": "utility_plant_in_service_classified_and_unclassified", "weight": 1.0, "source_tables": ["utility_plant_summary_ferc1"], "calculation": null},
                    {"name": "utility_plant_leased_to_others", "weight": 1.0, "source_tables": ["utility_plant_summary_ferc1"], "calculation": null},
                    {"name": "utility_plant_held_for_future_use", "weight": 1.0, "source_tables": ["utility_plant_summary_ferc1"], "calculation": null},
                    {"name": "construction_work_in_progress", "weight": 1.0, "source_tables": ["balance_sheet_assets_ferc1"], "calculation": null},
                    {"name": "utility_plant_acquisition_adjustment", "weight": 1.0, "source_tables": ["utility_plant_summary_ferc1"], "calculation": null},
                    {"name": "utility_plant_and_construction_work_in_progress_correction", "weight": 1.0, "source_tables": ["balance_sheet_assets_ferc1"], "calculation": null}
                ],
                "source_table": "balance_sheet_assets_ferc1",
                "xbrl_factoid": "utility_plant_and_construction_work_in_progress",
                "xbrl_factoid_original": "utility_plant_and_construction_work_in_progress"
            }
        },
        {
            "name": "accumulated_provision_for_depreciation_amortization_and_depletion_of_plant_utility",
            "weight": -1.0,
            "source_tables": ["balance_sheet_assets_ferc1"],
            "calculation": {
                "calculations": [
                    {"name": "depreciation_utility_plant_in_service", "weight": 1.0, "source_tables": ["utility_plant_summary_ferc1"], "calculation": null},
                    {"name": "amortization_and_depletion_of_producing_natural_gas_land_and_land_rightsutility_plant_in_service", "weight": 1.0, "source_tables": ["utility_plant_summary_ferc1"], "calculation": null},
                    {"name": "amortization_of_underground_storage_land_and_land_rightsutility_plant_in_service", "weight": 1.0, "source_tables": ["utility_plant_summary_ferc1"], "calculation": null},
                    {"name": "amortization_of_other_utility_plant_utility_plant_in_service", "weight": 1.0, "source_tables": ["utility_plant_summary_ferc1"], "calculation": null},
                    {"name": "depreciation_amortization_and_depletion_utility_plant_in_service", "weight": 1.0, "source_tables": ["utility_plant_summary_ferc1"], "calculation": null},
                    {"name": "depreciation_amortization_and_depletion_utility_plant_leased_to_others", "weight": 1.0, "source_tables": ["utility_plant_summary_ferc1"], "calculation": null},
                    {"name": "depreciation_and_amortization_utility_plant_held_for_future_use", "weight": 1.0, "source_tables": ["utility_plant_summary_ferc1"], "calculation": null},
                    {"name": "abandonment_of_leases", "weight": 1.0, "source_tables": ["utility_plant_summary_ferc1"], "calculation": null},
                    {"name": "amortization_of_plant_acquisition_adjustment", "weight": 1.0, "source_tables": ["utility_plant_summary_ferc1"], "calculation": null},
                    {"name": "accumulated_provision_for_depreciation_amortization_and_depletion_of_plant_utility_correction", "weight": 1.0, "source_tables": ["balance_sheet_assets_ferc1"], "calculation": null}
                ],
                "source_table": "balance_sheet_assets_ferc1",
                "xbrl_factoid": "accumulated_provision_for_depreciation_amortization_and_depletion_of_plant_utility",
                "xbrl_factoid_original": "accumulated_provision_for_depreciation_amortization_and_depletion_of_plant_utility"
            }
        },
        {
            "name": "utility_plant_net_correction",
            "weight": 1.0,
            "source_tables": ["balance_sheet_assets_ferc1"],
            "calculation": {
                "calculations": [],
                "source_table": "balance_sheet_assets_ferc1",
                "xbrl_factoid": "utility_plant_net_correction",
                "xbrl_factoid_original": "utility_plant_net"
            }
        }
    ],
    "source_table": "balance_sheet_assets_ferc1",
    "xbrl_factoid": "utility_plant_net",
    "xbrl_factoid_original": "utility_plant_net"
}
```

zaneselvans commented 1 year ago

Notes from chat with @cmgosnell:

Currently Exploder.boom() does some "pruning" of the metadata to only keep the "leafiest" nodes. But this should probably be done in the ExplodeMeta.
Setting boundaries around the exploded table should also be done in ExplodeMeta
Should be able to give the forest builder a list of factoids and have it identify the unique root nodes in there, and return just the trees required to reproduce those facts (could be either a human generated list or all the facts in a table)
What should we do about references to tables outside the exploded tables? These are rare, and we've "cleaned up" a lot of them.
There are 3 kinds: all "in the table", all "another table", and a mix of tables (which we have "fixed" to be single-table).
Seems like any reference to an outside table in a calculation means we should drop the calculation and turn it into a "[]" which will naturally prune the referenced facts both in and outside of the exploded tables, which is the desired behavior.
What are the top-level classes? Right now we have a Data Exploder which constructs a Metadata Exploder.
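The "drop any calculation that references an outside table" rule from these notes might be sketched like this. The function name, the `calcs` mapping from (table_name, xbrl_factoid) to a component list, and all table and factoid names are hypothetical.

```python
def drop_outside_calculations(calcs, exploded_tables):
    """Replace any calculation that references an outside table with [].

    `calcs` maps (table_name, xbrl_factoid) to a list of components.
    """
    allowed = set(exploded_tables)
    return {
        node: (
            components
            if all(set(c["source_tables"]) <= allowed for c in components)
            else []
        )
        for node, components in calcs.items()
    }


calcs = {
    ("table_a", "total"): [
        {"name": "part_1", "weight": 1.0, "source_tables": ["table_a"]},
        {"name": "part_2", "weight": 1.0, "source_tables": ["table_b"]},
    ],
    ("table_a", "other_total"): [
        {"name": "part_3", "weight": 1.0, "source_tables": ["table_z"]},  # Outside.
    ],
}
cleaned = drop_outside_calculations(calcs, exploded_tables=["table_a", "table_b"])
```

Once the calculation is emptied, the facts it referenced are no longer reachable from that node, so a later seed-based pruning step would remove them without any special casing.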

zaneselvans commented 1 year ago

I've added some code in #2653 that allows the generation of a "leafy" calculation tree for a calculation forest with several different root factoids... using it looks like this right now:

import pandas as pd
from dagster import AssetKey

from pudl.etl import defs
from pudl.output.ferc1 import MetadataExploder, XbrlCalculationForestFerc1

xbrl_meta = defs.load_asset_value(AssetKey("clean_xbrl_metadata_json"))

balance_sheet_asset_tables = [
    "balance_sheet_assets_ferc1",
    "utility_plant_summary_ferc1",
    "plant_in_service_ferc1",
]

meta_tags = pd.DataFrame(
    columns=["table_name", "xbrl_factoid", "is_ratebase", "utility_function"],
    data=[
        ("utility_plant_summary_ferc1", "depreciation_and_amortization_utility_plant_held_for_future_use", True, "electric"),
    ],
).convert_dtypes().set_index(["table_name", "xbrl_factoid"])

exploded_meta = MetadataExploder(balance_sheet_asset_tables).boom(xbrl_meta)

forest = XbrlCalculationForestFerc1.from_exploded_meta(
    source_tables=["balance_sheet_assets_ferc1", "balance_sheet_assets_ferc1"],
    xbrl_factoids=["utility_plant_net", "deferred_debits"],
    exploded_meta=exploded_meta.set_index(["table_name", "xbrl_factoid"]),
    propagate_weights=True,
    tags_df=meta_tags,
)

leafy_meta = forest.to_leafy_meta()

zaneselvans commented 1 year ago

After re-implementing this to use NetworkX for all the graph stuff... there seem to be some non-tree relationships encoded in the calculations.


all_nodes = list(exploded_meta.set_index(["table_name", "xbrl_factoid"]).index)

new_forest = NewXbrlCalcuationForestFerc1(
    exploded_meta=exploded_meta,
    seeds=all_nodes,
    tags=meta_tags,
)

new_nx_forest = new_forest.nx_forest
new_leafy_meta = new_forest.leafy_meta
new_root_calcs = new_forest.root_calculations

import matplotlib.pyplot as plt
import networkx as nx
from networkx.drawing.nx_agraph import graphviz_layout
pos = graphviz_layout(new_nx_forest, prog="dot", args='-Grankdir="LR"')
#nx.draw_networkx(new_nx_forest, pos)
nx.draw_networkx_nodes(new_nx_forest, pos)
nx.draw_networkx_edges(new_nx_forest, pos)
plt.show()

multi_parents = [n for n, in_deg in new_nx_forest.in_degree() if in_deg > 1]

[NodeId(source_table='utility_plant_summary_ferc1', xbrl_factoid='depreciation_utility_plant_in_service'),
 NodeId(source_table='utility_plant_summary_ferc1', xbrl_factoid='amortization_and_depletion_of_producing_natural_gas_land_and_land_rightsutility_plant_in_service'),
 NodeId(source_table='utility_plant_summary_ferc1', xbrl_factoid='amortization_of_underground_storage_land_and_land_rightsutility_plant_in_service'),
 NodeId(source_table='utility_plant_summary_ferc1', xbrl_factoid='amortization_of_other_utility_plant_utility_plant_in_service'),
 NodeId(source_table='utility_plant_summary_ferc1', xbrl_factoid='depreciation_amortization_and_depletion_utility_plant_leased_to_others'),
 NodeId(source_table='utility_plant_summary_ferc1', xbrl_factoid='depreciation_and_amortization_utility_plant_held_for_future_use'),
 NodeId(source_table='utility_plant_summary_ferc1', xbrl_factoid='abandonment_of_leases'),
 NodeId(source_table='utility_plant_summary_ferc1', xbrl_factoid='amortization_of_plant_acquisition_adjustment'),
 NodeId(source_table='balance_sheet_assets_ferc1', xbrl_factoid='utility_plant'),
 NodeId(source_table='utility_plant_summary_ferc1', xbrl_factoid='utility_plant_in_service_classified_and_unclassified'),
 NodeId(source_table='utility_plant_summary_ferc1', xbrl_factoid='utility_plant_leased_to_others'),
 NodeId(source_table='utility_plant_summary_ferc1', xbrl_factoid='utility_plant_held_for_future_use'),
 NodeId(source_table='utility_plant_summary_ferc1', xbrl_factoid='utility_plant_acquisition_adjustment'),
 NodeId(source_table='balance_sheet_assets_ferc1', xbrl_factoid='noncurrent_portion_of_allowances'),
 NodeId(source_table='balance_sheet_assets_ferc1', xbrl_factoid='derivative_instrument_assets_long_term'),
 NodeId(source_table='balance_sheet_assets_ferc1', xbrl_factoid='derivative_instrument_assets_hedges_long_term')]

zaneselvans commented 1 year ago

Fixing multi-parent nodes

Remove assertion that source_tables needs to contain only a single element, after merging in changes from #2701
Merge in the calculation re-directs that @e-belfer has added in #2669 to simplify the cases in which the same fact is showing up in multiple locations.
Add a step in the tree construction that replaces any node that has only a single parent and a single child with an edge directly connecting the parent and child. This should turn the DAG into a forest.
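The pass-through removal step could be sketched with NetworkX as follows. This assumes the graph is a DAG with edges pointing from parent to child and a "weight" attribute on each edge. Multiplying the two weights onto the new edge is my assumption about how to preserve signs, and `collapse_passthroughs` is a hypothetical name. Whether this step alone is enough to turn the real DAG into a forest is not verified here.

```python
import networkx as nx


def collapse_passthroughs(graph):
    """Replace parent -> node -> child chains with a direct parent -> child edge."""
    collapsed = graph.copy()
    for node in list(graph.nodes):
        if collapsed.in_degree(node) == 1 and collapsed.out_degree(node) == 1:
            (parent,) = collapsed.predecessors(node)
            (child,) = collapsed.successors(node)
            # Carry the combined weight of both edges onto the new edge.
            weight = (
                collapsed[parent][node].get("weight", 1.0)
                * collapsed[node][child].get("weight", 1.0)
            )
            collapsed.remove_node(node)
            collapsed.add_edge(parent, child, weight=weight)
    return collapsed


# Hypothetical graph with a single pass-through node.
graph = nx.DiGraph()
graph.add_edge("root", "passthrough", weight=-1.0)
graph.add_edge("passthrough", "leaf", weight=1.0)
graph.add_edge("root", "other", weight=1.0)

collapsed = collapse_passthroughs(graph)
```

If the parent already had a direct edge to the child, `add_edge` would overwrite its weight, so that case would need separate handling.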

zaneselvans commented 1 year ago

Weirdly, it turns out that using the utility_plant_summary_ferc1 and balance_sheet_assets_ferc1 facts as seeds for the forest results in exactly the same set of nodes:

import importlib.resources
import json

import networkx as nx
import pandas as pd
from dagster import AssetKey

from pudl.etl import defs
from pudl.output.ferc1 import MetadataExploder, NodeId, XbrlCalculationForestFerc1

xbrl_meta = defs.load_asset_value(AssetKey("clean_xbrl_metadata_json"))

balance_sheet_asset_tables = [
    "balance_sheet_assets_ferc1",
    "utility_plant_summary_ferc1",
    "plant_in_service_ferc1",
]

# NOTE: there are a bunch of duplicate records in xbrl_factoid_rate_base_tags.csv
pkg_source = (
    importlib.resources.files("pudl.package_data.ferc1")
    .joinpath("xbrl_factoid_rate_base_tags.csv")
)
with importlib.resources.as_file(pkg_source) as tags_csv:
    in_rate_base = (
        pd.read_csv(tags_csv, usecols=["xbrl_factoid", "table_name", "in_rate_base"])
        .drop_duplicates(subset=["table_name", "xbrl_factoid"])
    )

exploded_meta = MetadataExploder(balance_sheet_asset_tables).boom(xbrl_meta)

pis_seeds = list(
    exploded_meta[exploded_meta.table_name == "plant_in_service_ferc1"]
    .set_index(["table_name", "xbrl_factoid"]).index
)
ups_seeds = list(
    exploded_meta[exploded_meta.table_name == "utility_plant_summary_ferc1"]
    .set_index(["table_name", "xbrl_factoid"]).index
)
bsa_seeds = list(
    exploded_meta[exploded_meta.table_name == "balance_sheet_assets_ferc1"]
    .set_index(["table_name", "xbrl_factoid"]).index
)

bs_forest = XbrlCalculationForestFerc1(
    exploded_meta=exploded_meta,
    tags=in_rate_base
)
pis_forest = XbrlCalculationForestFerc1(
    exploded_meta=exploded_meta,
    seeds=pis_seeds,
    tags=in_rate_base
)
bsa_forest = XbrlCalculationForestFerc1(
    exploded_meta=exploded_meta,
    seeds=bsa_seeds,
    tags=in_rate_base
)
ups_forest = XbrlCalculationForestFerc1(
    exploded_meta=exploded_meta,
    seeds=ups_seeds,
    tags=in_rate_base
)

assert ups_forest.nx_forest.nodes == bsa_forest.nx_forest.nodes

There are only 6 nodes with calculations that involve other nodes that show up in more than one calculation, and they are:

[NodeId(source_table='balance_sheet_assets_ferc1', xbrl_factoid='other_property_and_investments'),
 NodeId(source_table='utility_plant_summary_ferc1', xbrl_factoid='accumulated_provision_for_depreciation_amortization_and_depletion_of_plant_utility'),
 NodeId(source_table='balance_sheet_assets_ferc1', xbrl_factoid='current_and_accrued_assets'),
 NodeId(source_table='balance_sheet_assets_ferc1', xbrl_factoid='utility_plant_and_construction_work_in_progress'),
 NodeId(source_table='balance_sheet_assets_ferc1', xbrl_factoid='accumulated_provision_for_depreciation_amortization_and_depletion_of_plant_utility'),
 NodeId(source_table='utility_plant_summary_ferc1', xbrl_factoid='utility_plant_and_construction_work_in_progress')]

I think the major culprits are the accumulated_provision_for_depreciation_amortization_and_depletion_of_plant_utility and utility_plant_and_construction_work_in_progress factoids, which show up in both balance_sheet_assets_ferc1 and utility_plant_summary_ferc1 and seem to contain entirely duplicated calculations.

Here's a dictionary that maps NodeId to calculations, for the nodes that are "bad parents" (with calculations that involve duplicated facts). I'm not sure if this is the right / enough information to figure out if we can eliminate the duplication with passthrough calculations though. @e-belfer

```py {NodeId(source_table='balance_sheet_assets_ferc1', xbrl_factoid='other_property_and_investments'): [{'name': 'nonutility_property', 'weight': 1.0, 'source_tables': ['balance_sheet_assets_ferc1']}, {'name': 'accumulated_provision_for_depreciation_and_amortization_of_nonutility_property', 'weight': -1.0, 'source_tables': ['balance_sheet_assets_ferc1']}, {'name': 'investment_in_associated_companies', 'weight': 1.0, 'source_tables': ['balance_sheet_assets_ferc1']}, {'name': 'investment_in_subsidiary_companies', 'weight': 1.0, 'source_tables': ['balance_sheet_assets_ferc1']}, {'name': 'noncurrent_portion_of_allowances', 'weight': 1.0, 'source_tables': ['balance_sheet_assets_ferc1']}, {'name': 'other_investments', 'weight': 1.0, 'source_tables': ['balance_sheet_assets_ferc1']}, {'name': 'sinking_funds', 'weight': 1.0, 'source_tables': ['balance_sheet_assets_ferc1']}, {'name': 'depreciation_fund', 'weight': 1.0, 'source_tables': ['balance_sheet_assets_ferc1']}, {'name': 'amortization_fund_federal', 'weight': 1.0, 'source_tables': ['balance_sheet_assets_ferc1']}, {'name': 'other_special_funds', 'weight': 1.0, 'source_tables': ['balance_sheet_assets_ferc1']}, {'name': 'special_funds', 'weight': 1.0, 'source_tables': ['balance_sheet_assets_ferc1']}, {'name': 'derivative_instrument_assets_long_term', 'weight': 1.0, 'source_tables': ['balance_sheet_assets_ferc1']}, {'name': 'derivative_instrument_assets_hedges_long_term', 'weight': 1.0, 'source_tables': ['balance_sheet_assets_ferc1']}, {'name': 'special_funds_all', 'weight': 1.0, 'source_tables': ['balance_sheet_assets_ferc1']}, {'name': 'other_property_and_investments_correction', 'weight': 1.0, 'source_tables': ['balance_sheet_assets_ferc1']}], NodeId(source_table='utility_plant_summary_ferc1', xbrl_factoid='accumulated_provision_for_depreciation_amortization_and_depletion_of_plant_utility'): [{'name': 'depreciation_utility_plant_in_service', 'weight': 1.0, 'source_tables': ['utility_plant_summary_ferc1']}, {'name': 
'amortization_and_depletion_of_producing_natural_gas_land_and_land_rights_utility_plant_in_service', 'weight': 1.0, 'source_tables': ['utility_plant_summary_ferc1']}, {'name': 'amortization_of_underground_storage_land_and_land_rights_utility_plant_in_service', 'weight': 1.0, 'source_tables': ['utility_plant_summary_ferc1']}, {'name': 'amortization_of_other_utility_plant_utility_plant_in_service', 'weight': 1.0, 'source_tables': ['utility_plant_summary_ferc1']}, {'name': 'depreciation_amortization_and_depletion_utility_plant_leased_to_others', 'weight': 1.0, 'source_tables': ['utility_plant_summary_ferc1']}, {'name': 'depreciation_and_amortization_utility_plant_held_for_future_use', 'weight': 1.0, 'source_tables': ['utility_plant_summary_ferc1']}, {'name': 'abandonment_of_leases', 'weight': 1.0, 'source_tables': ['utility_plant_summary_ferc1']}, {'name': 'amortization_of_plant_acquisition_adjustment', 'weight': 1.0, 'source_tables': ['utility_plant_summary_ferc1']}, {'name': 'accumulated_provision_for_depreciation_amortization_and_depletion_of_plant_utility_correction', 'weight': 1.0, 'source_tables': ['utility_plant_summary_ferc1']}], NodeId(source_table='balance_sheet_assets_ferc1', xbrl_factoid='current_and_accrued_assets'): [{'name': 'cash_and_working_funds', 'weight': 1.0, 'source_tables': ['balance_sheet_assets_ferc1']}, {'name': 'cash', 'weight': 1.0, 'source_tables': ['balance_sheet_assets_ferc1']}, {'name': 'special_deposits', 'weight': 1.0, 'source_tables': ['balance_sheet_assets_ferc1']}, {'name': 'working_funds', 'weight': 1.0, 'source_tables': ['balance_sheet_assets_ferc1']}, {'name': 'temporary_cash_investments', 'weight': 1.0, 'source_tables': ['balance_sheet_assets_ferc1']}, {'name': 'notes_receivable', 'weight': 1.0, 'source_tables': ['balance_sheet_assets_ferc1']}, {'name': 'customer_accounts_receivable', 'weight': 1.0, 'source_tables': ['balance_sheet_assets_ferc1']}, {'name': 'other_accounts_receivable', 'weight': 1.0, 'source_tables': 
['balance_sheet_assets_ferc1']}, {'name': 'accumulated_provision_for_uncollectible_accounts_credit', 'weight': -1.0, 'source_tables': ['balance_sheet_assets_ferc1']}, {'name': 'notes_receivable_from_associated_companies', 'weight': 1.0, 'source_tables': ['balance_sheet_assets_ferc1']}, {'name': 'accounts_receivable_from_associated_companies', 'weight': 1.0, 'source_tables': ['balance_sheet_assets_ferc1']}, {'name': 'fuel_stock', 'weight': 1.0, 'source_tables': ['balance_sheet_assets_ferc1']}, {'name': 'fuel_stock_expenses_undistributed', 'weight': 1.0, 'source_tables': ['balance_sheet_assets_ferc1']}, {'name': 'residuals', 'weight': 1.0, 'source_tables': ['balance_sheet_assets_ferc1']}, {'name': 'plant_materials_and_operating_supplies', 'weight': 1.0, 'source_tables': ['balance_sheet_assets_ferc1']}, {'name': 'merchandise', 'weight': 1.0, 'source_tables': ['balance_sheet_assets_ferc1']}, {'name': 'other_materials_and_supplies', 'weight': 1.0, 'source_tables': ['balance_sheet_assets_ferc1']}, {'name': 'nuclear_materials_held_for_sale', 'weight': 1.0, 'source_tables': ['balance_sheet_assets_ferc1']}, {'name': 'allowance_inventory_and_withheld', 'weight': 1.0, 'source_tables': ['balance_sheet_assets_ferc1']}, {'name': 'noncurrent_portion_of_allowances', 'weight': 1.0, 'source_tables': ['balance_sheet_assets_ferc1']}, {'name': 'stores_expense_undistributed', 'weight': 1.0, 'source_tables': ['balance_sheet_assets_ferc1']}, {'name': 'gas_stored_current', 'weight': 1.0, 'source_tables': ['balance_sheet_assets_ferc1']}, {'name': 'liquefied_natural_gas_stored_and_held_for_processing', 'weight': 1.0, 'source_tables': ['balance_sheet_assets_ferc1']}, {'name': 'prepayments', 'weight': 1.0, 'source_tables': ['balance_sheet_assets_ferc1']}, {'name': 'advances_for_gas', 'weight': 1.0, 'source_tables': ['balance_sheet_assets_ferc1']}, {'name': 'interest_and_dividends_receivable', 'weight': 1.0, 'source_tables': ['balance_sheet_assets_ferc1']}, {'name': 'rents_receivable', 
'weight': 1.0, 'source_tables': ['balance_sheet_assets_ferc1']}, {'name': 'accrued_utility_revenues', 'weight': 1.0, 'source_tables': ['balance_sheet_assets_ferc1']}, {'name': 'miscellaneous_current_and_accrued_assets', 'weight': 1.0, 'source_tables': ['balance_sheet_assets_ferc1']}, {'name': 'derivative_instrument_assets', 'weight': 1.0, 'source_tables': ['balance_sheet_assets_ferc1']}, {'name': 'derivative_instrument_assets_long_term', 'weight': 1.0, 'source_tables': ['balance_sheet_assets_ferc1']}, {'name': 'derivative_instrument_assets_hedges', 'weight': 1.0, 'source_tables': ['balance_sheet_assets_ferc1']}, {'name': 'derivative_instrument_assets_hedges_long_term', 'weight': 1.0, 'source_tables': ['balance_sheet_assets_ferc1']}, {'name': 'current_and_accrued_assets_correction', 'weight': 1.0, 'source_tables': ['balance_sheet_assets_ferc1']}], NodeId(source_table='balance_sheet_assets_ferc1', xbrl_factoid='utility_plant_and_construction_work_in_progress'): [{'name': 'utility_plant', 'weight': 1.0, 'source_tables': ['balance_sheet_assets_ferc1']}, {'name': 'utility_plant_in_service_classified_and_unclassified', 'weight': 1.0, 'source_tables': ['utility_plant_summary_ferc1']}, {'name': 'utility_plant_leased_to_others', 'weight': 1.0, 'source_tables': ['utility_plant_summary_ferc1']}, {'name': 'utility_plant_held_for_future_use', 'weight': 1.0, 'source_tables': ['utility_plant_summary_ferc1']}, {'name': 'construction_work_in_progress', 'weight': 1.0, 'source_tables': ['balance_sheet_assets_ferc1']}, {'name': 'utility_plant_acquisition_adjustment', 'weight': 1.0, 'source_tables': ['utility_plant_summary_ferc1']}, {'name': 'utility_plant_and_construction_work_in_progress_correction', 'weight': 1.0, 'source_tables': ['balance_sheet_assets_ferc1']}], NodeId(source_table='balance_sheet_assets_ferc1', xbrl_factoid='accumulated_provision_for_depreciation_amortization_and_depletion_of_plant_utility'): [{'name': 'depreciation_utility_plant_in_service', 'weight': 1.0, 
'source_tables': ['utility_plant_summary_ferc1']}, {'name': 'amortization_and_depletion_of_producing_natural_gas_land_and_land_rights_utility_plant_in_service', 'weight': 1.0, 'source_tables': ['utility_plant_summary_ferc1']}, {'name': 'amortization_of_underground_storage_land_and_land_rights_utility_plant_in_service', 'weight': 1.0, 'source_tables': ['utility_plant_summary_ferc1']}, {'name': 'amortization_of_other_utility_plant_utility_plant_in_service', 'weight': 1.0, 'source_tables': ['utility_plant_summary_ferc1']}, {'name': 'depreciation_amortization_and_depletion_utility_plant_leased_to_others', 'weight': 1.0, 'source_tables': ['utility_plant_summary_ferc1']}, {'name': 'depreciation_and_amortization_utility_plant_held_for_future_use', 'weight': 1.0, 'source_tables': ['utility_plant_summary_ferc1']}, {'name': 'abandonment_of_leases', 'weight': 1.0, 'source_tables': ['utility_plant_summary_ferc1']}, {'name': 'amortization_of_plant_acquisition_adjustment', 'weight': 1.0, 'source_tables': ['utility_plant_summary_ferc1']}, {'name': 'accumulated_provision_for_depreciation_amortization_and_depletion_of_plant_utility_correction', 'weight': 1.0, 'source_tables': ['balance_sheet_assets_ferc1']}], NodeId(source_table='utility_plant_summary_ferc1', xbrl_factoid='utility_plant_and_construction_work_in_progress'): [{'name': 'utility_plant', 'weight': 1.0, 'source_tables': ['balance_sheet_assets_ferc1']}, {'name': 'utility_plant_in_service_classified_and_unclassified', 'weight': 1.0, 'source_tables': ['utility_plant_summary_ferc1']}, {'name': 'utility_plant_leased_to_others', 'weight': 1.0, 'source_tables': ['utility_plant_summary_ferc1']}, {'name': 'utility_plant_held_for_future_use', 'weight': 1.0, 'source_tables': ['utility_plant_summary_ferc1']}, {'name': 'construction_work_in_progress', 'weight': 1.0, 'source_tables': ['utility_plant_summary_ferc1']}, {'name': 'utility_plant_acquisition_adjustment', 'weight': 1.0, 'source_tables': ['utility_plant_summary_ferc1']}, 
{'name': 'utility_plant_and_construction_work_in_progress_correction', 'weight': 1.0, 'source_tables': ['utility_plant_summary_ferc1']}]} ```

zaneselvans commented 1 year ago

Maybe passthrough calculations can't fix this problem? Is the real problem that we truly have the same money being reported in two places: the utility_plant_summary_ferc1 table and the balance_sheet_assets_ferc1 table?

I imagine that the utility_plant_summary_ferc1 table is also hooked up to the plant_in_service_ferc1 table with an interdimensional utility_type calculation.

e-belfer commented 1 year ago

I think the major culprits are the accumulated_provision_for_depreciation_amortization_and_depletion_of_plant_utility and utility_plant_and_construction_work_in_progress factoids, which show up in both balance_sheet_assets_ferc1 and utility_plant_summary_ferc1 and seem to contain entirely duplicated calculations.

I'm not 100% following everything above, but I can confirm that both of these factoids are reported to be identical in both places in the yeti metadata. What if we drop the duplicate calculations from the root table and reassign the value from utility plant summary to point at the balance sheet assets one? Is there a way to distinguish between the factoid in its two locations?

E.g.: utility_plant_construction_in_progress (balance sheet assets) = utility_plant_construction_in_progress (utility plant summary) = components.

zaneselvans commented 1 year ago

Ahhhhh, looking at the data tables and the forms, I think I understand better why this is showing up in two places. Sorry if this was already obvious to you. The utility_plant_summary_ferc1 (UPS) table has another dimension (utility_type) that breaks down all of these accounts into more granular electric, gas, & other categories, while the balance_sheet_assets_ferc1 (BSA) table only has the starting & ending balances for each of the factoids, totaling across all utility types.

So it seems like the redirection that you're suggesting (from BSA to UPS) would preserve the utility-type information that's present in the other dimension, and link the totals in the BSA table to the totals of all utility types in the UPS table, which would be great!
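
A rough sketch of what that link would let us check, assuming the UPS values can be summed over utility_type and compared to the BSA totals. The row layout, the helper names (`sum_over_utility_type`, `find_mismatches`) and all of the dollar values are made up for illustration; this is not how the PUDL tables are actually laid out.

```python
from collections import defaultdict

# Hypothetical UPS rows: (xbrl_factoid, utility_type, ending_balance).
# The numbers are invented for illustration only.
ups_rows = [
    ("utility_plant_and_construction_work_in_progress", "electric", 700.0),
    ("utility_plant_and_construction_work_in_progress", "gas", 250.0),
    ("utility_plant_and_construction_work_in_progress", "other", 50.0),
]

# Hypothetical BSA totals: xbrl_factoid -> ending_balance across all utility types.
bsa_totals = {"utility_plant_and_construction_work_in_progress": 1000.0}


def sum_over_utility_type(rows):
    """Collapse the utility_type dimension by summing values per factoid."""
    totals = defaultdict(float)
    for factoid, _utility_type, value in rows:
        totals[factoid] += value
    return dict(totals)


def find_mismatches(bsa, ups_sums, tol=1e-6):
    """Return factoids whose BSA total differs from the summed UPS values."""
    return {
        factoid: (bsa[factoid], ups_sums[factoid])
        for factoid in bsa
        if factoid in ups_sums and abs(bsa[factoid] - ups_sums[factoid]) > tol
    }


ups_sums = sum_over_utility_type(ups_rows)
mismatches = find_mismatches(bsa_totals, ups_sums)
```

If the redirection works as hoped, `mismatches` should come back empty for the factoids that appear in both tables.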

zaneselvans commented 1 year ago

After merging in the interdimensional branch I'm now encountering two new issues with building the calculation trees:

In a couple of cases, there appears to be a conflict between weights for a given calculation, where from one source it's 1.0 and in another it's -1.0.

2023-06-29 11:59:55 [   ERROR] catalystcoop.pudl.output.ferc1:1793 Calculation weights do not match for NodeId(source_table='utility_plant_summary_ferc1', xbrl_factoid='accumulated_provision_for_depreciation_amortization_and_depletion_of_plant_utility'):1.0 != -1.0
2023-06-29 11:59:55 [   ERROR] catalystcoop.pudl.output.ferc1:1793 Calculation weights do not match for NodeId(source_table='plant_in_service_ferc1', xbrl_factoid='electric_plant_sold'):1.0 != -1.0

Should electric_plant_sold have a weight of -1.0 instead? I guess it shows up in one place with +1.0 and another with -1.0. Maybe the expectation I'm asserting is not correct, but it also wasn't failing before.
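
For reference, here is a minimal sketch of the kind of check that produces the errors above: collect every weight a node is assigned across all the calculations it appears in, and flag nodes with more than one distinct weight. The `NodeId` fields match what shows up in the log, but `find_weight_conflicts`, the edge layout, and the parent nodes are hypothetical stand-ins, not the actual code in pudl.output.ferc1.

```python
from collections import defaultdict
from typing import NamedTuple


class NodeId(NamedTuple):
    source_table: str
    xbrl_factoid: str


def find_weight_conflicts(edges):
    """Map each child node to its weights, keeping only nodes with >1 distinct weight.

    edges is an iterable of (parent NodeId, child NodeId, weight) tuples.
    """
    weights_by_node = defaultdict(set)
    for _parent, child, weight in edges:
        weights_by_node[child].add(weight)
    return {
        node: sorted(weights)
        for node, weights in weights_by_node.items()
        if len(weights) > 1
    }


sold = NodeId("plant_in_service_ferc1", "electric_plant_sold")
# The parents below are placeholders; the real parent calculations are not shown here.
edges = [
    (NodeId("hypothetical_table_a", "hypothetical_parent_a"), sold, 1.0),
    (NodeId("hypothetical_table_b", "hypothetical_parent_b"), sold, -1.0),
    (
        NodeId("hypothetical_table_a", "hypothetical_parent_a"),
        NodeId("hypothetical_table_a", "consistent_child"),
        1.0,
    ),
]
conflicts = find_weight_conflicts(edges)
```

Whether a conflict like this is really an error depends on whether a factoid is allowed to enter different parent calculations with different signs, which is exactly the expectation being questioned here.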

There seem to be some new inconsistencies between the exploded metadata and the calculations, including a couple of cases where the unfixed rightsutility string shows up rather than rights_utility and another one with the CWIP:

KeyError: "[
    ('utility_plant_summary_ferc1', 'amortization_of_underground_storage_land_and_land_rightsutility_plant_in_service'),
    ('utility_plant_summary_ferc1', 'amortization_and_depletion_of_producing_natural_gas_land_and_land_rightsutility_plant_in_service'),
    ('utility_plant_summary_ferc1', 'utility_plant_construction_work_in_progress')
] not in index"

It seems like the first two are probably just the rightsutility fix somehow not being propagated into either the calculations or the metadata, but I don't understand why the CWIP factoid would have gone missing. I guess this is the disappearing metadata you were talking about.
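
A quick way to separate the two kinds of failure, sketched under the assumption that both the calculation components and the exploded metadata index can be treated as collections of (table, factoid) pairs. The helper names `missing_from_index` and `fix_rightsutility` are hypothetical, and the index below only contains a handful of UPS entries copied from the dump above.

```python
def missing_from_index(components, index):
    """Return the (table, factoid) pairs referenced by calculations but absent from the index."""
    return sorted(set(components) - set(index))


def fix_rightsutility(components):
    """Apply the rightsutility -> rights_utility rename to every factoid name."""
    return [
        (table, factoid.replace("rightsutility", "rights_utility"))
        for table, factoid in components
    ]


ups = "utility_plant_summary_ferc1"
# Partial, illustrative metadata index using the fixed factoid names.
metadata_index = [
    (ups, "amortization_of_underground_storage_land_and_land_rights_utility_plant_in_service"),
    (ups, "amortization_and_depletion_of_producing_natural_gas_land_and_land_rights_utility_plant_in_service"),
    (ups, "construction_work_in_progress"),
]
# The three pairs reported in the KeyError.
calc_components = [
    (ups, "amortization_of_underground_storage_land_and_land_rightsutility_plant_in_service"),
    (ups, "amortization_and_depletion_of_producing_natural_gas_land_and_land_rightsutility_plant_in_service"),
    (ups, "utility_plant_construction_work_in_progress"),
]

missing_before = missing_from_index(calc_components, metadata_index)
missing_after = missing_from_index(fix_rightsutility(calc_components), metadata_index)
```

With the rename applied, only the CWIP pair is still missing, so that one needs a separate explanation.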

cmgosnell commented 1 year ago

closing bc we merged #2653 into explode_ferc1
