cmgosnell closed this 1 year ago
The raw materials we're working with are the metadata dataframes that come out of the table transformers, especially the `calculations` column.
Output that we want in the end is
Start with the `balance_sheet_assets_ferc1` table because it has a lot of calculations, both inter- and intra-table. Whether a calculation is within or between tables can be derived entirely from information contained in the calculation.
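A minimal sketch of that derivation, assuming each calculation component carries a `source_tables` list (the helper name `is_intra_table` is made up for illustration and is not part of PUDL):

```python
def is_intra_table(components):
    """Return True if every component draws on one and the same source table.

    Hypothetical helper: assumes each component is a dict with a
    "source_tables" list, as in the calculation JSON in this thread.
    """
    tables = {table for comp in components for table in comp["source_tables"]}
    return len(tables) == 1


intra = [
    {"name": "utility_plant_and_construction_work_in_progress", "weight": 1.0,
     "source_tables": ["balance_sheet_assets_ferc1"]},
    {"name": "utility_plant_net_correction", "weight": 1.0,
     "source_tables": ["balance_sheet_assets_ferc1"]},
]
inter = intra + [
    {"name": "utility_plant_in_service_classified_and_unclassified", "weight": 1.0,
     "source_tables": ["utility_plant_summary_ferc1"]},
]
print(is_intra_table(intra), is_intra_table(inter))
```

Nothing here needs to know which table the calculation itself lives in, which matches the point that the calculation alone is enough.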
A calculation exists independently of any table or factoid: it doesn't say what factoid it's calculating or what table that calculation is within. In the exploded metadata table we'll have both `source_table` and `xbrl_factoid` in their own columns.
What does it mean when a given calculation component has multiple source tables? After discussing with @cmgosnell it seems like there are at least two options. If they are both happening, then we need to be able to differentiate between them and treat them differently in constructing the calculation trees.
Multiple values in `source_tables` could:
- [...] `source_tables` and end up with only a single `source_table` that we use to look up the calculation component.
- [...] `source_tables` into several individual calculation components that each have a single, distinct value of `source_table`.

In general it seems like when referring to a calculation component we need to include both a source table and the fact name.
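A rough sketch of the second option (one component per source table), so that every component is identified by a source table plus a fact name. `split_component` is a hypothetical helper, and the dict layout is assumed from the calculation JSON elsewhere in this issue:

```python
def split_component(component):
    """Split one multi-table component into single-table components.

    Hypothetical helper: each output dict has a scalar "source_table"
    instead of the original "source_tables" list.
    """
    return [
        {
            "name": component["name"],
            "weight": component["weight"],
            "source_table": table,
        }
        for table in component["source_tables"]
    ]


component = {
    "name": "utility_plant_and_construction_work_in_progress",
    "weight": 1.0,
    "source_tables": ["balance_sheet_assets_ferc1", "utility_plant_summary_ferc1"],
}
parts = split_component(component)
# Each part can now be looked up by (source_table, name).
keys = [(part["source_table"], part["name"]) for part in parts]
```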
We've added `xbrl_factoid_correction` elements to the calculations (always) and into the data tables themselves (when there's a correction required), but we have NOT added them into the processed metadata, which is resulting in an inconsistency.
For each calculated `xbrl_factoid` in the processed metadata, add another identical record, but named `xbrl_factoid_correction` and with a null calculation `"[]"`.
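Something like the following could do it. This is only a sketch on plain dicts: it assumes the calculations are stored as JSON strings and that the correction record is named `<xbrl_factoid>_correction` (as in `utility_plant_net_correction`); `add_correction_records` is a made-up name.

```python
import json


def add_correction_records(meta_records):
    """Append a "_correction" twin for every factoid that has a calculation.

    Hypothetical helper operating on a list of dicts rather than the real
    processed metadata dataframe.
    """
    out = list(meta_records)
    for rec in meta_records:
        if json.loads(rec["calculations"]):  # non-empty list => calculated factoid
            out.append(
                {
                    **rec,
                    "xbrl_factoid": rec["xbrl_factoid"] + "_correction",
                    "calculations": "[]",  # corrections are never calculated
                }
            )
    return out


meta = [
    {"xbrl_factoid": "utility_plant_net",
     "calculations": '[{"name": "utility_plant", "weight": 1.0}]'},
    {"xbrl_factoid": "utility_plant", "calculations": "[]"},
]
with_corrections = add_correction_records(meta)
```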
@cmgosnell says it will be difficult to add the `name_original` element into the calculation components:
- [...] `clean_xbrl_metadata_json` -- and then change some of them later.
- [...] `process_xbrl_metadata()` right before / during the renaming of the calculation component names.
- [...] `name` element to identify the correct calculation to change.

I've added some draft XBRL calculation tree infrastructure in #2653. Below is an example of how to use it.
I'm doing something wrong in the recursive resolution of the calculation components. Probably it's that I'm editing the calculation in place rather than returning a new calculation in the `Ferc1XbrlCalculation.resolve()` method.
```python
import json

from dagster import AssetKey

from pudl.etl import defs
from pudl.output.ferc1 import (
    Ferc1XbrlCalculation,
    Ferc1XbrlCalculationComponent,
    MetadataExploder,
)

# Load the cleaned XBRL metadata asset and explode it for the tables of interest.
xbrl_meta = defs.load_asset_value(AssetKey("clean_xbrl_metadata_json"))
balance_sheet_asset_tables = [
    "balance_sheet_assets_ferc1",
    "utility_plant_summary_ferc1",
    "plant_in_service_ferc1",
]
exploded_meta = MetadataExploder(balance_sheet_asset_tables).boom(xbrl_meta)[
    [
        "table_name",
        "xbrl_factoid",
        "calculations",
        "intra_table_calc_flag",
        "xbrl_factoid_original",
    ]
]
calc = Ferc1XbrlCalculation.from_exploded_meta(
    exploded_meta=exploded_meta,
    table_name="balance_sheet_assets_ferc1",
    xbrl_factoid="utility_plant_net",
)
print(json.dumps(calc.dict(), indent=4))
# Ahhhh, I think this needs to return the resolved calculation in the recursion...
calc.resolve(exploded_meta=exploded_meta)
print(json.dumps(calc.dict(), indent=4))
```
```json
{
    "calculations": [
        {
            "name": "utility_plant_and_construction_work_in_progress",
            "weight": 1.0,
            "source_tables": [
                "balance_sheet_assets_ferc1"
            ],
            "calculation": null
        },
        {
            "name": "accumulated_provision_for_depreciation_amortization_and_depletion_of_plant_utility",
            "weight": -1.0,
            "source_tables": [
                "balance_sheet_assets_ferc1"
            ],
            "calculation": null
        },
        {
            "name": "utility_plant_net_correction",
            "weight": 1.0,
            "source_tables": [
                "balance_sheet_assets_ferc1"
            ],
            "calculation": null
        }
    ],
    "source_table": "balance_sheet_assets_ferc1",
    "xbrl_factoid": "utility_plant_net",
    "xbrl_factoid_original": "utility_plant_net"
}
```
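On the in-place vs. return-a-new-object question for `resolve()`: a toy version of the "return a new calculation from the recursion" approach might look like this. It is a sketch with stand-in names (`Component`, `resolve`, `calcs`), not the actual `Ferc1XbrlCalculation` API, and it assumes the calculations contain no cycles.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Component:
    """Toy stand-in for a calculation component (not the PUDL class)."""

    name: str
    weight: float
    calculation: tuple = ()  # nested components; empty for a leaf


def resolve(component, calcs):
    """Return a NEW component with its nested calculations filled in.

    `calcs` maps a factoid name to a tuple of its (unresolved) components.
    The input component is never mutated.
    """
    children = tuple(resolve(child, calcs) for child in calcs.get(component.name, ()))
    return replace(component, calculation=children)


calcs = {
    "utility_plant_net": (
        Component("utility_plant_and_construction_work_in_progress", 1.0),
        Component("utility_plant_net_correction", 1.0),
    ),
    "utility_plant_and_construction_work_in_progress": (
        Component("utility_plant", 1.0),
    ),
}
root = Component("utility_plant_net", 1.0)
resolved = resolve(root, calcs)
```

Because the dataclass is frozen, any accidental in-place edit raises an error instead of silently leaving the tree half-resolved.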
Notes from chat with @cmgosnell:
- [...] `"[]"`, which will naturally prune the referenced facts both in and outside of the exploded tables, which is the desired behavior.

I've added some code in #2653 that allows the generation of a "leafy" calculation tree for a calculation forest with several different root factoids... using it looks like this right now:
```python
import pandas as pd
from dagster import AssetKey

from pudl.etl import defs
from pudl.output.ferc1 import MetadataExploder, XbrlCalculationForestFerc1

xbrl_meta = defs.load_asset_value(AssetKey("clean_xbrl_metadata_json"))
balance_sheet_asset_tables = [
    "balance_sheet_assets_ferc1",
    "utility_plant_summary_ferc1",
    "plant_in_service_ferc1",
]
# Tags to attach to individual factoids, indexed by (table_name, xbrl_factoid).
meta_tags = (
    pd.DataFrame(
        columns=["table_name", "xbrl_factoid", "is_ratebase", "utility_function"],
        data=[
            (
                "utility_plant_summary_ferc1",
                "depreciation_and_amortization_utility_plant_held_for_future_use",
                True,
                "electric",
            ),
        ],
    )
    .convert_dtypes()
    .set_index(["table_name", "xbrl_factoid"])
)
exploded_meta = MetadataExploder(balance_sheet_asset_tables).boom(xbrl_meta)
forest = XbrlCalculationForestFerc1.from_exploded_meta(
    source_tables=["balance_sheet_assets_ferc1", "balance_sheet_assets_ferc1"],
    xbrl_factoids=["utility_plant_net", "deferred_debits"],
    exploded_meta=exploded_meta.set_index(["table_name", "xbrl_factoid"]),
    propagate_weights=True,
    tags_df=meta_tags,
)
leafy_meta = forest.to_leafy_meta()
```
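For reference, one plausible reading of `propagate_weights` is that each leaf ends up with the product of the weights along the path from the root. That interpretation is an assumption on my part, and the sketch below (with the made-up `leaf_weights` helper and abbreviated factoid names) is a toy, not the code in #2653:

```python
def leaf_weights(node, calcs, weight=1.0):
    """Map each leaf under `node` to the product of weights along its path.

    `calcs` maps a node to a list of (child, weight) pairs; nodes that are
    not keys in `calcs` are treated as leaves.
    """
    children = calcs.get(node, [])
    if not children:
        return {node: weight}
    leaves = {}
    for child, child_weight in children:
        for leaf, w in leaf_weights(child, calcs, weight * child_weight).items():
            # In a true tree each leaf appears once; summing covers repeats.
            leaves[leaf] = leaves.get(leaf, 0.0) + w
    return leaves


calcs = {
    "utility_plant_net": [("utility_plant_gross", 1.0), ("accumulated_provision", -1.0)],
    "accumulated_provision": [("depreciation", 1.0), ("amortization", 1.0)],
}
leafy = leaf_weights("utility_plant_net", calcs)
```

The point is that the `-1.0` on the intermediate node carries down to both of its leaves, so the leaves alone reproduce the root value.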
After re-implementing this to use NetworkX for all the graph stuff... there seem to be some non-tree relationships encoded in the calculations.
```python
import matplotlib.pyplot as plt
import networkx as nx
from networkx.drawing.nx_agraph import graphviz_layout

# Seed the forest with every (table_name, xbrl_factoid) pair in the metadata.
all_nodes = list(exploded_meta.set_index(["table_name", "xbrl_factoid"]).index)
new_forest = NewXbrlCalcuationForestFerc1(
    exploded_meta=exploded_meta,
    seeds=all_nodes,
    tags=meta_tags,
)
new_nx_forest = new_forest.nx_forest
new_leafy_meta = new_forest.leafy_meta
new_root_calcs = new_forest.root_calculations

# Draw the forest left-to-right using the graphviz "dot" layout.
pos = graphviz_layout(new_nx_forest, prog="dot", args='-Grankdir="LR"')
# nx.draw_networkx(new_nx_forest, pos)
nx.draw_networkx_nodes(new_nx_forest, pos)
nx.draw_networkx_edges(new_nx_forest, pos)
plt.show()

# Nodes with more than one parent break the tree structure.
multi_parents = [n for n, in_deg in new_nx_forest.in_degree() if in_deg > 1]
```
```
[NodeId(source_table='utility_plant_summary_ferc1', xbrl_factoid='depreciation_utility_plant_in_service'),
 NodeId(source_table='utility_plant_summary_ferc1', xbrl_factoid='amortization_and_depletion_of_producing_natural_gas_land_and_land_rightsutility_plant_in_service'),
 NodeId(source_table='utility_plant_summary_ferc1', xbrl_factoid='amortization_of_underground_storage_land_and_land_rightsutility_plant_in_service'),
 NodeId(source_table='utility_plant_summary_ferc1', xbrl_factoid='amortization_of_other_utility_plant_utility_plant_in_service'),
 NodeId(source_table='utility_plant_summary_ferc1', xbrl_factoid='depreciation_amortization_and_depletion_utility_plant_leased_to_others'),
 NodeId(source_table='utility_plant_summary_ferc1', xbrl_factoid='depreciation_and_amortization_utility_plant_held_for_future_use'),
 NodeId(source_table='utility_plant_summary_ferc1', xbrl_factoid='abandonment_of_leases'),
 NodeId(source_table='utility_plant_summary_ferc1', xbrl_factoid='amortization_of_plant_acquisition_adjustment'),
 NodeId(source_table='balance_sheet_assets_ferc1', xbrl_factoid='utility_plant'),
 NodeId(source_table='utility_plant_summary_ferc1', xbrl_factoid='utility_plant_in_service_classified_and_unclassified'),
 NodeId(source_table='utility_plant_summary_ferc1', xbrl_factoid='utility_plant_leased_to_others'),
 NodeId(source_table='utility_plant_summary_ferc1', xbrl_factoid='utility_plant_held_for_future_use'),
 NodeId(source_table='utility_plant_summary_ferc1', xbrl_factoid='utility_plant_acquisition_adjustment'),
 NodeId(source_table='balance_sheet_assets_ferc1', xbrl_factoid='noncurrent_portion_of_allowances'),
 NodeId(source_table='balance_sheet_assets_ferc1', xbrl_factoid='derivative_instrument_assets_long_term'),
 NodeId(source_table='balance_sheet_assets_ferc1', xbrl_factoid='derivative_instrument_assets_hedges_long_term')]
```
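For anyone without the forest handy, the multi-parent check boils down to counting incoming edges per node. Here is a dependency-free sketch of the same idea on a toy edge list; `multi_parent_nodes` is a hypothetical helper, and the edges are made up to mirror the shape of the problem rather than copied from the real graph:

```python
from collections import Counter


def multi_parent_nodes(edges):
    """Return nodes that are the child in more than one distinct edge.

    `edges` is an iterable of (parent, child) pairs, where each node is a
    (source_table, xbrl_factoid) tuple. Duplicate edges are counted once.
    """
    in_degree = Counter(child for _parent, child in set(edges))
    return sorted(node for node, degree in in_degree.items() if degree > 1)


bsa, ups = "balance_sheet_assets_ferc1", "utility_plant_summary_ferc1"
edges = [
    ((bsa, "utility_plant_and_construction_work_in_progress"), (bsa, "utility_plant")),
    ((ups, "utility_plant_and_construction_work_in_progress"), (bsa, "utility_plant")),
    ((bsa, "utility_plant_net"), (bsa, "utility_plant_and_construction_work_in_progress")),
]
shared = multi_parent_nodes(edges)
```

In a proper forest this list would be empty, since every node has at most one parent.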
- [...] `source_tables` needs to contain only a single element, after merging in changes from #2701

Weirdly, it turns out that using the `utility_plant_summary_ferc1` and `balance_sheet_assets_ferc1` facts as seeds for the forest results in exactly the same set of nodes:
```python
import importlib.resources
import json

import networkx as nx
import pandas as pd
from dagster import AssetKey

from pudl.etl import defs
from pudl.output.ferc1 import MetadataExploder, NodeId, XbrlCalculationForestFerc1

xbrl_meta = defs.load_asset_value(AssetKey("clean_xbrl_metadata_json"))
balance_sheet_asset_tables = [
    "balance_sheet_assets_ferc1",
    "utility_plant_summary_ferc1",
    "plant_in_service_ferc1",
]
# NOTE: there are a bunch of duplicate records in xbrl_factoid_rate_base_tags.csv
pkg_source = importlib.resources.files("pudl.package_data.ferc1").joinpath(
    "xbrl_factoid_rate_base_tags.csv"
)
with importlib.resources.as_file(pkg_source) as tags_csv:
    in_rate_base = pd.read_csv(
        tags_csv, usecols=["xbrl_factoid", "table_name", "in_rate_base"]
    ).drop_duplicates(subset=["table_name", "xbrl_factoid"])

exploded_meta = MetadataExploder(balance_sheet_asset_tables).boom(xbrl_meta)

# Build one list of (table_name, xbrl_factoid) seeds per table.
pis_seeds = list(
    exploded_meta[exploded_meta.table_name == "plant_in_service_ferc1"]
    .set_index(["table_name", "xbrl_factoid"])
    .index
)
ups_seeds = list(
    exploded_meta[exploded_meta.table_name == "utility_plant_summary_ferc1"]
    .set_index(["table_name", "xbrl_factoid"])
    .index
)
bsa_seeds = list(
    exploded_meta[exploded_meta.table_name == "balance_sheet_assets_ferc1"]
    .set_index(["table_name", "xbrl_factoid"])
    .index
)

bs_forest = XbrlCalculationForestFerc1(
    exploded_meta=exploded_meta,
    tags=in_rate_base,
)
pis_forest = XbrlCalculationForestFerc1(
    exploded_meta=exploded_meta,
    seeds=pis_seeds,
    tags=in_rate_base,
)
bsa_forest = XbrlCalculationForestFerc1(
    exploded_meta=exploded_meta,
    seeds=bsa_seeds,
    tags=in_rate_base,
)
ups_forest = XbrlCalculationForestFerc1(
    exploded_meta=exploded_meta,
    seeds=ups_seeds,
    tags=in_rate_base,
)
# Seeding from either table yields the same set of nodes.
assert ups_forest.nx_forest.nodes == bsa_forest.nx_forest.nodes
```
There are only 6 nodes with calculations that involve other nodes that show up in more than one calculation, and they are:
```
[NodeId(source_table='balance_sheet_assets_ferc1', xbrl_factoid='other_property_and_investments'),
 NodeId(source_table='utility_plant_summary_ferc1', xbrl_factoid='accumulated_provision_for_depreciation_amortization_and_depletion_of_plant_utility'),
 NodeId(source_table='balance_sheet_assets_ferc1', xbrl_factoid='current_and_accrued_assets'),
 NodeId(source_table='balance_sheet_assets_ferc1', xbrl_factoid='utility_plant_and_construction_work_in_progress'),
 NodeId(source_table='balance_sheet_assets_ferc1', xbrl_factoid='accumulated_provision_for_depreciation_amortization_and_depletion_of_plant_utility'),
 NodeId(source_table='utility_plant_summary_ferc1', xbrl_factoid='utility_plant_and_construction_work_in_progress')]
```
I think the major culprits are the `accumulated_provision_for_depreciation_amortization_and_depletion_of_plant_utility` and `utility_plant_and_construction_work_in_progress` factoids, which show up in both `balance_sheet_assets_ferc1` and `utility_plant_summary_ferc1` and seem to contain entirely duplicated calculations.
Here's a dictionary that maps NodeId to calculations, for the nodes that are "bad parents" (with calculations that involve duplicated facts). I'm not sure if this is the right / enough information to figure out if we can eliminate the duplication with passthrough calculations though. @e-belfer
Maybe passthrough calculations can't fix this problem? Is the real problem that we truly have the same money being reported in two places: the `utility_plant_summary_ferc1` table and the `balance_sheet_assets_ferc1` table?
I imagine that the `utility_plant_summary_ferc1` table is also hooked up to the `plant_in_service_ferc1` table with an interdimensional `utility_type` calculation.
> I think the major culprits are the `accumulated_provision_for_depreciation_amortization_and_depletion_of_plant_utility` and `utility_plant_and_construction_work_in_progress` factoids, which show up in both `balance_sheet_assets_ferc1` and `utility_plant_summary_ferc1` and seem to contain entirely duplicated calculations.
I'm not 100% following everything above, but I can confirm that both of these factoids are reported to be identical in both places in the yeti metadata. What if we drop the duplicate calculations from the root table and reassign the value from utility plant summary to point at the balance sheet assets one? Is there a way to distinguish between the factoid in its two locations?
E.g.: utility_plant_construction_in_progress (balance sheet assets) = utility_plant_construction_in_progress (utility plant summary) = components.
Ahhhhh, looking at the data tables and the forms, I think I understand more why this is showing up in two places. Sorry if this was already obvious to you. The `utility_plant_summary_ferc1` (UPS) table has another dimension (`utility_type`) that breaks down all of these accounts into more granular electric, gas, & other categories, while the `balance_sheet_assets_ferc1` (BSA) table only has the starting & ending balances for each of the factoids, totaling across all utility types.
So it seems like the redirection that you're suggesting (from BSA to UPS) would preserve the utility-type information that's present in the other dimension, and link the totals in the BSA table to the totals of all utility types in the UPS table, which would be great!
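To make the BSA/UPS relationship concrete, here is a tiny sketch of the check that redirection implies: the BSA value for a factoid should equal the UPS values summed over `utility_type`. The row layout, the `value` field, the helper name and the numbers are all illustrative assumptions, not real FERC data, and it assumes the UPS rows don't also include a pre-computed total.

```python
def total_across_utility_types(ups_rows, xbrl_factoid):
    """Sum a UPS factoid over every utility_type (hypothetical helper)."""
    return sum(
        row["value"] for row in ups_rows if row["xbrl_factoid"] == xbrl_factoid
    )


factoid = "utility_plant_and_construction_work_in_progress"
# Illustrative numbers only.
ups_rows = [
    {"xbrl_factoid": factoid, "utility_type": "electric", "value": 70.0},
    {"xbrl_factoid": factoid, "utility_type": "gas", "value": 20.0},
    {"xbrl_factoid": factoid, "utility_type": "other", "value": 10.0},
    {"xbrl_factoid": "utility_plant", "utility_type": "electric", "value": 55.0},
]
bsa_value = 100.0  # the single all-utility total reported in the BSA table
ups_total = total_across_utility_types(ups_rows, factoid)
```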
After merging in the interdimensional branch I'm now encountering two new issues with building the calculation trees:
In a couple of cases, there appears to be a conflict between weights for a given calculation, where from one source it's 1.0 and in another it's -1.0.
2023-06-29 11:59:55 [ ERROR] catalystcoop.pudl.output.ferc1:1793 Calculation weights do not match for NodeId(source_table='utility_plant_summary_ferc1', xbrl_factoid='accumulated_provision_for_depreciation_amortization_and_depletion_of_plant_utility'):1.0 != -1.0
2023-06-29 11:59:55 [ ERROR] catalystcoop.pudl.output.ferc1:1793 Calculation weights do not match for NodeId(source_table='plant_in_service_ferc1', xbrl_factoid='electric_plant_sold'):1.0 != -1.0
Should `electric_plant_sold` have a weight of -1.0 instead? I guess it shows up in one place with +1.0 and another with -1.0. Maybe the expectation I'm asserting is not correct, but it also wasn't failing before.
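The expectation being asserted is roughly "a given node has the same weight everywhere it is used". A minimal sketch of that check, with a made-up `conflicting_weights` helper and made-up parent names standing in for whatever the real edges are:

```python
from collections import defaultdict


def conflicting_weights(edges):
    """Return {child: set_of_weights} for children seen with more than one weight.

    `edges` is an iterable of (parent, child, weight) triples.
    """
    seen = defaultdict(set)
    for _parent, child, weight in edges:
        seen[child].add(weight)
    return {child: weights for child, weights in seen.items() if len(weights) > 1}


node = ("plant_in_service_ferc1", "electric_plant_sold")
edges = [
    ("hypothetical_parent_a", node, 1.0),
    ("hypothetical_parent_b", node, -1.0),
    ("hypothetical_parent_a", ("plant_in_service_ferc1", "some_other_fact"), 1.0),
]
conflicts = conflicting_weights(edges)
```

If a fact legitimately enters two different calculations with opposite signs, this check is too strict and the weight would need to live on the edge rather than the node.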
There seem to be some new inconsistencies between the exploded metadata and the calculations, including a couple of cases where the unfixed `rightsutility` string shows up rather than `rights_utility` and another one with the CWIP:
KeyError: "[
('utility_plant_summary_ferc1', 'amortization_of_underground_storage_land_and_land_rightsutility_plant_in_service'),
('utility_plant_summary_ferc1', 'amortization_and_depletion_of_producing_natural_gas_land_and_land_rightsutility_plant_in_service'),
('utility_plant_summary_ferc1', 'utility_plant_construction_work_in_progress')
] not in index"
It seems like the first two are probably just the `rightsutility` fix somehow not being propagated into either the calculations or the metadata, but I don't understand why the CWIP factoid would have gone missing. I guess this is the disappearing metadata you were talking about.
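A quick way to surface this kind of mismatch before it turns into a `KeyError` is a set difference between the keys referenced by the calculations and the keys in the metadata index. Sketch only; `missing_components` is a hypothetical helper and the `rights_utility` spelling of the metadata key is my guess at the fixed name:

```python
def missing_components(calc_keys, meta_index):
    """Return calculation keys that have no matching metadata record."""
    return sorted(set(calc_keys) - set(meta_index))


ups = "utility_plant_summary_ferc1"
meta_index = [
    # Assumed spelling of the fixed factoid name.
    (ups, "amortization_of_underground_storage_land_and_land_rights_utility_plant_in_service"),
    (ups, "utility_plant_in_service_classified_and_unclassified"),
]
calc_keys = [
    (ups, "amortization_of_underground_storage_land_and_land_rightsutility_plant_in_service"),
    (ups, "utility_plant_in_service_classified_and_unclassified"),
]
missing = missing_components(calc_keys, meta_index)
```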
closing bc we merged #2653 into explode_ferc1
investigate storage/translation of tree-like nature of these nested calculations/relationships. (Simplest current solution is to store the relationships between tables/fields in an arbitrarily nested dictionary. We'd use that dict to take a table or field and convert it into a table with all of its sub-components w/ calculated values identified -> calced using sub-components for validation -> replaced w/ sub-components)
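The nested-dictionary idea above could be flattened into table rows along these lines. This is a sketch under assumptions: the dict shape (factoid name mapping to a dict of sub-components) and the `flatten` helper are invented for illustration.

```python
def flatten(tree, parent=None, depth=0):
    """Turn a nested {factoid: {sub_factoid: {...}}} dict into a list of rows.

    A factoid with sub-components is flagged as calculated, so it can later
    be validated against, or replaced by, its sub-components.
    """
    rows = []
    for name, children in tree.items():
        rows.append(
            {
                "xbrl_factoid": name,
                "parent": parent,
                "depth": depth,
                "is_calculated": bool(children),
            }
        )
        rows.extend(flatten(children, parent=name, depth=depth + 1))
    return rows


tree = {
    "utility_plant_net": {
        "utility_plant_and_construction_work_in_progress": {
            "utility_plant": {},
            "construction_work_in_progress": {},
        },
        "utility_plant_net_correction": {},
    }
}
rows = flatten(tree)
leaves = [row["xbrl_factoid"] for row in rows if not row["is_calculated"]]
```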
for each xbrl_factoid:
Design Notes
Questions
- [...] `source_table` is a list of table names, rather than a single table name?

Example calculation (JSON):
```json
[
    {"name": "utility_plant", "weight": 1.0, "source_table": ["balance_sheet_assets_ferc1"]},
    {"name": "utility_plant_in_service_classified_and_unclassified", "weight": 1.0},
    {"name": "utility_plant_leased_to_others", "weight": 1.0},
    {"name": "utility_plant_held_for_future_use", "weight": 1.0},
    {"name": "construction_work_in_progress", "weight": 1.0},
    {"name": "utility_plant_acquisition_adjustment", "weight": 1.0},
    {"name": "utility_plant_and_construction_work_in_progress_correction", "weight": 1.0}
]
```
Classes / Methods
- `FercXbrlCalculation`: a data class built using Pydantic.

Notes from Call w/ CG
- [...] `name_original` to calculation components in `process_xbrl_metadata()`
- [...] `CalculationForest` generate an XBRL style calculation for validation
- [...] `special_funds_all` & `nuclear_fuel` metadata inside of `process_xbrl_metadata()`
- [...] `balance_sheet_assets_ferc1` calculation metadata.