catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
456 stars 106 forks source link

Transform `f1_cash_flow` xbrl + dbf #1821

Open aesharpe opened 1 year ago

aesharpe commented 1 year ago

Table Notes

Transform Plan

jrea-rmi commented 1 year ago

@aesharpe @cmgosnell I'm reviewing cash_flow_ferc1 from pudl.sqlite. This is a start to similar reviews I'll do of the balance sheet and income statement, working towards "exploded" views of each.

I'm seeing some missing fields and labels in cash_flow_ferc1. From the top few rows of this pdf version of Duke Energy Progress in 2020, I can't find the following fields:

  1. Amortization and Accretion
  2. Net (Increase) Decrease in Mark-to-Market Hedging Transactions
  3. Contributions to Company Sponsored Pension Plans

Image

I also see that the amount_type net_increase_decrease_in_cash_and_cash_equivalents doesn't have a row_type_xbrl, but should be labeled as "calculated".

cmgosnell commented 1 year ago

ah @jrea-rmi ! i believe these lines 5-7 are a part of the unstructured part of the table. the blank form includes no descriptions for these rows and I believe these lines are broken out into this table. Here is the xbrl metadata for the unstructured table that they broke out these rows.

Row 5, is hinting at being the total of the 6-7 rows in the metadata but it does not actually appear to be the total.

I did a little bit of investigating to see if the total line on row # 23 ("Net Cash Provided by (Used in) Operating Activities (Total of Lines 2 thru 21)") sums up to the total value without incorporating these unstructured values annnnnd 🥁 ... it looks like they don't. But also even with these unstructured values it doesn't look like they add up. Here is my pretty rough attempt to look at just DEP:

import pandas as pd
from dagster import AssetKey
import pudl
from pudl.etl import defs

table_name = "cash_flow_ferc1"
xbrl_factoid = "amount_type"
xbrl_meta = defs.load_asset_value(AssetKey("xbrl_metadata_json"))
cf_f1 = pd.read_sql(table_name, pudl_engine)
cash_unstrcut_operating_cash = pd.read_sql("statement_of_cash_flows_sequence_other_operating_cash_flows_120_duration", ferc1_engine_xbrl)
cash_unstruct_other_adjustments = pd.read_sql("statement_of_cash_flows_sequence_other_adjustments_to_operating_cash_flows_120_duration", ferc1_engine_xbrl)

# grab just duke for this test & set the index as the amount  type
# for easier accessing the values with the string bc this is the "xbrl_factoid"
duke_cf = cf_f1[
    (cf_f1.utility_id_ferc1 == 160)
    & (cf_f1.report_year == 2021)
].set_index([xbrl_factoid])

calculated_fact_name ="net_cash_flow_from_operating_activities"
fact_calc = next(
    fact["calculations"] for fact 
    in xbrl_meta[table_name]["duration"] # i just know this is in duration tbl. we could easily generalize this
    if fact["name"] ==calculated_fact_name
)
# get the unique weights to loop over
# could merge the weights in and do columnar weight application
# for a full table calc
weights = {step["weight"] for step in fact_calc}
sum_elements = []
for weight in weights:
    amount_types = [step["name"] for step in fact_calc if step["weight"] == weight]
    sum_elements.append(duke_cf.loc[amount_types, "amount"].sum() * weight)

if not (calced := sum(sum_elements)) == (total := duke_cf.loc[calculated_fact_name, "amount"]):
    logger.warning(
        f"Calced value ({calced}) does not equal total value ({total})\n"
        f"Diff ${(diff:= total - calced)} or {diff/total:.01%}"
    )

cash_unstrcut_operating_cash_total = cash_unstrcut_operating_cash[
    (cash_unstrcut_operating_cash.entity_id == "C000135")
    & (cash_unstrcut_operating_cash.end_date == "2021-12-31")
].noncash_adjustments_to_cash_flows_from_operating_activities.sum()

cash_unstrcut_other_adjust_total = cash_unstruct_other_adjustments[
    (cash_unstruct_other_adjustments.entity_id == "C000135")
    & (cash_unstruct_other_adjustments.end_date == "2021-12-31")
].other_adjustments_to_cash_flows_from_operating_activities.sum()

# try an adjusted total with the unstructured bits
if total != calced + cash_unstrcut_operating_cash_total + cash_unstrcut_other_adjust_total:
    diff = total - (calced + cash_unstrcut_operating_cash_total + cash_unstrcut_other_adjust_total)
    logger.info(f"Caclucated total with adjusted value off by ${diff}/{diff/total:.01%}")
image

Any thoughts? I'm curious if I'm interpreting these calcs wrong. I've tried this using no weight (the weights are either 1 or -1). and I've tried this applying the weights to the absolute values of all of the amounts. Still no dice.

also: net_increase_decrease_in_cash_and_cash_equivalents is a category that doesn't show up in the structured part of the table and the row_type_xbrl is being pulled directly from the xbrl metadata. But we could certainly just fill this in pretty easily!

zaneselvans commented 1 year ago

Looking at the XBRL calculation that's associated with "Net Cash Provided by (Used in) Operating Activities (Total of Lines 2 thru 21) it seems like there are 2 lines of the calculation that are references to unstructured data:

Unfortunately it looks like every subsection of the cashflow table contains some unstructured data so we can't check the calculation on numbers that don't have this complication.

Within the XBRL calculation, only line 5 of 5-7 is identified, so it seems like the calculation is including line 5, and completely ignoring lines 6-7, and there's no mechanism for describing or enforcing a relationship between the contents of line 5 and the contents of lines 6-7. Similarly, out of lines 18-21 only line 18 is listed in the calculation, so lines 19-21 appear to be ignored, with no clear relationship between lines 18 and lines 19-21.

Maybe I'm missing the part where you're doing this selection, but I think there are 2 records that match your entity_id + end_date criteria -- one with start_date=='2021-01-01' (whole year) and one with start_date=='2021-10-01' (4th quarter only).

jrea-rmi commented 1 year ago

The PDF format doesn't indicate that rows 5-7 are related at all, and they have well defined descriptions so I don't see how they're "unstructured". So I'm confused by the XBRL taxonomy referencing rows 5-7.

I would have thought the "unstructured" table would be for row 21 of the statement of cash flows?

In the PDF form, I see row 22 = sum of rows 2-21. Not row 23. So that's confusion about the XBRL taxonomy too.

I'm getting the subtotal in row 22 to match the sum of rows 2-21 for DEP in 2020 based on DBF data, after flipping the sign of the two rows 16 & 17 that have "(Less)" noted in the row description. That value is 1,654,271,678, can you see if you get the same thing for 2020 data?

jrea-rmi commented 1 year ago

It's too bad that row 2, net_income_loss, is labeled as a "calculated value. I was thinking we could only include reported values for table explosion, but in this case cannot.

jrea-rmi commented 1 year ago

also, the credit/debit column doesn't correspond to which values should switch sign (rows 16 & 17)

jrea-rmi commented 1 year ago

In cash_flow_ferc1 for DEP, values from rows 6 and 18-20 are missing but needed to add up to the total in row 22

jrea-rmi commented 1 year ago

Here's what I have for getting numbers to match with both dbf and xbrl formats (xbrl tab requires filling in missing values that presumably come from unstructured tables)

FERC statement of cash flows DEP subtotal check.xlsx