Transform `f1_cash_flow` xbrl + dbf

aesharpe commented 1 year ago

Table Notes

There are 8 XBRL tables (?!) that map to the f1_cash_flow table.
However, 6 (or maybe 7?) of these tables are for "other" cash flows which were previously only reported in footnotes. Those "other" tables are only human readable and poorly structured. They are incorporated into the primary, structured cash flow table as totals.
The primary well-structured cash flow table is statement_of_cash_flows_120 (instant & duration)
This is another "starting balance" + deltas => "ending balance" table, with starting / ending balances reported as "end of last year" and "end of this year". However there are also a bunch of 2019-12-31 dates in the instant table.
Watch out: DBF data is quarterly.
There are a small number of start-of-Q3 dates in the duration table (7, vs 217 or 218 for 2020-01-01 and 2021-01-01). Seems like maybe reporting error?
Looking at the PDF it seems like this is another one-column table, with lots of aggregations.
If we can leave the messy "other" categories aside and integrate them only as totals, then it becomes a simple one-to-one row map.
Confused about what the 7th "other" table is though. The six that seem clear correspond to these cols/rows:
- other_adjustments_by_outside_sources_to_cash_flows_from_financing_activities (row 67)
- other_adjustments_to_cash_flows_from_financing_activities (row 64)
- other_adjustments_to_cash_flows_from_investment_activities (row 53)
- other_adjustments_to_cash_flows_from_operating_activities (row 18)
- other_construction_and_acquisition_of_plant_investment_activities (row 31)
- other_retirements_of_balances_impacting_cash_flows_from_financing_activities (row 76)
The other freeform table appears to be:
- statement_of_cash_flows_sequence_other_operating_cash_flows_120_duration (lines 4-7?)
Lots of different sign conventions in the various rows to be careful with in aggregation.
Columns in the main table don't seem to correspond to FERC accounts. Cashflow has its own organizational scheme.
A cursory exploration of the unstructured tables and how they relate to the main structured table did not make it clear how they are supposed to be incorporated.
- Are their totals reflected in the "other" rows of the same name? The obvious df.groupby().sum() didn't add up as expected.
- Are the unstructured quantities reflected in the overall totals? Is this different for the DBF vs. XBRL tables?

Transform Plan

The main structured XBRL table is a simple 1-column + aggregations, and simple row mappings (though some of the row literals are less descriptive than they could be).
The DBF table doesn't contain the unstructured data for the first 6 unstructured tables listed above.
Lines 4-7 are unstructured and include "non-cash adjustments to cash flows from operating activities" which are pulled out into the unstructured noncash_adjustments_to_cash_flows_from_operating_activities table. Most of these describe amortization of various assets. Some of them are > $1B.

jrea-rmi commented 1 year ago

@aesharpe @cmgosnell I'm reviewing cash_flow_ferc1 from pudl.sqlite. This is a start to similar reviews I'll do of the balance sheet and income statement, working towards "exploded" views of each.

I'm seeing some missing fields and labels in cash_flow_ferc1. From the top few rows of this pdf version of Duke Energy Progress in 2020, I can't find the following fields:

Amortization and Accretion
Net (Increase) Decrease in Mark-to-Market Hedging Transactions
Contributions to Company Sponsored Pension Plans

I also see that the amount_type net_increase_decrease_in_cash_and_cash_equivalents doesn't have a row_type_xbrl, but should be labeled as "calculated".

cmgosnell commented 1 year ago

ah @jrea-rmi ! i believe these lines 5-7 are a part of the unstructured part of the table. the blank form includes no descriptions for these rows and I believe these lines are broken out into this table. Here is the xbrl metadata for the unstructured table that they broke out these rows.

Row 5, is hinting at being the total of the 6-7 rows in the metadata but it does not actually appear to be the total.

I did a little bit of investigating to see if the total line on row # 23 ("Net Cash Provided by (Used in) Operating Activities (Total of Lines 2 thru 21)") sums up to the total value without incorporating these unstructured values annnnnd 🥁 ... it looks like they don't. But also even with these unstructured values it doesn't look like they add up. Here is my pretty rough attempt to look at just DEP:

import pandas as pd
from dagster import AssetKey
import pudl
from pudl.etl import defs

table_name = "cash_flow_ferc1"
xbrl_factoid = "amount_type"
xbrl_meta = defs.load_asset_value(AssetKey("xbrl_metadata_json"))
cf_f1 = pd.read_sql(table_name, pudl_engine)
cash_unstrcut_operating_cash = pd.read_sql("statement_of_cash_flows_sequence_other_operating_cash_flows_120_duration", ferc1_engine_xbrl)
cash_unstruct_other_adjustments = pd.read_sql("statement_of_cash_flows_sequence_other_adjustments_to_operating_cash_flows_120_duration", ferc1_engine_xbrl)

# grab just duke for this test & set the index as the amount  type
# for easier accessing the values with the string bc this is the "xbrl_factoid"
duke_cf = cf_f1[
    (cf_f1.utility_id_ferc1 == 160)
    & (cf_f1.report_year == 2021)
].set_index([xbrl_factoid])

calculated_fact_name ="net_cash_flow_from_operating_activities"
fact_calc = next(
    fact["calculations"] for fact 
    in xbrl_meta[table_name]["duration"] # i just know this is in duration tbl. we could easily generalize this
    if fact["name"] ==calculated_fact_name
)
# get the unique weights to loop over
# could merge the weights in and do columnar weight application
# for a full table calc
weights = {step["weight"] for step in fact_calc}
sum_elements = []
for weight in weights:
    amount_types = [step["name"] for step in fact_calc if step["weight"] == weight]
    sum_elements.append(duke_cf.loc[amount_types, "amount"].sum() * weight)

if not (calced := sum(sum_elements)) == (total := duke_cf.loc[calculated_fact_name, "amount"]):
    logger.warning(
        f"Calced value ({calced}) does not equal total value ({total})\n"
        f"Diff ${(diff:= total - calced)} or {diff/total:.01%}"
    )

cash_unstrcut_operating_cash_total = cash_unstrcut_operating_cash[
    (cash_unstrcut_operating_cash.entity_id == "C000135")
    & (cash_unstrcut_operating_cash.end_date == "2021-12-31")
].noncash_adjustments_to_cash_flows_from_operating_activities.sum()

cash_unstrcut_other_adjust_total = cash_unstruct_other_adjustments[
    (cash_unstruct_other_adjustments.entity_id == "C000135")
    & (cash_unstruct_other_adjustments.end_date == "2021-12-31")
].other_adjustments_to_cash_flows_from_operating_activities.sum()

# try an adjusted total with the unstructured bits
if total != calced + cash_unstrcut_operating_cash_total + cash_unstrcut_other_adjust_total:
    diff = total - (calced + cash_unstrcut_operating_cash_total + cash_unstrcut_other_adjust_total)
    logger.info(f"Caclucated total with adjusted value off by ${diff}/{diff/total:.01%}")

Any thoughts? I'm curious if I'm interpreting these calcs wrong. I've tried this using no weight (the weights are either 1 or -1). and I've tried this applying the weights to the absolute values of all of the amounts. Still no dice.

also: net_increase_decrease_in_cash_and_cash_equivalents is a category that doesn't show up in the structured part of the table and the row_type_xbrl is being pulled directly from the xbrl metadata. But we could certainly just fill this in pretty easily!

zaneselvans commented 1 year ago

Looking at the XBRL calculation that's associated with "Net Cash Provided by (Used in) Operating Activities (Total of Lines 2 thru 21) it seems like there are 2 lines of the calculation that are references to unstructured data:

Noncash Adjustments to Cash Flows from Operating Activities (lines 5-7)
Other Adjustments to Cash Flows from Operating Activities (lines 18-21)

Unfortunately it looks like every subsection of the cashflow table contains some unstructured data so we can't check the calculation on numbers that don't have this complication.

Within the XBRL calculation, only line 5 of 5-7 is identified, so it seems like the calculation is including line 5, and completely ignoring lines 6-7, and there's no mechanism for describing or enforcing a relationship between the contents of line 5 and the contents of lines 6-7. Similarly, out of lines 18-21 only line 18 is listed in the calculation, so lines 19-21 appear to be ignored, with no clear relationship between lines 18 and lines 19-21.

Maybe I'm missing the part where you're doing this selection, but I think there are 2 records that match your entity_id + end_date criteria -- one with start_date=='2021-01-01' (whole year) and one with start_date=='2021-10-01' (4th quarter only).

jrea-rmi commented 1 year ago

The PDF format doesn't indicate that rows 5-7 are related at all, and they have well defined descriptions so I don't see how they're "unstructured". So I'm confused by the XBRL taxonomy referencing rows 5-7.

I would have thought the "unstructured" table would be for row 21 of the statement of cash flows?

In the PDF form, I see row 22 = sum of rows 2-21. Not row 23. So that's confusion about the XBRL taxonomy too.

I'm getting the subtotal in row 22 to match the sum of rows 2-21 for DEP in 2020 based on DBF data, after flipping the sign of the two rows 16 & 17 that have "(Less)" noted in the row description. That value is 1,654,271,678, can you see if you get the same thing for 2020 data?

jrea-rmi commented 1 year ago

It's too bad that row 2, net_income_loss, is labeled as a "calculated value. I was thinking we could only include reported values for table explosion, but in this case cannot.

jrea-rmi commented 1 year ago

also, the credit/debit column doesn't correspond to which values should switch sign (rows 16 & 17)

jrea-rmi commented 1 year ago

In cash_flow_ferc1 for DEP, values from rows 6 and 18-20 are missing but needed to add up to the total in row 22

jrea-rmi commented 1 year ago

Here's what I have for getting numbers to match with both dbf and xbrl formats (xbrl tab requires filling in missing values that presumably come from unstructured tables)

FERC statement of cash flows DEP subtotal check.xlsx

catalyst-cooperative / pudl

Transform `f1_cash_flow` xbrl + dbf #1821

Table Notes

Transform Plan