Reconcile multiple years of data in XBRL instant tables

zaneselvans commented 1 year ago

In some of the XBRL tables representing instant facts, there's more than one value for date associated with a single report_year. For example:

pis_dbf = pd.read_sql("f1_plant_in_srvce", ferc1_engine)
pis_xbrl_duration = pd.read_sql("electric_plant_in_service_204_duration", ferc1_xbrl_engine)
pis_xbrl_instant = pd.read_sql("electric_plant_in_service_204_instant", ferc1_xbrl_engine)
pis_xbrl_instant.date.value_counts()

2021-12-31    208
2020-12-31    206
Name: date, dtype: int64

How should we deal with this? What does it mean?

In the DBF tables there are sometimes "this year" and "last year" data reported next to each other for comparison. Is that what's going on here? If so, then we only need to keep the current year of data. Need to identify which tables report this way in DBF and compare them to the tables with more than 1 year of data in the XBRL to see if they correspond.

If they do correspond then we should probably explicitly drop the "previous year" data when we are merging the instant and duration data in the XBRL transformers, rather than implicitly dropping those records in the merge.
Also check that this issue isn't coming up in the duration tables with start_date and end_date.
Log when we drop this data in the pre-concatenation XBRL processing.

zaneselvans commented 1 year ago

At least in the plant_in_service table the two years of data that are reported are completely identical in both the instant table. Though strangely there are 206 records in one year and 208 in the other. Only a single year of data appears in the duration table -- but it gets doubled up in the merge as we're performing it now when we bring these two tables together.

ferc1_settings = pudl.settings.Ferc1Settings(
    tables=["plant_in_service_ferc1"],
    years=[2020, 2021],
)

raw_dbf = pudl.extract.ferc1.extract_dbf(
    ferc1_settings,
    pudl_settings,
)

raw_xbrl = pudl.extract.ferc1.extract_xbrl(
    ferc1_settings,
    pudl_settings,
)

# 414 records, 206 from 2021-12-31 & 208 from 2020-12-31
pis_xbrl_instant = raw_xbrl["plant_in_service_ferc1"]["instant"]
assert (pis_xbrl_instant.groupby(["entity_id", "date"]).nunique() > 1).sum().sum() == 0

# Only has 202 records, which aren't actually duplicated before the tables get merged.
pis_xbrl_duration = raw_xbrl["plant_in_service_ferc1"]["duration"]
assert (pis_xbrl_duration.groupby(["entity_id", "start_date"]).nunique() > 1).sum().sum() == 0

Update:

I am a dumbass and was grouping-by unique rows so of course everything was the same. Doing this correctly I find that 394 of the 414 records experienced some year-to-year change in the overall aggrete value electric_plant_in_service so this all seems clear now -- the values in the instant table are end-of-year for the year they pertain to, which lines up with the dates that are actually being used, and avoids the dumb start of year / end of year duplication. So, yay FERC?!

pis_xbrl_instant["changed"] = (pis_xbrl_instant.groupby(["entity_id"])["electric_plant_in_service"].transform("nunique") > 1)
pis_xbrl_instant[pis_xbrl_instant.changed]

zaneselvans commented 1 year ago

If we want to allow aggregation within a given report year, or we want the XBRL & DBF data to be directly comparable, then we need to translate the end-of-last-year and end-of-this-year data which is currently showing up as single columns with different date values in the instant table into separate columns that correspond to beginning of this year, and end of this year.
If we were to keep both of those columns, they would be duplicative, since we should be able to calculate the end-of-year balance based on the starting balance and accumulated adjustments across the year (additions, retirements, etc.).
If we want to keep the database normalized (within a given year) then we should store just the beginning balance and the deltas (there's another kind of aggregation happening elsewhere, within accounting categories, but here we're just concerned with the temporal aggregation across a year).
For development & validation purposes it'll be useful to keep all these start/end of year and categorical aggregated values so we can check that our aggregations reflect the reported numbers. So don't drop them yet.

zaneselvans commented 1 year ago

Not all of the data in the duration and instant tables line up with each other perfectly... There are a few entity_ids that show up ONLY in the instant table -- i.e. they report starting and/or ending balances, but no additions, retirements, etc.
Maybe this is what we would expect for a utility that has plant in service but didn't modify it at all during the year.
There are also a few utilities that don't report a starting balance, but do report changes and an ending balance. In that case it's hypothetically possible to calculate what the starting balance should have been.

zaneselvans commented 1 year ago

Weird missing data issues aside, I think it's clear what the meaning of the multiple years of data is: end of year balances, for this and last year, which get reported in the same year, but have different instantaneous timestamps / dates associated with them.

We're not currently adopting the instant vs. duration model of time. Maybe we should think about that at some point, but in our current data model we want to be able to easily aggregate data within or potentially across report_year values. In the other tables we're already dropping the start/end dates and keeping report_year. To get starting/ending balances into a single report year I think want to:

convert end-of-last-year rows into starting_balance columns
convert end-of-this-year rows into ending_balance columns
concatenate the instant and duration tables along the column axis based on shared (entity_id, report_year) index values.

This arrangement will allow us to look at a given report_year or series of report_years, and apply the many individual additions / retirements etc to the starting balance and calculate an expected ending balance that we can compare to the reported value (hopefully getting the same value), once we've done the reshaping of the concatenated XBRL table into a tidy format.

It seems like working this into the TableTransformer classes will require a pretty different path in the process_xbrl() method, but hopefully this general path can be applied to many of the tables that require reshaping.

This was mostly a research and exploration issue. I'm going to close it and move on to implementation in #2014.

catalyst-cooperative / pudl

Reconcile multiple years of data in XBRL instant tables #2021

Update: