catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
468 stars 107 forks source link

Reconcile multiple years of data in XBRL instant tables #2021

Closed zaneselvans closed 1 year ago

zaneselvans commented 1 year ago

In some of the XBRL tables representing instant facts, there's more than one value for date associated with a single report_year. For example:

pis_dbf = pd.read_sql("f1_plant_in_srvce", ferc1_engine)
pis_xbrl_duration = pd.read_sql("electric_plant_in_service_204_duration", ferc1_xbrl_engine)
pis_xbrl_instant = pd.read_sql("electric_plant_in_service_204_instant", ferc1_xbrl_engine)
pis_xbrl_instant.date.value_counts()
2021-12-31    208
2020-12-31    206
Name: date, dtype: int64

How should we deal with this? What does it mean?

In the DBF tables there are sometimes "this year" and "last year" data reported next to each other for comparison. Is that what's going on here? If so, then we only need to keep the current year of data. Need to identify which tables report this way in DBF and compare them to the tables with more than 1 year of data in the XBRL to see if they correspond.

zaneselvans commented 1 year ago

At least in the plant_in_service table the two years of data that are reported are completely identical in both the instant table. Though strangely there are 206 records in one year and 208 in the other. Only a single year of data appears in the duration table -- but it gets doubled up in the merge as we're performing it now when we bring these two tables together.

ferc1_settings = pudl.settings.Ferc1Settings(
    tables=["plant_in_service_ferc1"],
    years=[2020, 2021],
)

raw_dbf = pudl.extract.ferc1.extract_dbf(
    ferc1_settings,
    pudl_settings,
)

raw_xbrl = pudl.extract.ferc1.extract_xbrl(
    ferc1_settings,
    pudl_settings,
)

# 414 records, 206 from 2021-12-31 & 208 from 2020-12-31
pis_xbrl_instant = raw_xbrl["plant_in_service_ferc1"]["instant"]
assert (pis_xbrl_instant.groupby(["entity_id", "date"]).nunique() > 1).sum().sum() == 0

# Only has 202 records, which aren't actually duplicated before the tables get merged.
pis_xbrl_duration = raw_xbrl["plant_in_service_ferc1"]["duration"]
assert (pis_xbrl_duration.groupby(["entity_id", "start_date"]).nunique() > 1).sum().sum() == 0

Update:

I am a dumbass and was grouping-by unique rows so of course everything was the same. Doing this correctly I find that 394 of the 414 records experienced some year-to-year change in the overall aggrete value electric_plant_in_service so this all seems clear now -- the values in the instant table are end-of-year for the year they pertain to, which lines up with the dates that are actually being used, and avoids the dumb start of year / end of year duplication. So, yay FERC?!

pis_xbrl_instant["changed"] = (pis_xbrl_instant.groupby(["entity_id"])["electric_plant_in_service"].transform("nunique") > 1)
pis_xbrl_instant[pis_xbrl_instant.changed]
zaneselvans commented 1 year ago
zaneselvans commented 1 year ago
zaneselvans commented 1 year ago

Weird missing data issues aside, I think it's clear what the meaning of the multiple years of data is: end of year balances, for this and last year, which get reported in the same year, but have different instantaneous timestamps / dates associated with them.

We're not currently adopting the instant vs. duration model of time. Maybe we should think about that at some point, but in our current data model we want to be able to easily aggregate data within or potentially across report_year values. In the other tables we're already dropping the start/end dates and keeping report_year. To get starting/ending balances into a single report year I think want to:

This arrangement will allow us to look at a given report_year or series of report_years, and apply the many individual additions / retirements etc to the starting balance and calculate an expected ending balance that we can compare to the reported value (hopefully getting the same value), once we've done the reshaping of the concatenated XBRL table into a tidy format.

It seems like working this into the TableTransformer classes will require a pretty different path in the process_xbrl() method, but hopefully this general path can be applied to many of the tables that require reshaping.

This was mostly a research and exploration issue. I'm going to close it and move on to implementation in #2014.