Integrate re-stated FERC 1 filings appearing in both DBF & XBRL data

zaneselvans commented 1 year ago

Going forward whenever a FERC filing is updated, it seems like the new version will be published in XBRL. This means we may get duplicate data -- the old version that's in DBF, and the updated version in XBRL. The same thing can happen when an originally XBRL filing is updated also in XBRL, but that deduplication is already handled by the ferc_to_sqlite conversion process. Given that we know there will be a unique XBRL filing, and it will be the most recent version, I guess all we need to do is take the XBRL version, when it appears in both XBRL and DBF data. The challenge seems like it may be identifying when there's duplicate filings.

Other changes that these older re-stated filings probably necessitate include:

Updating pudl.settings.Ferc1Settings to allow XBRL data to come from any year.
Updating the ferc1_xbrl_to_sqlite settings to convert data from all historical years, not just the ones where XBRL was the primary reporting format.

cmgosnell commented 1 year ago

i'm going to try to restate the problem in my understanding and offer my own proposal given that understanding (that could be wrong!).

the problem

the last year of dbf data is 2020. xbrl data is supposed to start in 2021. there are xbrl filings that have a report_year of 2020 and correspond to 2020 dbf data. without a screening process, pudl ends up with duplicate 2020 filings. (i'll using 2020 as a stand-in for all 2020 and prior years for simplicity but i assume this can happen with older years)

a maybe wrong but simple solution?

why not just restrict the xbrl data to only 2021 and on? especially if zach already implemented a solution for the duplicate xbrl-only filings. this seems much simpler than reaching back into the old dbf data to find the corresponding record.

solution if data is being updated

is this not the proposal bc these records are actually being updated? as in ferc is re-filing a presumably better version? then we'd need a transform step that checks for duplicate filings and deletes the dbf records.

table level transform

i assume this would need to be post-xbrl/db-concatenation. could we just check for duplicate PKs? I think that would work mostly, but we could otherwise add a data_source_ferc1 column pre-concat and check each table for instances of same ["utility_id_ferc1", "report_year"] and different data_source_ferc1 and drop the data_source_ferc1 == "dbf".

ferc1 transform level transform

With the previous solution, would we miss dropping records from 2020 from specific dbf tables? As in, if an updated filing removed any small plants records in the new xbrl filing, the table-level transform would not drop the dbf records in the small plants table. Since this is a utility-level problem, we could employ a similar solution but post-transformation of all of the ferc1 tables.

zaneselvans commented 1 year ago

My understanding is that these pre-2021 XBRL filings are updates / corrections to data that was previously filed in DBF, and thus the new filings are the most recent / correct versions available, and ideally would be the only version that shows up in the PUDL DB. I think as you suggest, this duplication / update can only be addressed after we've concatenated the XBRL and DBF data (or in the process of concatenating them).

I had been thinking of identifying the duplicate (old DBF + newly filed XBRL) based on PKs, but I think you're right that this would not catch any deletions, so maybe it would be more appropriate to do it at the utility+year level. One way this could go terribly sideways is if the "updated" filings aren't complete, and are instead only providing updated data for a portion of the form. But I hope that's not the case.

zaneselvans commented 1 year ago

Initially, this probably won't be a big deal, since very few updated filings will have been submitted, but over time the number of old DBF filings that should be superseded by newer XBRL filings will grow, and if someone is relying on the data in (say) a regulatory context and they can only see the old data, and not the most recent data, that would be bad.

zaneselvans commented 1 year ago

Thinking about the timing of working on this a little more, I think we should put it off until the end of "phase 2" when we've got all of the existing PUDL tables integrated using the XBRL. I think we'll have more context as to which strategies for resolving the potential duplicates / missing data would work at that point, and for the moment this issue isn't going to affect much data, and it's not being released "in production" until after the PUDL tables are fully integrated anyway.

catalyst-cooperative / pudl