Open zaneselvans opened 1 year ago
i'm going to try to restate the problem in my understanding and offer my own proposal given that understanding (that could be wrong!).
the last year of dbf data is 2020.
xbrl data is supposed to start in 2021.
there are xbrl filings that have a report_year
of 2020 and correspond to 2020 dbf data.
without a screening process, pudl ends up with duplicate 2020 filings.
(i'll using 2020 as a stand-in for all 2020 and prior years for simplicity but i assume this can happen with older years)
why not just restrict the xbrl data to only 2021 and on? especially if zach already implemented a solution for the duplicate xbrl-only filings. this seems much simpler than reaching back into the old dbf data to find the corresponding record.
is this not the proposal bc these records are actually being updated? as in ferc is re-filing a presumably better version? then we'd need a transform step that checks for duplicate filings and deletes the dbf records.
i assume this would need to be post-xbrl/db-concatenation. could we just check for duplicate PKs? I think that would work mostly, but we could otherwise add a data_source_ferc1
column pre-concat and check each table for instances of same ["utility_id_ferc1", "report_year"]
and different data_source_ferc1
and drop the data_source_ferc1 == "dbf"
.
With the previous solution, would we miss dropping records from 2020 from specific dbf tables? As in, if an updated filing removed any small plants records in the new xbrl filing, the table-level transform would not drop the dbf records in the small plants table. Since this is a utility-level problem, we could employ a similar solution but post-transformation of all of the ferc1 tables.
My understanding is that these pre-2021 XBRL filings are updates / corrections to data that was previously filed in DBF, and thus the new filings are the most recent / correct versions available, and ideally would be the only version that shows up in the PUDL DB. I think as you suggest, this duplication / update can only be addressed after we've concatenated the XBRL and DBF data (or in the process of concatenating them).
I had been thinking of identifying the duplicate (old DBF + newly filed XBRL) based on PKs, but I think you're right that this would not catch any deletions, so maybe it would be more appropriate to do it at the utility+year level. One way this could go terribly sideways is if the "updated" filings aren't complete, and are instead only providing updated data for a portion of the form. But I hope that's not the case.
Initially, this probably won't be a big deal, since very few updated filings will have been submitted, but over time the number of old DBF filings that should be superseded by newer XBRL filings will grow, and if someone is relying on the data in (say) a regulatory context and they can only see the old data, and not the most recent data, that would be bad.
Thinking about the timing of working on this a little more, I think we should put it off until the end of "phase 2" when we've got all of the existing PUDL tables integrated using the XBRL. I think we'll have more context as to which strategies for resolving the potential duplicates / missing data would work at that point, and for the moment this issue isn't going to affect much data, and it's not being released "in production" until after the PUDL tables are fully integrated anyway.
Going forward whenever a FERC filing is updated, it seems like the new version will be published in XBRL. This means we may get duplicate data -- the old version that's in DBF, and the updated version in XBRL. The same thing can happen when an originally XBRL filing is updated also in XBRL, but that deduplication is already handled by the
ferc_to_sqlite
conversion process. Given that we know there will be a unique XBRL filing, and it will be the most recent version, I guess all we need to do is take the XBRL version, when it appears in both XBRL and DBF data. The challenge seems like it may be identifying when there's duplicate filings.Other changes that these older re-stated filings probably necessitate include:
pudl.settings.Ferc1Settings
to allow XBRL data to come from any year.ferc1_xbrl_to_sqlite
settings to convert data from all historical years, not just the ones where XBRL was the primary reporting format.