Link utility IDs from FERC 1 DBF and XBRL data

cmgosnell commented 2 years ago

Figure out how to link utilities / respondents in the old and new data using some combination of respondent_id, utility_id_ferc1 and the new company IDs that appear in the XBRL data.

We may be able to use a fuzzy string matcher as a way to make the manual mapping much easier. Here is the lil fuzzy matcher setup I use for RMI - this would 100% need some generalization to work on the IDs.

In the FERC 1 DBF data, we have respondent_id
In the FERC 1 XBRL data, we have entity_id (which may generalize across FERC forms...)
In PUDL we have utility_id_ferc1
Not all old respondent_id values can be mapped to a new entity_id because not all old utilities are still reporting.
Not all new entity_id values can be mapped to an old respondent_id because some utilities never reported back then.
We will need some way to link the respondent_id and entity_id values that do correspond to a utility that exists in both the DBF and XBRL data.
We will also need a single ID that can be used to refer to any of the FERC 1 Utilities, whether they show up in the DBF, the XBRL, or both datasets.

An initial mapping between the old DBF respondent_id and new XBRL entity_id has been done by @zschira in https://github.com/catalyst-cooperative/ferc-xbrl-extractor/pull/13

However, changes to the database schema and PUDL Utility ID mapping process will be required to manage the existence of both these types of utility ID, and the fact that neither will exist in all years for all utilities. @cmgosnell will tackle that work.

zschira commented 2 years ago

It turns out the migrated XBRL data that FERC released contains respondent_id's in the file names. This made it easy to create a mapping of respondent_id -> entity_id (entity_id is the column name they use for the Company ID) for all of the respondents included in the migrated data. I then used fuzzy string matching to attempt to map any remaining respondents. This is all contained in a notebook that is up for review here.

cmgosnell commented 2 years ago

I have been playing around with different schema options here.

I think there are AT LEAST three options here which varying impacts on schema and table transforms.

The big questions:

What schema option to choose (or adaptations to these schema options)?
- I think I like option 1 bc of its simplicity, even though option 3 is slightly more normalized.
- I included option 2 even though I think it is a little horrid mostly because it would enable NOT ripping the native IDs out of the concatenated data tables.
should we make our own new utility_id_ferc1? ( i think yes no matter how we do that so a user can look up all info abt a utility with one id)
- should it be an autoincremted ID OR a composite of the xbrl entity_id and the dbf respondent_id?
How to manage building the associations?
- I hope we can just use the pudl id mapping spreadsheet to store the association. Bc we never plan on integrating the old migrated dbf -> xbrl files that zach built the initial version of the respondent_id to entity_id map, we will probably never build this map during the elt. So we need to make this association and save it once. I'm a lil nervous that there will be weird one-off new entity_id's that will junk up this process, but adding new ID's to the pudl id mapper is already a manual part of our new data integration process. So we'll need to add some new checks to that step, but all in all this will not add a huge hurdle imo.
How to manage these IDs in "data" tables (i.e. the non-entity/ID tables)?
- I think we are wading into an entanglement of a glue table and the transform process. almost no matter how i can think about this
How to integrate this all with the utility_id_pudl mapper?
- I actually think this is relatively straightforward (i know weird!). I think we add a new entity_id/utility_id_xbrl_ferc1 column into the utilities_output tab. So that tab would have the pudl, dbf and xbrl ids. We can slurp that up and use that formation in pretty much any of these schema formations.

cmgosnell commented 2 years ago

okay @zaneselvans and I chatted about all these questions and.....

Option 3! more normalized
autoincremter! easy to maintain despite creating shifting values over time
use zach's draft association table w/ entity_id & reposondent_id. add in all the non-entity_id-associated reposondent_id's. make autoincrement utility_id_ferc1. for new years, add new entity_ids and autoincrement utility_id_ferc1
"Data" tables will have the new utility_id_ferc1 in them. so will not be stand-alone connected to the respondent_id or the entity_id. another good case for db views.
Instead of having on utilities_output tab which has all of xbrl <> dbf <>eia associations, we'll manage these associations seperately: xbrl <> dbf (which makes utility_id_ferc1) and eia <> ferc (via utility_id_ferc1) (which makes utility_id_pudl)

cmgosnell commented 2 years ago

Hm I'm run into my first true roadblock for making the utility_id_ferc1 to utility_id_ferc1_xbrl a one to one mapping.

These two "separate" entities only reported two steam plant records in 2021 each and they are complete duplicates.

catalyst-cooperative / pudl

Link utility IDs from FERC 1 DBF and XBRL data #1705