Closed cmgosnell closed 2 years ago
It turns out the migrated XBRL data that FERC released contains `respondent_id`s in the file names. This made it easy to create a mapping of `respondent_id` -> `entity_id` (`entity_id` is the column name they use for the Company ID) for all of the respondents included in the migrated data. I then used fuzzy string matching to attempt to map any remaining respondents. This is all contained in a notebook that is up for review here.
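A minimal sketch of that two-stage approach (exact mapping first, fuzzy fallback for whatever is left). It assumes respondent names are available for both ID sets. The function name, the sample names and IDs, and the 0.85 cutoff are made up for illustration, and the standard library's `difflib` stands in for whichever fuzzy matcher the notebook actually uses.

```python
from difflib import get_close_matches


def map_respondents(dbf_names, xbrl_names, known_map, cutoff=0.85):
    """Map DBF respondent_id -> XBRL entity_id.

    dbf_names:  {respondent_id: respondent name} from the DBF data
    xbrl_names: {entity_id: company name} from the XBRL data
    known_map:  {respondent_id: entity_id} already recovered from file names
    """
    mapping = dict(known_map)
    # Only offer entity_ids that the exact mapping has not already claimed.
    taken = set(mapping.values())
    candidates = {
        name.lower(): eid for eid, name in xbrl_names.items() if eid not in taken
    }
    for rid, name in dbf_names.items():
        if rid in mapping:
            continue
        hits = get_close_matches(name.lower(), list(candidates), n=1, cutoff=cutoff)
        if hits:
            # Remove the matched candidate so it cannot be assigned twice.
            mapping[rid] = candidates.pop(hits[0])
    return mapping


# Made-up sample data, not real FERC respondents.
dbf_names = {
    1: "Alabama Power Company",
    2: "Duke Energy Ohio, Inc.",
    3: "Old Defunct Utility Co",
}
xbrl_names = {
    "C000001": "Alabama Power Company",
    "C000002": "Duke Energy Ohio Inc",
    "C000003": "Brand New Electric LLC",
}
known_map = {1: "C000001"}

mapping = map_respondents(dbf_names, xbrl_names, known_map)
print(mapping)  # {1: 'C000001', 2: 'C000002'}
```

Respondent 3 stays unmapped because nothing clears the cutoff, which is the desired outcome for a utility that only exists in the old data. Keying candidates on the lowercased name means two entities with identical names would collide, so a real version would need to handle that case.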
I have been playing around with different schema options here.
I think there are AT LEAST three options here, with varying impacts on schema and table transforms.
The big questions:
- `utility_id_ferc1`? (I think yes no matter how we do that, so a user can look up all info about a utility with one ID.)
- `entity_id` and the dbf `respondent_id`?
- `respondent_id` to `entity_id` map, we will probably never build this map during the ETL. So we need to make this association and save it once. I'm a lil nervous that there will be weird one-off new `entity_id`s that will junk up this process, but adding new IDs to the PUDL ID mapper is already a manual part of our new data integration process. So we'll need to add some new checks to that step, but all in all this will not add a huge hurdle imo.
- `entity_id`/`utility_id_xbrl_ferc1` column into the `utilities_output` tab. So that tab would have the PUDL, DBF and XBRL IDs. We can slurp that up and use that information in pretty much any of these schema formations.

okay @zaneselvans and I chatted about all these questions and.....
- `entity_id` & `respondent_id`. Add in all the non-`entity_id`-associated `respondent_id`s. Make an autoincrementing `utility_id_ferc1`. For new years, add new `entity_id`s and autoincrement `utility_id_ferc1`.
- `utility_id_ferc1` in them, so will not be stand-alone connected to the `respondent_id` or the `entity_id`. Another good case for db views.
- `utilities_output` tab which has all of the xbrl <> dbf <> eia associations, we'll manage these associations separately: xbrl <> dbf (which makes `utility_id_ferc1`) and eia <> ferc (via `utility_id_ferc1`) (which makes `utility_id_pudl`).

Hm, I've run into my first true roadblock for making the `utility_id_ferc1` to `utility_id_ferc1_xbrl` a one-to-one mapping.
These two "separate" entities only reported two steam plant records in 2021 each and they are complete duplicates.
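One way to surface cases like this before they break a one-to-one mapping is to check the candidate ID pairs for violations in both directions. This is a minimal sketch in plain Python. The function name and the sample IDs are hypothetical, and the sample arbitrarily picks one direction for the duplicate since the thread does not say which way the real one runs.

```python
from collections import defaultdict


def one_to_one_violations(pairs):
    """Return the IDs on each side that are linked to more than one partner.

    pairs: iterable of (utility_id_ferc1, utility_id_ferc1_xbrl) tuples.
    """
    left, right = defaultdict(set), defaultdict(set)
    for dbf_id, xbrl_id in pairs:
        left[dbf_id].add(xbrl_id)
        right[xbrl_id].add(dbf_id)
    return (
        {k: sorted(v) for k, v in left.items() if len(v) > 1},
        {k: sorted(v) for k, v in right.items() if len(v) > 1},
    )


# Made-up IDs: utility 17 is claimed by two XBRL-side IDs.
pairs = [(16, "C000010"), (17, "C000011"), (17, "C000012")]
many_xbrl, many_dbf = one_to_one_violations(pairs)
print(many_xbrl)  # {17: ['C000011', 'C000012']}
print(many_dbf)   # {}
```

Both dictionaries being empty is the condition for the mapping to be one to one. Anything that shows up in either needs a manual decision.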
Figure out how to link utilities / respondents in the old and new data using some combination of `respondent_id`, `utility_id_ferc1` and the new company IDs that appear in the XBRL data.

We may be able to use a fuzzy string matcher as a way to make the manual mapping much easier. Here is the lil fuzzy matcher setup I use for RMI - this would 100% need some generalization to work on the IDs.
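To illustrate how a fuzzy matcher could support (rather than replace) the manual mapping, here is a rough sketch that ranks candidate names for a human to review. It is not the RMI setup mentioned above. The helper names, the normalization rule, and the sample names and IDs are all assumptions, and it uses `difflib.SequenceMatcher` from the standard library.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase and drop punctuation so cosmetic differences do not hurt the score."""
    kept = (ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join("".join(kept).split())


def rank_candidates(name, candidates, top_n=3):
    """Return up to top_n (score, candidate_id, candidate_name) tuples, best first.

    candidates: {id: name} for the ID set being matched against.
    """
    target = normalize(name)
    scored = [
        (SequenceMatcher(None, target, normalize(cand)).ratio(), cid, cand)
        for cid, cand in candidates.items()
    ]
    return sorted(scored, key=lambda row: row[0], reverse=True)[:top_n]


# Made-up sample data, not real FERC respondents.
xbrl_names = {
    "C000002": "Duke Energy Ohio Inc",
    "C000004": "Duke Energy Indiana, LLC",
    "C000003": "Brand New Electric LLC",
}
suggestions = rank_candidates("Duke Energy Ohio, Inc.", xbrl_names)
for score, cid, cand in suggestions:
    print(f"{score:.2f}  {cid}  {cand}")
```

Showing the scores alongside the names lets the person doing the mapping accept obvious matches quickly and spend their time on the ambiguous ones.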
- `respondent_id`
- `entity_id` (which may generalize across FERC forms...)
- `utility_id_ferc1`

- Not all `respondent_id` values can be mapped to a new `entity_id` because not all old utilities are still reporting.
- Not all `entity_id` values can be mapped to an old `respondent_id` because some utilities never reported back then.
- `respondent_id` and `entity_id` values that do correspond to a utility that exists in both the DBF and XBRL data.

An initial mapping between the old DBF `respondent_id` and new XBRL `entity_id` has been done by @zschira in https://github.com/catalyst-cooperative/ferc-xbrl-extractor/pull/13

However, changes to the database schema and PUDL Utility ID mapping process will be required to manage the existence of both these types of utility ID, and the fact that neither will exist in all years for all utilities. @cmgosnell will tackle that work.
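A rough sketch of one table layout that copes with both ID types, using SQLite from the Python standard library. The table name, view name, and sample IDs are hypothetical and this is not the PUDL schema. The idea is that every utility gets an autoincremented `utility_id_ferc1`, and either source ID may be NULL because neither exists for every utility.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE utilities_ferc1 (
        utility_id_ferc1 INTEGER PRIMARY KEY AUTOINCREMENT,
        respondent_id INTEGER UNIQUE,  -- DBF ID, NULL for XBRL-only utilities
        entity_id TEXT UNIQUE,         -- XBRL ID, NULL for DBF-only utilities
        CHECK (respondent_id IS NOT NULL OR entity_id IS NOT NULL)
    );
    -- A view lets tables keyed on entity_id pick up utility_id_ferc1.
    CREATE VIEW xbrl_to_utility AS
        SELECT entity_id, utility_id_ferc1
        FROM utilities_ferc1
        WHERE entity_id IS NOT NULL;
    """
)

# Made-up rows: one utility in both datasets, one DBF-only, one XBRL-only.
rows = [(1, "C000001"), (2, None), (None, "C000003")]
conn.executemany(
    "INSERT INTO utilities_ferc1 (respondent_id, entity_id) VALUES (?, ?)", rows
)

coverage = conn.execute(
    """
    SELECT utility_id_ferc1,
           CASE WHEN respondent_id IS NULL THEN 'xbrl_only'
                WHEN entity_id IS NULL THEN 'dbf_only'
                ELSE 'both' END
    FROM utilities_ferc1
    ORDER BY utility_id_ferc1
    """
).fetchall()
print(coverage)  # [(1, 'both'), (2, 'dbf_only'), (3, 'xbrl_only')]
```

The `UNIQUE` constraints keep each source ID tied to at most one `utility_id_ferc1` while still allowing many NULLs, and the `CHECK` rejects rows that have neither ID.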