catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
456 stars 106 forks source link

Link coalmine_info_eia923 table with fuel_receipts_costs_eia923 table #89

Closed alanawlsn closed 7 years ago

alanawlsn commented 7 years ago

The surrogate key (coalmine_id) in coalmine_info_eia923 needs to be populated in fuel_receipts_costs_eia923 table so the two tables can be joined.

zaneselvans commented 7 years ago

The problem here is that we have one big table which we are splitting into two tables in order to reduce data duplication -- the same mines and the same information about the mines is listed many times in fuel_receipts_costs_eia923. However, there is no natural key for the implied coalmine_info_eia923 records -- we're generating a surrogate key. But because the two tables are being populated independently, we don't know what coalmine_info_eia923 surrogate key to use when we populate the fuel_receipts_costs_eia923 table, to refer to the appropriate coalmine_info_eia923 record.

So, do we need to populate them in tandem and somehow create a record in each at the same time, with the same coalmine_id value (which we get handed back to us when the coalmine_info_eia923 record is created) or do we need to populate the coalmine_info_eia923 table first, and then do a lookup with the relevant coal mine information in each fuel_receipts_costs_eia923 record, to determine what value to use?

zaneselvans commented 7 years ago

One way to deal with this might be just to read a dataframe from the coalmine_info table after it's been populated, and merge it with the fuel_receipts_costs table based on the fields which we are requiring to be unique in coalmine_info (currently coalmine_name and coalmine_msha_id, though maybe coalmine_state would be a good one to add, since a fair number of the mines do not have an MSHA id associated with them in the FRC table). This feels like a hack, but it would be much better than what we have now.