Closed alanawlsn closed 7 years ago
The problem here is that we have one big table which we are splitting into two tables in order to reduce data duplication -- the same mines and the same information about the mines is listed many times in fuel_receipts_costs_eia923. However, there is no natural key for the implied coalmine_info_eia923 records -- we're generating a surrogate key. But because the two tables are being populated independently, we don't know what coalmine_info_eia923 surrogate key to use when we populate the fuel_receipts_costs_eia923 table, to refer to the appropriate coalmine_info_eia923 record.
So, do we need to populate them in tandem and somehow create a record in each at the same time, with the same coalmine_id value (which we get handed back to us when the coalmine_info_eia923 record is created) or do we need to populate the coalmine_info_eia923 table first, and then do a lookup with the relevant coal mine information in each fuel_receipts_costs_eia923 record, to determine what value to use?
One way to deal with this might be just to read a dataframe from the coalmine_info table after it's been populated, and merge it with the fuel_receipts_costs table based on the fields which we are requiring to be unique in coalmine_info (currently coalmine_name and coalmine_msha_id, though maybe coalmine_state would be a good one to add, since a fair number of the mines do not have an MSHA id associated with them in the FRC table). This feels like a hack, but it would be much better than what we have now.
The surrogate key (coalmine_id) in coalmine_info_eia923 needs to be populated in fuel_receipts_costs_eia923 table so the two tables can be joined.