Review resources with no primary key

A review of the resource metadata in src/pudl/package_data/meta/datapkg/datapackage.json reveals the following resources are missing primary keys, an attribute required by the new harvest process (#806).

Do additional resource primary keys exist elsewhere than src/pudl/package_data/meta/datapkg/datapackage.json?

Our task is to assign a primary key for each of these resources, and if that isn't possible, determine if and how the harvest process needs to be amended.

resource	fields
fuel_ferc1	record_id, utility_id_ferc1, report_year, plant_name_ferc1, fuel_type_code_pudl, fuel_unit, fuel_qty_burned, fuel_mmbtu_per_unit, fuel_cost_per_unit_burned, fuel_cost_per_unit_delivered, fuel_cost_per_mmbtu
load_curves_epaipm	region_id_epaipm, month, day_of_year, hour, time_index, load_mw
plant_region_map_epaipm	plant_id_eia, region
generation_eia923	plant_id_eia, generator_id, report_date, net_generation_mwh
plants_hydro_ferc1	record_id, utility_id_ferc1, report_year, plant_name_ferc1, project_num, plant_type, construction_type, construction_year, installation_year, capacity_mw, peak_demand_mw, plant_hours_connected_while_generating, net_capacity_favorable_conditions_mw, net_capacity_adverse_conditions_mw, avg_num_employees, net_generation_mwh, capex_land, capex_structures, capex_facilities, capex_equipment, capex_roads, asset_retirement_cost, capex_total, capex_per_mw, opex_operations, opex_water_for_power, opex_hydraulic, opex_electric, opex_generation_misc, opex_rents, opex_engineering, opex_structures, opex_dams, opex_plant, opex_misc_plant, opex_total, opex_per_mwh
purchased_power_ferc1	record_id, utility_id_ferc1, report_year, seller_name, purchase_type, tariff, billing_demand_mw, non_coincident_peak_demand_mw, coincident_peak_demand_mw, purchased_mwh, received_mwh, delivered_mwh, demand_charges, energy_charges, other_charges, total_settlement
generators_eia860	plant_id_eia, generator_id, report_date, operational_status_code, operational_status, ownership_code, utility_id_eia, capacity_mw, summer_capacity_mw, winter_capacity_mw, energy_source_code_1, energy_source_code_2, energy_source_code_3, energy_source_code_4, energy_source_code_5, energy_source_code_6, fuel_type_code_pudl, multiple_fuels, deliver_power_transgrid, syncronized_transmission_grid, turbines_num, planned_modifications, planned_net_summer_capacity_uprate_mw, planned_net_winter_capacity_uprate_mw, planned_uprate_date, planned_net_summer_capacity_derate_mw, planned_net_winter_capacity_derate_mw, planned_derate_date, planned_new_prime_mover_code, planned_energy_source_code_1, planned_repower_date, other_planned_modifications, other_modifications_date, planned_retirement_date, carbon_capture, startup_source_code_1, startup_source_code_2, startup_source_code_3, startup_source_code_4, technology_description, turbines_inverters_hydrokinetics, time_cold_shutdown_full_load_code, planned_new_capacity_mw, cofire_fuels, switch_oil_gas, nameplate_power_factor, minimum_load_mw, uprate_derate_during_year, uprate_derate_completed_date, current_planned_operating_date, summer_estimated_capability_mw, winter_estimated_capability_mw, retirement_date
ownership_eia860	report_date, utility_id_eia, plant_id_eia, generator_id, owner_utility_id_eia, owner_name, owner_state, owner_city, owner_street_address, owner_zip_code, fraction_owned
boiler_generator_assn_eia860	plant_id_eia, report_date, generator_id, boiler_id, unit_id_eia, unit_id_pudl, bga_source
plants_eia860	plant_id_eia, report_date, ash_impoundment, ash_impoundment_lined, ash_impoundment_status, energy_storage, ferc_cogen_docket_no, ferc_exempt_wholesale_generator_docket_no, ferc_small_power_producer_docket_no, liquefied_natural_gas_storage, natural_gas_local_distribution_company, natural_gas_storage, natural_gas_pipeline_name_1, natural_gas_pipeline_name_2, natural_gas_pipeline_name_3, net_metering, pipeline_notes, regulatory_status_code, transmission_distribution_owner_id, transmission_distribution_owner_name, transmission_distribution_owner_state, utility_id_eia, water_source
generation_fuel_eia923	plant_id_eia, report_date, nuclear_unit_id, fuel_type, fuel_type_code_pudl, fuel_type_code_aer, prime_mover_code, fuel_consumed_units, fuel_consumed_for_electricity_units, fuel_mmbtu_per_unit, fuel_consumed_mmbtu, fuel_consumed_for_electricity_mmbtu, net_generation_mwh
plants_small_ferc1	record_id, utility_id_ferc1, report_year, plant_name_original, plant_name_ferc1, plant_type, ferc_license_id, construction_year, capacity_mw, peak_demand_mw, net_generation_mwh, total_cost_of_plant, capex_per_mw, opex_total, opex_fuel, opex_maintenance, fuel_type, fuel_cost_per_mmbtu
plants_pumped_storage_ferc1	record_id, utility_id_ferc1, report_year, plant_name_ferc1, project_num, construction_type, construction_year, installation_year, capacity_mw, peak_demand_mw, plant_hours_connected_while_generating, plant_capability_mw, avg_num_employees, net_generation_mwh, energy_used_for_pumping_mwh, net_load_mwh, capex_land, capex_structures, capex_facilities, capex_wheels_turbines_generators, capex_equipment_electric, capex_equipment_misc, capex_roads, asset_retirement_cost, capex_total, capex_per_mw, opex_operations, opex_water_for_power, opex_pumped_storage, opex_electric, opex_generation_misc, opex_rents, opex_engineering, opex_structures, opex_dams, opex_plant, opex_misc_plant, opex_production_before_pumping, opex_pumping, opex_total, opex_per_mwh
boiler_fuel_eia923	plant_id_eia, boiler_id, fuel_type_code, fuel_type_code_pudl, report_date, fuel_consumed_units, fuel_mmbtu_per_unit, sulfur_content_pct, ash_content_pct
hourly_emissions_epacems	state, plant_id_eia, unitid, operating_datetime_utc, operating_time_hours, gross_load_mw, steam_load_1000_lbs, so2_mass_lbs, so2_mass_measurement_code, nox_rate_lbs_mmbtu, nox_rate_measurement_code, nox_mass_lbs, nox_mass_measurement_code, co2_mass_tons, co2_mass_measurement_code, heat_content_mmbtu, facility_id, unit_id_epa
transmission_single_epaipm	region_from, region_to, firm_ttc_mw, nonfirm_ttc_mw, tariff_mills_kwh
plants_steam_ferc1	record_id, utility_id_ferc1, report_year, plant_id_ferc1, plant_name_ferc1, plant_type, construction_type, construction_year, installation_year, capacity_mw, peak_demand_mw, plant_hours_connected_while_generating, plant_capability_mw, water_limited_capacity_mw, not_water_limited_capacity_mw, avg_num_employees, net_generation_mwh, capex_land, capex_structures, capex_equipment, capex_total, capex_per_mw, opex_operations, opex_fuel, opex_coolants, opex_steam, opex_steam_other, opex_transfer, opex_electric, opex_misc_power, opex_rents, opex_allowances, opex_engineering, opex_structures, opex_boiler, opex_plants, opex_misc_steam, opex_production_total, opex_per_mwh, asset_retirement_cost
utilities_eia860	utility_id_eia, report_date, street_address, city, state, zip_code, plants_reported_owner, plants_reported_operator, plants_reported_asset_manager, plants_reported_other_relationship
transmission_joint_epaipm	joint_constraint_id, region_from, region_to, firm_ttc_mw, nonfirm_ttc_mw

There's a subset of the tables which end up with an auto-incrementing surrogate key because they don't have natural primary keys. The one that jumps to mind right now is fuel_receipts_costs_eia923 which I notice isn't on this list, so that's interesting.

The autoincrementing surrogate keys are currently implemented inelegantly, in the process of loading the data packages into SQLite. Previously it was handled automatically by PostgreSQL. We briefly considered adding the autoincrementing surrogate keys inside of the ETL process, but didn't go down that road. Maybe it makes sense to revisit that now?

We need to decide what we're doing with all the IPM tables going forward. We've been talking about deprecating, and this would probably be the time to commit to that.

No natural primary key and we can fix them:

generation_fuel_eia923 (should be: report_date, plant_id_eia, fuel_type, prime_mover_code, nuclear_unit_id, but there are a bunch of duplicate rows in which one of the duplicates always contains all NaN values and they should probably be dropped in the transform step, )
boiler_fuel_eia923: (should be: report_date, plant_id_eia, boiler_id, fuel_type_code, but there are 4000 rows out of 1.14M that actually contain data and are duplicates on that basis, and they are all natural gas, and we could definitely aggregate them... maybe we should update the transform step)

No natural primary key and we need a surrogate key

fuel_ferc1 (should be report_year, respondent_id, plant_name, fuel_type... but ferc1 data is bad and different units within plants are getting reported with indistinguishable key fields)
plants_hydro_ferc1: should be report_year, utility_id_ferc1, plant_name_ferc1, but data is bad.
purchased_power_ferc1: There's truly no natural primary key because it represents individual purchases.
plants_small_ferc1: Should be report_year, utility_id_ferc1, plant_name_ferc1, but multiple units at the same site are being reported with the same keys, and cannot be distinguished in about 1% of records.
plants_pumped_storage_ferc1: Should be report_year, utility_id_ferc1, plant_name_ferc1 but a single plant named ludington or sometimes lundington is fucking it up for everybody.
plants_steam_ferc1: In a sane world it would be report_year, utility_id_ferc1, plant_name_ferc1 (OR PLANT_ID_FERC1 JFC) but we do not live in that world so there is no natural primary key because separate units at many plants are reported with identical information.

EIA/EPA Tables that actually do have natural primary keys

generation_eia923: report_date, plant_id_eia, generator_id
generators_eia860: report_date, plant_id_eia, generator_id
ownership_eia860: report_date, owner_utility_id_eia, plant_id_eia, generator_id
boiler_generator_assn_eia860: report_date, plant_id_eia, generator_id, boiler_id
plants_eia860: report_date, plant_id_eia
utilities_eia860: report_date, utility_id_eia
hourly_emissions_epacems: plant_id_eia, unitid, operating_datetime_utc (where unitid is the "emissions" unit aka smokestack I think)

FERC 1 Tables that actually do have natural primary keys

j/k FERC doesn't know what a database is.

Side note: there are many foreign key relationships which are not enumerated in the current version of the metadata, which we should absolutely add. They include but probably aren't limited to the following:

utility_id_eia or owner_utility_id_eia => utilities_entity_eia.utility_id_eia
[report_date, utility_id_eia or owner_utility_id_eia] => utilities_eia860.(report_date, utility_id_eia)
plant_id_eia => plants_entity_eia.plant_id_eia
[report_date, plant_id_eia] => plants_eia860.(report_date, plant_id_eia)
[plant_id_eia, generator_id] => generators_entity_eia.(plant_id_eia, generator_id)
[report_date, plant_id_eia, generator_id] => generators_eia860.(report_date, plant_id_eia, generator_id)
[plant_id_eia, boiler_id] => boilers_entity_eia.(plant_id_eia, boiler_id)

the tables which we currently use an auto-incremented id column for should are the ones in the autoincrement dictionary in the meta/datapkg/datapackage.json. I agree w/ zane that a few of these tables should be converted to have true PKs, but for the rest of them their PK should just be the id column. And truly if you needed a quick implementation for these tables that need some munging to use natural PKs, the PK right now is effectively the id column

There's a subset of the tables which end up with an auto-incrementing surrogate key because they don't have natural primary keys. The one that jumps to mind right now is fuel_receipts_costs_eia923 which I notice isn't on this list, so that's interesting.

fuel_receipts_costs_eia923 is unique in that it shows up with "fields": [ "id", ...] and "primaryKey": ["id"] in datapackage.json. I presume this was a mistake?

@cmgosnell My understanding is that the id surrogate column only comes into existence upon database import, and thus after Python ETL and datapackage creation?

No natural primary key and we need a surrogate key

Primary keys are not required. What is the purpose, for PUDL, of using auto-incrementing surrogates as primary keys? They neither ensure row uniqueness nor can they be used for table joins. I feel like the default should either be no primary key or a primary key composed of every field in the table. SQLITE uses ROWID in the absence of an explicit primary key (https://www.sqlite.org/lang_createtable.html#rowid).

If auto-incrementing keys are needed, it seems worth populating these in the ETL so that the Python, data package, and database outputs all match in form and content (at the expense of a larger dataset). This also gives you the option of using the surrogates in foreign keys if you ever dynamically deduce relationships involving these tables.

@zaneselvans I've added the keys you listed in "EIA/EPA Tables that actually do have natural primary keys". I'll address foreign keys in a separate issue.

right now the id cols get added in the load step. But we could certainly move that to happen in the transform step instead.

It does this by looking at the metadata, which really doesn't happen in the E/T steps right now so the load step felt like a logical place to do it... but that is going to change with this new harvesting process.

@cmgosnell It seems cleaner to export data packages with id as column and primary key (and dropping the autoincrement attribute from the package's datapackage.json), if that is what is also being used in database form. I can amend the harvest process to set the primary key fields as index or, if not provided, rename the default pandas integer index to id. Then, either way, you just df.to_csv().

@zaneselvans All the primary keys in your "EIA/EPA Tables that actually do have natural primary keys" are listed in autoincrement. @cmgosnell Can you confirm that these tables are in fact ready for their natural primary key?

IIRC @ezwelty I think the fuel_receipts_costs_eia923 table was a special case, where we had to add the surrogate key during the transform step because of the way that we were ripping the coalmines_eia923 table out of it. I could be mistaken on that though.

When I made that list of tables with (apparent) natural keys I was pulling the tables from the DB directly and verifying their uniqueness so... hopefully they really do work.

Again IIRC, a lot of these shenanigans came up because PostgreSQL and SQLite deal with tables lacking explicit primary keys differently, and we were trying to preserve the ability for the datapackages to get loaded into either DB, when we were transitioning away from loading into PostgreSQL. Postgres automatically added the surrogate keys, but SQLite did something different, and it broke stuff. But I don't recall what stuff.

@zaneselvans What about transmission_single_epaipm and transmission_joint_epaipm? Do they have a natural primary key?

catalyst-cooperative / pudl