catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
456 stars 106 forks source link

Review resources with no primary key #846

Closed ezwelty closed 2 years ago

ezwelty commented 3 years ago

A review of the resource metadata in src/pudl/package_data/meta/datapkg/datapackage.json reveals the following resources are missing primary keys, an attribute required by the new harvest process (#806).

Do additional resource primary keys exist elsewhere than src/pudl/package_data/meta/datapkg/datapackage.json?

Our task is to assign a primary key for each of these resources, and if that isn't possible, determine if and how the harvest process needs to be amended.

resource fields
fuel_ferc1 record_id, utility_id_ferc1, report_year, plant_name_ferc1, fuel_type_code_pudl, fuel_unit, fuel_qty_burned, fuel_mmbtu_per_unit, fuel_cost_per_unit_burned, fuel_cost_per_unit_delivered, fuel_cost_per_mmbtu
load_curves_epaipm region_id_epaipm, month, day_of_year, hour, time_index, load_mw
plant_region_map_epaipm plant_id_eia, region
generation_eia923 plant_id_eia, generator_id, report_date, net_generation_mwh
plants_hydro_ferc1 record_id, utility_id_ferc1, report_year, plant_name_ferc1, project_num, plant_type, construction_type, construction_year, installation_year, capacity_mw, peak_demand_mw, plant_hours_connected_while_generating, net_capacity_favorable_conditions_mw, net_capacity_adverse_conditions_mw, avg_num_employees, net_generation_mwh, capex_land, capex_structures, capex_facilities, capex_equipment, capex_roads, asset_retirement_cost, capex_total, capex_per_mw, opex_operations, opex_water_for_power, opex_hydraulic, opex_electric, opex_generation_misc, opex_rents, opex_engineering, opex_structures, opex_dams, opex_plant, opex_misc_plant, opex_total, opex_per_mwh
purchased_power_ferc1 record_id, utility_id_ferc1, report_year, seller_name, purchase_type, tariff, billing_demand_mw, non_coincident_peak_demand_mw, coincident_peak_demand_mw, purchased_mwh, received_mwh, delivered_mwh, demand_charges, energy_charges, other_charges, total_settlement
generators_eia860 plant_id_eia, generator_id, report_date, operational_status_code, operational_status, ownership_code, utility_id_eia, capacity_mw, summer_capacity_mw, winter_capacity_mw, energy_source_code_1, energy_source_code_2, energy_source_code_3, energy_source_code_4, energy_source_code_5, energy_source_code_6, fuel_type_code_pudl, multiple_fuels, deliver_power_transgrid, syncronized_transmission_grid, turbines_num, planned_modifications, planned_net_summer_capacity_uprate_mw, planned_net_winter_capacity_uprate_mw, planned_uprate_date, planned_net_summer_capacity_derate_mw, planned_net_winter_capacity_derate_mw, planned_derate_date, planned_new_prime_mover_code, planned_energy_source_code_1, planned_repower_date, other_planned_modifications, other_modifications_date, planned_retirement_date, carbon_capture, startup_source_code_1, startup_source_code_2, startup_source_code_3, startup_source_code_4, technology_description, turbines_inverters_hydrokinetics, time_cold_shutdown_full_load_code, planned_new_capacity_mw, cofire_fuels, switch_oil_gas, nameplate_power_factor, minimum_load_mw, uprate_derate_during_year, uprate_derate_completed_date, current_planned_operating_date, summer_estimated_capability_mw, winter_estimated_capability_mw, retirement_date
ownership_eia860 report_date, utility_id_eia, plant_id_eia, generator_id, owner_utility_id_eia, owner_name, owner_state, owner_city, owner_street_address, owner_zip_code, fraction_owned
boiler_generator_assn_eia860 plant_id_eia, report_date, generator_id, boiler_id, unit_id_eia, unit_id_pudl, bga_source
plants_eia860 plant_id_eia, report_date, ash_impoundment, ash_impoundment_lined, ash_impoundment_status, energy_storage, ferc_cogen_docket_no, ferc_exempt_wholesale_generator_docket_no, ferc_small_power_producer_docket_no, liquefied_natural_gas_storage, natural_gas_local_distribution_company, natural_gas_storage, natural_gas_pipeline_name_1, natural_gas_pipeline_name_2, natural_gas_pipeline_name_3, net_metering, pipeline_notes, regulatory_status_code, transmission_distribution_owner_id, transmission_distribution_owner_name, transmission_distribution_owner_state, utility_id_eia, water_source
generation_fuel_eia923 plant_id_eia, report_date, nuclear_unit_id, fuel_type, fuel_type_code_pudl, fuel_type_code_aer, prime_mover_code, fuel_consumed_units, fuel_consumed_for_electricity_units, fuel_mmbtu_per_unit, fuel_consumed_mmbtu, fuel_consumed_for_electricity_mmbtu, net_generation_mwh
plants_small_ferc1 record_id, utility_id_ferc1, report_year, plant_name_original, plant_name_ferc1, plant_type, ferc_license_id, construction_year, capacity_mw, peak_demand_mw, net_generation_mwh, total_cost_of_plant, capex_per_mw, opex_total, opex_fuel, opex_maintenance, fuel_type, fuel_cost_per_mmbtu
plants_pumped_storage_ferc1 record_id, utility_id_ferc1, report_year, plant_name_ferc1, project_num, construction_type, construction_year, installation_year, capacity_mw, peak_demand_mw, plant_hours_connected_while_generating, plant_capability_mw, avg_num_employees, net_generation_mwh, energy_used_for_pumping_mwh, net_load_mwh, capex_land, capex_structures, capex_facilities, capex_wheels_turbines_generators, capex_equipment_electric, capex_equipment_misc, capex_roads, asset_retirement_cost, capex_total, capex_per_mw, opex_operations, opex_water_for_power, opex_pumped_storage, opex_electric, opex_generation_misc, opex_rents, opex_engineering, opex_structures, opex_dams, opex_plant, opex_misc_plant, opex_production_before_pumping, opex_pumping, opex_total, opex_per_mwh
boiler_fuel_eia923 plant_id_eia, boiler_id, fuel_type_code, fuel_type_code_pudl, report_date, fuel_consumed_units, fuel_mmbtu_per_unit, sulfur_content_pct, ash_content_pct
hourly_emissions_epacems state, plant_id_eia, unitid, operating_datetime_utc, operating_time_hours, gross_load_mw, steam_load_1000_lbs, so2_mass_lbs, so2_mass_measurement_code, nox_rate_lbs_mmbtu, nox_rate_measurement_code, nox_mass_lbs, nox_mass_measurement_code, co2_mass_tons, co2_mass_measurement_code, heat_content_mmbtu, facility_id, unit_id_epa
transmission_single_epaipm region_from, region_to, firm_ttc_mw, nonfirm_ttc_mw, tariff_mills_kwh
plants_steam_ferc1 record_id, utility_id_ferc1, report_year, plant_id_ferc1, plant_name_ferc1, plant_type, construction_type, construction_year, installation_year, capacity_mw, peak_demand_mw, plant_hours_connected_while_generating, plant_capability_mw, water_limited_capacity_mw, not_water_limited_capacity_mw, avg_num_employees, net_generation_mwh, capex_land, capex_structures, capex_equipment, capex_total, capex_per_mw, opex_operations, opex_fuel, opex_coolants, opex_steam, opex_steam_other, opex_transfer, opex_electric, opex_misc_power, opex_rents, opex_allowances, opex_engineering, opex_structures, opex_boiler, opex_plants, opex_misc_steam, opex_production_total, opex_per_mwh, asset_retirement_cost
utilities_eia860 utility_id_eia, report_date, street_address, city, state, zip_code, plants_reported_owner, plants_reported_operator, plants_reported_asset_manager, plants_reported_other_relationship
transmission_joint_epaipm joint_constraint_id, region_from, region_to, firm_ttc_mw, nonfirm_ttc_mw
zaneselvans commented 3 years ago

There's a subset of the tables which end up with an auto-incrementing surrogate key because they don't have natural primary keys. The one that jumps to mind right now is fuel_receipts_costs_eia923 which I notice isn't on this list, so that's interesting.

The autoincrementing surrogate keys are currently implemented inelegantly, in the process of loading the data packages into SQLite. Previously it was handled automatically by PostgreSQL. We briefly considered adding the autoincrementing surrogate keys inside of the ETL process, but didn't go down that road. Maybe it makes sense to revisit that now?

We need to decide what we're doing with all the IPM tables going forward. We've been talking about deprecating, and this would probably be the time to commit to that.

No natural primary key and we can fix them:

No natural primary key and we need a surrogate key

EIA/EPA Tables that actually do have natural primary keys

FERC 1 Tables that actually do have natural primary keys

zaneselvans commented 3 years ago

Side note: there are many foreign key relationships which are not enumerated in the current version of the metadata, which we should absolutely add. They include but probably aren't limited to the following:

cmgosnell commented 3 years ago

the tables which we currently use an auto-incremented id column for should are the ones in the autoincrement dictionary in the meta/datapkg/datapackage.json. I agree w/ zane that a few of these tables should be converted to have true PKs, but for the rest of them their PK should just be the id column. And truly if you needed a quick implementation for these tables that need some munging to use natural PKs, the PK right now is effectively the id column

ezwelty commented 3 years ago

There's a subset of the tables which end up with an auto-incrementing surrogate key because they don't have natural primary keys. The one that jumps to mind right now is fuel_receipts_costs_eia923 which I notice isn't on this list, so that's interesting.

fuel_receipts_costs_eia923 is unique in that it shows up with "fields": [ "id", ...] and "primaryKey": ["id"] in datapackage.json. I presume this was a mistake?

@cmgosnell My understanding is that the id surrogate column only comes into existence upon database import, and thus after Python ETL and datapackage creation?

No natural primary key and we need a surrogate key

Primary keys are not required. What is the purpose, for PUDL, of using auto-incrementing surrogates as primary keys? They neither ensure row uniqueness nor can they be used for table joins. I feel like the default should either be no primary key or a primary key composed of every field in the table. SQLITE uses ROWID in the absence of an explicit primary key (https://www.sqlite.org/lang_createtable.html#rowid).

If auto-incrementing keys are needed, it seems worth populating these in the ETL so that the Python, data package, and database outputs all match in form and content (at the expense of a larger dataset). This also gives you the option of using the surrogates in foreign keys if you ever dynamically deduce relationships involving these tables.

@zaneselvans I've added the keys you listed in "EIA/EPA Tables that actually do have natural primary keys". I'll address foreign keys in a separate issue.

cmgosnell commented 3 years ago

right now the id cols get added in the load step. But we could certainly move that to happen in the transform step instead.

It does this by looking at the metadata, which really doesn't happen in the E/T steps right now so the load step felt like a logical place to do it... but that is going to change with this new harvesting process.

ezwelty commented 3 years ago

@cmgosnell It seems cleaner to export data packages with id as column and primary key (and dropping the autoincrement attribute from the package's datapackage.json), if that is what is also being used in database form. I can amend the harvest process to set the primary key fields as index or, if not provided, rename the default pandas integer index to id. Then, either way, you just df.to_csv().

@zaneselvans All the primary keys in your "EIA/EPA Tables that actually do have natural primary keys" are listed in autoincrement. @cmgosnell Can you confirm that these tables are in fact ready for their natural primary key?

zaneselvans commented 3 years ago

IIRC @ezwelty I think the fuel_receipts_costs_eia923 table was a special case, where we had to add the surrogate key during the transform step because of the way that we were ripping the coalmines_eia923 table out of it. I could be mistaken on that though.

When I made that list of tables with (apparent) natural keys I was pulling the tables from the DB directly and verifying their uniqueness so... hopefully they really do work.

Again IIRC, a lot of these shenanigans came up because PostgreSQL and SQLite deal with tables lacking explicit primary keys differently, and we were trying to preserve the ability for the datapackages to get loaded into either DB, when we were transitioning away from loading into PostgreSQL. Postgres automatically added the surrogate keys, but SQLite did something different, and it broke stuff. But I don't recall what stuff.

ezwelty commented 3 years ago

@zaneselvans What about transmission_single_epaipm and transmission_joint_epaipm? Do they have a natural primary key?