catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
456 stars 106 forks source link

Review inconsistent field attributes for fields of the same name #845

Closed ezwelty closed 2 years ago

ezwelty commented 3 years ago

A review of the field metadata in src/pudl/package_data/meta/datapkg/datapackage.json reveals the following inconsistencies in constraints.enum and description for fields of the same name across different resources.

Does additional field metadata exist elsewhere than src/pudl/package_data/meta/datapkg/datapackage.json?

Our task is to choose one standard attribute value where possible, rename the field in some resources as needed, and determine whether field metadata can be universal or needs to be customizable by resource (e.g. service_territory_eia861, respondent_id_ferc714).

Fields are removed from the tables below as they are resolved:

field constraints.enum resources
state AL, AR, AZ, CA, CO, CT, DC, DE, FL, GA, IA, ID, IL, IN, KS, KY, LA, MA, MD, ME, MI, MN, MO, MS, MT, NC, ND, NE, NH, NJ, NM, NV, NY, OH, OK, OR, PA, RI, SC, SD, TN, TX, UT, VA, VT, WA, WI, WV, WY hourly_emissions_epacems
  AUS, COL, CAN, IDN, POL, RUS, GBR, VEN, other_country, unknown, AK, AL, AR, AS, AZ, CA, CO, CT, DC, DE, FL, GA, GU, HI, IA, ID, IL, IN, KS, KY, LA, MA, MD, ME, MI, MN, MO, MP, MS, MT, NA, NC, ND, NE, NH, NJ, NM, NV, NY, OH, OK, OR, PA, PR, RI, SC, SD, TN, TX, UT, VA, VI, VT, WA, WI, WV, WY coalmine_eia923
field description resources
avg_num_employees Average number of employees. plants_hydro_ferc1, plants_pumped_storage_ferc1
  Average number of plant employees during report year. plants_steam_ferc1
boiler_id Boiler identification code. Alphanumeric. boiler_fuel_eia923
  EIA-assigned boiler identification code. boiler_generator_assn_eia860
  The EIA-assigned boiler identification code. Alphanumeric. boilers_entity_eia
capacity_mw Name plate capacity in megawatts. plants_small_ferc1
  The highest value on the generator nameplate in megawatts rounded to the nearest tenth. generators_eia860
  Total installed (nameplate) capacity, in megawatts. plants_hydro_ferc1, plants_pumped_storage_ferc1
  Total installed plant capacity in MW. plants_steam_ferc1
capex_per_mw Capital expenses per MW of installed plant capacity. plants_steam_ferc1
  Cost of plant per megawatt of installed (nameplate) capacity. Nominal USD. plants_hydro_ferc1, plants_pumped_storage_ferc1
  Plant costs (including asset retirement costs) per megawatt. Nominal USD. plants_small_ferc1
city Name of the city in which operator/owner is located utilities_eia860
  The plant's city. plants_entity_eia
construction_year Four digit year of the plant's original construction. plants_hydro_ferc1, plants_pumped_storage_ferc1
  Original year of plant construction. plants_small_ferc1
  Year the plant's oldest still operational unit was built. plants_steam_ferc1
description Description of the FERC depreciation account, as listed on FERC Form 1, Page 219. ferc_depreciation_lines
  Long description of the FERC Account. ferc_accounts
fuel_cost_per_mmbtu All costs incurred in the purchase and delivery of the fuel to the plant in cents per million Btu(MMBtu) to the nearest 0.1 cent. fuel_receipts_costs_eia923
  Average cost of fuel consumed in the report year, in nominal USD per mmBTU of fuel heat content. fuel_ferc1
  Average fuel cost per mmBTU (if applicable). Nominal USD. plants_small_ferc1
fuel_mmbtu_per_unit Average heat content of fuel consumed in the report year, in mmBTU per reported fuel unit. fuel_ferc1
  Heat content of the fuel in millions of Btus per physical unit. generation_fuel_eia923, boiler_fuel_eia923
fuel_type Kind of fuel. Originally reported to FERC as a freeform string. Assigned a canonical value by PUDL based on our best guess. plants_small_ferc1
  The fuel code reported to EIA. Two or three letter alphanumeric. generation_fuel_eia923
fuel_type_code_pudl PUDL assigned code indicating the general fuel type. fuel_ferc1
  Standardized fuel codes in PUDL. generators_eia860, fuel_receipts_costs_eia923, generation_fuel_eia923, boiler_fuel_eia923
generator_id EIA-assigned generator identification code. boiler_generator_assn_eia860
  Generator identification code. Often numeric, but sometimes includes letters. It's a string! generation_eia923
  Generator identification number generators_entity_eia
  Generator identification number. generators_eia860, ownership_eia860
installation_year Four digit year in which the last unit was installed. plants_hydro_ferc1, plants_pumped_storage_ferc1
  Year the plant's most recently built unit was installed. plants_steam_ferc1
line_id A human readable string uniquely identifying the FERC depreciation account. Used in lieu of the actual line number, as those numbers are not guaranteed to be consistent from year to year. ferc_depreciation_lines
  Line numbers, and corresponding FERC account number from FERC Form 1, page 2019, Accumulated Provision for Depreciation of Electric Utility Plant (Account 108). accumulated_depreciation_ferc1
net_generation_mwh Net generation (exclusive of plant use) in MWh during report year. plants_steam_ferc1
  Net generation excluding plant use, in megawatt-hours. plants_small_ferc1
  Net generation for specified period in megawatthours (MWh). generation_eia923
  Net generation, exclusive of plant use, in megawatt hours. plants_hydro_ferc1, plants_pumped_storage_ferc1
  Net generation, year to date in megawatthours (MWh). This is total electrical output net of station service. In the case of combined heat and power plants, this value is intended to include internal consumption of electricity for the purposes of a production process, as well as power put on the grid. generation_fuel_eia923
peak_demand_mw Net peak demand experienced by the plant in MW in report year. plants_steam_ferc1
  Net peak demand for 60 minutes. Note: in some cases peak demand for other time periods may have been reported instead, if hourly peak demand was unavailable. plants_small_ferc1
  Net peak demand on the plant (60-minute integration), in megawatts. plants_hydro_ferc1, plants_pumped_storage_ferc1
plant_capability_mw Net continuous plant capability in MW plants_steam_ferc1
  Net plant capability in megawatts. plants_pumped_storage_ferc1
plant_hours_connected_while_generating Hours the plant was connected to load while generating. plants_hydro_ferc1, plants_pumped_storage_ferc1
  Total number hours the plant was generated and connected to load during report year. plants_steam_ferc1
plant_name_ferc1 Name of the plant, as reported to FERC. This is a freeform string, not guaranteed to be consistent across references to the same plant. fuel_ferc1, plants_ferc1, plants_hydro_ferc1, plants_pumped_storage_ferc1, plants_steam_ferc1
  PUDL assigned simplified plant name. plants_small_ferc1
plant_type Kind of plant (Run-of-River or Storage). plants_hydro_ferc1
  PUDL assigned plant type. This is a best guess based on the fuel type, plant name, and other attributes. plants_small_ferc1
  Simplified plant type, categorized by PUDL based on our best guess of what was intended based on freeform string reported to FERC. Unidentifiable types are null. plants_steam_ferc1
prime_mover_code Code for the type of prime mover (e.g. CT, CG) boilers_entity_eia
  EIA assigned code for the prime mover (i.e. the engine, turbine, water wheel, or similar machine that drives an electric generator) generators_entity_eia
  Type of prime mover. generation_fuel_eia923
state Plant state. Two letter US state and territory abbreviations. plants_entity_eia
  State of the operator/owner utilities_eia860
  State the plant is located in. hourly_emissions_epacems
  Two letter US state abbreviations and three letter ISO-3166-1 country codes for international mines. coalmine_eia923
street_address Plant street address plants_entity_eia
  Street address of the operator/owner utilities_eia860
utility_id_eia EIA-assigned identification number for the company that is responsible for the day-to-day operations of the generator. generators_eia860, ownership_eia860, plants_eia860, utilities_eia860
  The EIA Utility Identification number. utilities_eia, utilities_entity_eia
zip_code Plant street address plants_entity_eia
  Zip code of the operator/owner utilities_eia860
zaneselvans commented 3 years ago

I almost certainly typed most of these back in the day.

cmgosnell commented 3 years ago

After reading through all of these, I agree w/ a lot of what zane said. Outside of the enums the columns in here that seem like they need to be broken out are:

I have apprehension about breaking out fuel_type into fuel_type_eia and fuel_type_ferc1 because we have conveniently squished them to be comparable/merge-able columns, but they are generated in very different ways.

ezwelty commented 3 years ago

All the unresolved fields listed above have been added to pudl.metadata.fields with only a name and a type. You can fill in their description and constraints.enum (or rename them) as you decide how to resolve them. As needed, you can override the default field metadata for a particular resource in the resource's metadata in pudl.metadata.resources:

'resource': {
    'schema': {
        'fields': ['default_field', {'name': 'custom_field', 'description': 'Custom description'}]
    }
}