catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
456 stars 106 forks source link

Can foreign keys be populated (semi) automatically? #854

Closed ezwelty closed 3 years ago

ezwelty commented 3 years ago

Side note: there are many foreign key relationships which are not enumerated in the current version of the metadata, which we should absolutely add. They include but probably aren't limited to the following:

  • utility_id_eia or owner_utility_id_eia => utilities_entity_eia.utility_id_eia
  • [report_date, utility_id_eia or owner_utility_id_eia] => utilities_eia860.(report_date, utility_id_eia)
  • plant_id_eia => plants_entity_eia.plant_id_eia
  • [report_date, plant_id_eia] => plants_eia860.(report_date, plant_id_eia)
  • [plant_id_eia, generator_id] => generators_entity_eia.(plant_id_eia, generator_id)
  • [report_date, plant_id_eia, generator_id] => generators_eia860.(report_date, plant_id_eia, generator_id)
  • [plant_id_eia, boiler_id] => boilers_entity_eia.(plant_id_eia, boiler_id)

Originally posted by @zaneselvans in https://github.com/catalyst-cooperative/pudl/issues/846#issuecomment-742178851

It seems risky to populate all foreign key relationships manually. Leveraging the fact that currently local fields of the same name all map to the same reference primary key (probably a good convention to maintain), we can succinctly express foreign keys as a 1:1 mapping between local fields and reference primary key (which typically, has the names as the local fields):

<local_fields> => <reference_name>.(reference_fields: optional)

Based on these rules, we can generate all candidate foreign keys, resolve key chains, and prune redundant keys. For example, for resource generation_eia923, the candidates:

Resolve to:

We prune key [1] because it is a subset of key [0], so we finally have:

Using this technique, we find all existing foreign keys, as well as several that were missed (see below). Are these all valid?

Are you interested in my using this technique to populate foreign keys automatically? A compromise would be a helper method that reveals potential missing keys from a succinct mapping, but requires a human to type the foreign keys into the metadata.

# - `accumulated_depreciation_ferc1`.(line_id) => `ferc_depreciation_lines`.(line_id)
# - `accumulated_depreciation_ferc1`.(utility_id_ferc1) => `utilities_ferc1`.(utility_id_ferc1)
# - `boiler_fuel_eia923`.(fuel_type_code) => `fuel_type_eia923`.(abbr)
+ - `boiler_fuel_eia923`.(plant_id_eia) => `plants_entity_eia`.(plant_id_eia)
# - `boiler_generator_assn_eia860`.(generator_id, plant_id_eia) => `generators_entity_eia`.(generator_id, plant_id_eia)
# - `boilers_entity_eia`.(plant_id_eia) => `plants_entity_eia`.(plant_id_eia)
+ - `boilers_entity_eia`.(prime_mover_code) => `prime_movers_eia923`.(abbr)
# - `fuel_ferc1`.(plant_name_ferc1, utility_id_ferc1) => `plants_ferc1`.(plant_name_ferc1, utility_id_ferc1)
# - `fuel_receipts_costs_eia923`.(energy_source_code) => `energy_source_eia923`.(abbr)
# - `fuel_receipts_costs_eia923`.(mine_id_pudl) => `coalmine_eia923`.(mine_id_pudl)
# - `fuel_receipts_costs_eia923`.(plant_id_eia) => `plants_entity_eia`.(plant_id_eia)
# - `fuel_receipts_costs_eia923`.(primary_transportation_mode_code) => `transport_modes_eia923`.(abbr)
# - `fuel_receipts_costs_eia923`.(secondary_transportation_mode_code) => `transport_modes_eia923`.(abbr)
+ - `fuel_type_aer_eia923`.(fuel_type) => `fuel_type_eia923`.(abbr)
# - `generation_eia923`.(generator_id, plant_id_eia) => `generators_entity_eia`.(generator_id, plant_id_eia)
# - `generation_fuel_eia923`.(fuel_type) => `fuel_type_eia923`.(abbr)
# - `generation_fuel_eia923`.(fuel_type_code_aer) => `fuel_type_aer_eia923`.(abbr)
# - `generation_fuel_eia923`.(plant_id_eia) => `plants_entity_eia`.(plant_id_eia)
# - `generation_fuel_eia923`.(prime_mover_code) => `prime_movers_eia923`.(abbr)
# - `generators_eia860`.(generator_id, plant_id_eia) => `generators_entity_eia`.(generator_id, plant_id_eia)
# - `generators_eia860`.(utility_id_eia) => `utilities_entity_eia`.(utility_id_eia)
# - `generators_entity_eia`.(plant_id_eia) => `plants_entity_eia`.(plant_id_eia)
+ - `generators_entity_eia`.(prime_mover_code) => `prime_movers_eia923`.(abbr)
+ - `hourly_emissions_epacems`.(plant_id_eia) => `plants_entity_eia`.(plant_id_eia)
# - `load_curves_epaipm`.(region_id_epaipm) => `regions_entity_epaipm`.(region_id_epaipm)
# - `ownership_eia860`.(generator_id, plant_id_eia) => `generators_entity_eia`.(generator_id, plant_id_eia)
# - `ownership_eia860`.(utility_id_eia) => `utilities_entity_eia`.(utility_id_eia)
# - `plant_in_service_ferc1`.(utility_id_ferc1) => `utilities_ferc1`.(utility_id_ferc1)
+ - `plant_region_map_epaipm`.(plant_id_eia) => `plants_entity_eia`.(plant_id_eia)
# - `plant_region_map_epaipm`.(region) => `regions_entity_epaipm`.(region_id_epaipm)
# - `plants_eia860`.(plant_id_eia) => `plants_entity_eia`.(plant_id_eia)
+ - `plants_eia860`.(utility_id_eia) => `utilities_entity_eia`.(utility_id_eia)
+ - `plants_eia`.(plant_id_eia) => `plants_entity_eia`.(plant_id_eia)
# - `plants_eia`.(plant_id_pudl) => `plants_pudl`.(plant_id_pudl)
# - `plants_ferc1`.(plant_id_pudl) => `plants_pudl`.(plant_id_pudl)
# - `plants_ferc1`.(utility_id_ferc1) => `utilities_ferc1`.(utility_id_ferc1)
# - `plants_hydro_ferc1`.(plant_name_ferc1, utility_id_ferc1) => `plants_ferc1`.(plant_name_ferc1, utility_id_ferc1)
# - `plants_pumped_storage_ferc1`.(plant_name_ferc1, utility_id_ferc1) => `plants_ferc1`.(plant_name_ferc1, utility_id_ferc1)
+ - `plants_small_ferc1`.(fuel_type) => `fuel_type_eia923`.(abbr)
+ - `plants_small_ferc1`.(plant_name_ferc1, utility_id_ferc1) => `plants_ferc1`.(plant_name_ferc1, utility_id_ferc1)
# - `plants_small_ferc1`.(plant_name_original, utility_id_ferc1) => `plants_ferc1`.(plant_name_ferc1, utility_id_ferc1)
# - `plants_steam_ferc1`.(plant_name_ferc1, utility_id_ferc1) => `plants_ferc1`.(plant_name_ferc1, utility_id_ferc1)
+ - `purchased_power_ferc1`.(utility_id_ferc1) => `utilities_ferc1`.(utility_id_ferc1)
# - `transmission_joint_epaipm`.(region_from) => `regions_entity_epaipm`.(region_id_epaipm)
# - `transmission_joint_epaipm`.(region_to) => `regions_entity_epaipm`.(region_id_epaipm)
# - `transmission_single_epaipm`.(region_from) => `regions_entity_epaipm`.(region_id_epaipm)
# - `transmission_single_epaipm`.(region_to) => `regions_entity_epaipm`.(region_id_epaipm)
# - `utilities_eia860`.(utility_id_eia) => `utilities_entity_eia`.(utility_id_eia)
+ - `utilities_eia`.(utility_id_eia) => `utilities_entity_eia`.(utility_id_eia)
# - `utilities_eia`.(utility_id_pudl) => `utilities_pudl`.(utility_id_pudl)
# - `utilities_ferc1`.(utility_id_pudl) => `utilities_pudl`.(utility_id_pudl)
# - `utility_plant_assn`.(plant_id_pudl) => `plants_pudl`.(plant_id_pudl)
# - `utility_plant_assn`.(utility_id_pudl) => `utilities_pudl`.(utility_id_pudl)
zaneselvans commented 3 years ago

Yeah, when I was enumerating all those foreign key relations I was thinking it really should be semi-automated. I think some version of this is probably a good idea, and it's an additional benefit of the naming conventions we've adopted.

I'm not familiar with the concept of "key chains." Is this where you have a composite key in table A that refers to fields in table B, and then some subset of that composite key in table B refers to table C? And rather than specifying foreign keys from A to B, from B to C, and from A to C, you rely on the A to B and B to C references to create an implicit A to C relationship? Is that how one typically deals with these kinds of relationships, so as not to specify the same mapping more than once?

On the many code/abbr fields, personally I would like to move toward using readable codes directly and specifying the acceptable values as ENUMs, rather than referring to these small tables that contain a code and a readable name. But I think @cmgosnell might feel differently.

Then we would only need to specify the special cases by hand, like where we have owner_utility_id_eia in the ownership_eia860 table, which refers to utilities_entity_eia.utility_id_eia I wouldn't be surprised if there's also some mess in the ferc1 plants / names / original names that doesn't quite work as expected.

ezwelty commented 3 years ago

That's right, SQL databases follow foreign key relationships. Here is an example of this behavior, for a "DELETE CASCADE": https://www.db-fiddle.com/f/iu6J88Mv7JG4oUV3JsPxyf/2

In a database context, whether redundant foreign keys are explicitly named and indexed is mostly a matter of storage vs speed. For our purposes, results in a clearer (and easier to validate) representation of the relationships. In my example, keeping the generation_eia923 -> plants_entity_eia foreign key suggests that generation_eia923 can have a plant_id_eia that is in plants_entity_eia but not in generators_entity_eia. But the path to plants_entity_eia via generators_entity_eia already tells us that this cannot be the case.

It sounds like you are in favor of condensing foreign keys to a set of rules. In that case, do you have any preference on the format for the human-maintained instructions.

  1. Local to reference: compact

{ ( local_fields , ): [ reference_name, (optional:reference_fields , ) ] }

{
  ('utility_id_eia', ): ['utilities_entity_eia'],
  ('owner_utility_id_eia', ): ['utilities_entity_eia', ('utility_id_eia', )],
}
  1. Local to reference: explicit

Same as above, but clearer at the expense of longer.

[ { 'fields': [ local_fields ], 'reference': { 'resource': reference_name, 'fields': [ optional:reference_fields ] } } ]

[
  {
    'fields': ['utility_id_eia'],
    'reference': {'resource': 'utilities_entity_eia'}
  },
  {
    'fields': ['owner_utility_id_eia'],
    'reference': {'resource': 'utilities_entity_eia', 'fields': ['utility_id_eia']
  },
}
  1. Reference to locals: compact

The reverse mapping. The advantage of this approach is that the reference fields, which are always the primary key, do not need to be named explicitly (whether this is desirable is another question), and there is more control over which local resources the rule applies to.

{ ( resource_name, ( optional:reference_fields, ) ): [ [ ( local_fields, ), optional:reference_name ) ] ] }

{
  'utilities_entity_eia': [[('utility_id_eia', )], [('owner_utility_id_eia', )]]
}
  1. Reference to locals: explicit

Same as above, but clearer at the expense of longer.

[ { 'resource': resource_name, 'fields': [ optional:reference_fields ], links: [ { 'fields': [ local_fields ], 'resource': [ optional:reference_name } ] } ]

[
  {
    'resource': 'utilities_entity_eia',
    # each link applies to all matching resources, unless 'resource' is specified
    'links': [{'fields': ['utility_id_eia']}, {'fields': ['owner_utility_id_eia']}]
  },
]
ezwelty commented 3 years ago

I settled on including the rules in the raw resource metadata. For example:

"plants_ferc1": {
    "title": "FERC 1 Plants",
    "schema": {
        "fields": ["utility_id_ferc1", "plant_name_ferc1", "plant_id_pudl"],
        "primaryKey": ["utility_id_ferc1", "plant_name_ferc1"],
        "foreignKeyRules": {"fields": [
            ["utility_id_ferc1", "plant_name_ferc1"],
            ["utility_id_ferc1", "plant_name_original"]
        ]},
    },
}

A resource's foreignKeyRules (if present) determines which other resources will be assigned a foreign key (foreignKeys) to the reference's primary key: