Closed by ezwelty 3 years ago
Yeah, when I was enumerating all those foreign key relations I was thinking it really should be semi-automated. I think some version of this is probably a good idea, and it's an additional benefit of the naming conventions we've adopted.
I'm not familiar with the concept of "key chains." Is this where you have a composite key in table A that refers to fields in table B, and then some subset of that composite key in table B refers to table C? And rather than specifying foreign keys from A to B, from B to C, and from A to C, you rely on the A to B and B to C references to create an implicit A to C relationship? Is that how one typically deals with these kinds of relationships, so as not to specify the same mapping more than once?
On the many code/abbr fields, personally I would like to move toward using readable codes directly and specifying the acceptable values as ENUMs, rather than referring to these small tables that contain a code and a readable name. But I think @cmgosnell might feel differently.
Then we would only need to specify the special cases by hand, like where we have owner_utility_id_eia in the ownership_eia860 table, which refers to utilities_entity_eia.utility_id_eia.
I wouldn't be surprised if there's also some mess in the ferc1 plants / names / original names that doesn't quite work as expected.
That's right, SQL databases follow foreign key relationships. Here is an example of this behavior, for a "DELETE CASCADE": https://www.db-fiddle.com/f/iu6J88Mv7JG4oUV3JsPxyf/2
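As a local counterpart to the linked fiddle, here is a minimal sketch using Python's built-in sqlite3 module. The three tables are toy versions of plants_entity_eia, generators_entity_eia and generation_eia923 (the net_generation_mwh column and all row values are made up for illustration). It only shows that a delete, and a validity check, propagate through the chain even though generation_eia923 has no direct foreign key to plants_entity_eia.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite does not enforce foreign keys unless this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE plants_entity_eia (
        plant_id_eia INTEGER PRIMARY KEY
    );
    CREATE TABLE generators_entity_eia (
        plant_id_eia INTEGER NOT NULL
            REFERENCES plants_entity_eia (plant_id_eia) ON DELETE CASCADE,
        generator_id TEXT NOT NULL,
        PRIMARY KEY (plant_id_eia, generator_id)
    );
    -- No direct foreign key to plants_entity_eia: only the composite key.
    CREATE TABLE generation_eia923 (
        plant_id_eia INTEGER NOT NULL,
        generator_id TEXT NOT NULL,
        net_generation_mwh REAL,
        FOREIGN KEY (plant_id_eia, generator_id)
            REFERENCES generators_entity_eia (plant_id_eia, generator_id)
            ON DELETE CASCADE
    );
    INSERT INTO plants_entity_eia VALUES (1), (2);
    INSERT INTO generators_entity_eia VALUES (1, 'A'), (2, 'B');
    INSERT INTO generation_eia923 VALUES (1, 'A', 10.0), (2, 'B', 20.0);
    """
)

# Deleting a plant cascades to its generators, and from there to generation.
conn.execute("DELETE FROM plants_entity_eia WHERE plant_id_eia = 1")
remaining = conn.execute(
    "SELECT plant_id_eia, generator_id FROM generation_eia923"
).fetchall()

# A generation row whose generator does not exist is rejected.
try:
    conn.execute("INSERT INTO generation_eia923 VALUES (3, 'Z', 1.0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

After the delete, only the generation row for plant 2 is left, and the insert for the unknown generator fails.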
In a database context, whether redundant foreign keys are explicitly named and indexed is mostly a matter of storage vs. speed. For our purposes, omitting them results in a clearer (and easier to validate) representation of the relationships. In my example, keeping the generation_eia923 -> plants_entity_eia foreign key suggests that generation_eia923 can have a plant_id_eia that is in plants_entity_eia but not in generators_entity_eia. But the path to plants_entity_eia via generators_entity_eia already tells us that this cannot be the case.
- generation_eia923.(plant_id_eia, generator_id) -> generators_entity_eia.(plant_id_eia, generator_id)
- generators_entity_eia.(plant_id_eia) -> plants_entity_eia.(plant_id_eia)
- generation_eia923.(plant_id_eia) -> plants_entity_eia.(plant_id_eia)

It sounds like you are in favor of condensing foreign keys to a set of rules. In that case, do you have any preference on the format for the human-maintained instructions?
{ ( local_fields , ): [ reference_name, (optional:reference_fields , ) ] }
{
    ('utility_id_eia', ): ['utilities_entity_eia'],
    ('owner_utility_id_eia', ): ['utilities_entity_eia', ('utility_id_eia', )],
}
Same as above, but clearer at the expense of length.
[ { 'fields': [ local_fields ], 'reference': { 'resource': reference_name, 'fields': [ optional:reference_fields ] } } ]
[
    {
        'fields': ['utility_id_eia'],
        'reference': {'resource': 'utilities_entity_eia'}
    },
    {
        'fields': ['owner_utility_id_eia'],
        'reference': {'resource': 'utilities_entity_eia', 'fields': ['utility_id_eia']}
    },
]
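The two formats above carry the same information, so the compact one could be maintained by hand and expanded mechanically. A minimal sketch of that expansion follows; expand_rules is a hypothetical helper name, not something that exists in the codebase.

```python
from typing import Any, Dict, List, Sequence, Tuple


def expand_rules(
    compact: Dict[Tuple[str, ...], Sequence[Any]]
) -> List[Dict[str, Any]]:
    """Expand {(local_fields,): [reference_name, (reference_fields,)]} rules
    into the longer list-of-dicts format.

    Hypothetical helper. The reference fields are optional: when they are
    omitted, the 'fields' entry is left out of the 'reference' dict.
    """
    expanded = []
    for local_fields, target in compact.items():
        reference: Dict[str, Any] = {'resource': target[0]}
        if len(target) > 1:
            reference['fields'] = list(target[1])
        expanded.append({'fields': list(local_fields), 'reference': reference})
    return expanded


compact = {
    ('utility_id_eia', ): ['utilities_entity_eia'],
    ('owner_utility_id_eia', ): ['utilities_entity_eia', ('utility_id_eia', )],
}
verbose = expand_rules(compact)
```

Here verbose holds the same two rules as the longer example, in the same order.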
The reverse mapping. The advantage of this approach is that the reference fields, which are always the primary key, do not need to be named explicitly (whether this is desirable is another question), and there is more control over which local resources the rule applies to.
{ ( resource_name, ( optional:reference_fields, ) ): [ [ ( local_fields, ), optional:reference_name ] ] }
{
    'utilities_entity_eia': [[('utility_id_eia', )], [('owner_utility_id_eia', )]]
}
Same as above, but clearer at the expense of length.
[ { 'resource': resource_name, 'fields': [ optional:reference_fields ], 'links': [ { 'fields': [ local_fields ], 'resource': optional:reference_name } ] } ]
[
    {
        'resource': 'utilities_entity_eia',
        # each link applies to all matching resources, unless 'resource' is specified
        'links': [{'fields': ['utility_id_eia']}, {'fields': ['owner_utility_id_eia']}]
    },
]
I settled on including the rules in the raw resource metadata. For example:
"plants_ferc1": {
    "title": "FERC 1 Plants",
    "schema": {
        "fields": ["utility_id_ferc1", "plant_name_ferc1", "plant_id_pudl"],
        "primaryKey": ["utility_id_ferc1", "plant_name_ferc1"],
        "foreignKeyRules": {"fields": [
            ["utility_id_ferc1", "plant_name_ferc1"],
            ["utility_id_ferc1", "plant_name_original"]
        ]},
    },
}
A resource's foreignKeyRules (if present) determines which other resources will be assigned a foreign key (foreignKeys) to the reference's primary key:

- fields (List[List[str]]): Sets of field names for which to create a foreign key. These are assumed to match the order of the reference's primary key fields.
- exclude (Optional[List[str]]): Names of resources to exclude.
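To make the rule semantics concrete, here is a minimal sketch of how foreignKeyRules could be turned into foreignKeys entries. It is not the actual implementation: build_foreign_keys is a hypothetical name, the fuel_ferc1 and utilities_ferc1 field lists are abridged for illustration, and two behaviors are assumptions (a rule matches a resource when the resource contains every field in the set, and a resource never gets a rule-generated key to itself).

```python
from typing import Any, Dict, List


def build_foreign_keys(resources: Dict[str, Dict[str, Any]]) -> Dict[str, List[dict]]:
    """Assign foreignKeys to every resource matched by another resource's rules.

    Hypothetical sketch of the foreignKeyRules semantics described above.
    """
    foreign_keys: Dict[str, List[dict]] = {name: [] for name in resources}
    for ref_name, ref in resources.items():
        rules = ref["schema"].get("foreignKeyRules")
        if not rules:
            continue
        excluded = set(rules.get("exclude") or [])
        primary_key = list(ref["schema"]["primaryKey"])
        for local_fields in rules["fields"]:
            for name, resource in resources.items():
                # Assumption: skip the reference itself and excluded resources.
                if name == ref_name or name in excluded:
                    continue
                # Assumption: a rule applies when all of its fields are present.
                if set(local_fields) <= set(resource["schema"]["fields"]):
                    foreign_keys[name].append({
                        "fields": list(local_fields),
                        "reference": {"resource": ref_name, "fields": primary_key},
                    })
    return foreign_keys


resources = {
    "plants_ferc1": {"schema": {
        "fields": ["utility_id_ferc1", "plant_name_ferc1", "plant_id_pudl"],
        "primaryKey": ["utility_id_ferc1", "plant_name_ferc1"],
        "foreignKeyRules": {"fields": [
            ["utility_id_ferc1", "plant_name_ferc1"],
            ["utility_id_ferc1", "plant_name_original"],
        ]},
    }},
    # Abridged, illustrative field lists.
    "fuel_ferc1": {"schema": {
        "fields": ["record_id", "utility_id_ferc1", "plant_name_ferc1"],
        "primaryKey": ["record_id"],
    }},
    "utilities_ferc1": {"schema": {
        "fields": ["utility_id_ferc1", "utility_name_ferc1"],
        "primaryKey": ["utility_id_ferc1"],
    }},
}
foreign_keys = build_foreign_keys(resources)
```

In this toy setup only fuel_ferc1 receives a key, because it is the only other resource containing both utility_id_ferc1 and plant_name_ferc1.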
Originally posted by @zaneselvans in https://github.com/catalyst-cooperative/pudl/issues/846#issuecomment-742178851
It seems risky to populate all foreign key relationships manually. Leveraging the fact that local fields of the same name currently all map to the same reference primary key (probably a good convention to maintain), we can succinctly express foreign keys as a 1:1 mapping between local fields and the reference primary key (which typically has the same names as the local fields):
<local_fields> => <reference_name>.(reference_fields: optional)

- energy_source_eia923.(abbr)
- fuel_type_eia923.(abbr)
- fuel_type_eia923.(abbr)
- fuel_type_aer_eia923.(abbr)
- ferc_depreciation_lines
- coalmine_eia923
- plants_entity_eia
- generators_entity_eia
- plants_pudl
- plants_ferc1
- plants_ferc1.(plant_name_ferc1, utility_id_ferc1)
- transport_modes_eia923.(abbr)
- prime_movers_eia923.(abbr)
- regions_entity_epaipm.(region_id_epaipm)
- regions_entity_epaipm.(region_id_epaipm)
- regions_entity_epaipm
- regions_entity_epaipm.(region_id_epaipm)
- transport_modes_eia923.(abbr)
- utilities_entity_eia
- utilities_ferc1
- utilities_pudl
Based on these rules, we can generate all candidate foreign keys, resolve key chains, and prune redundant keys. For example, for resource generation_eia923, the candidates:

- [0] generation_eia923.(plant_id_eia, generator_id) -> generators_entity_eia.(plant_id_eia, generator_id)
- [1] generation_eia923.(plant_id_eia) -> plants_entity_eia.(plant_id_eia)

Resolve to:

- [0] generation_eia923.(plant_id_eia, generator_id) -> generators_entity_eia.(plant_id_eia, generator_id), followed by generators_entity_eia.(plant_id_eia) -> plants_entity_eia.(plant_id_eia)
- [1] generation_eia923.(plant_id_eia) -> plants_entity_eia.(plant_id_eia)

We prune key [1] because it is a subset of key [0], so we finally have:

- generation_eia923.(plant_id_eia, generator_id) -> generators_entity_eia.(plant_id_eia, generator_id)
Using this technique, we find all existing foreign keys, as well as several that were missed (see below). Are these all valid?
Are you interested in my using this technique to populate foreign keys automatically? A compromise would be a helper method that reveals potential missing keys from a succinct mapping, but requires a human to type the foreign keys into the metadata.
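For reference, the resolve-and-prune step described above can be sketched as follows. This is an illustration under stated assumptions, not the proposed implementation: ForeignKey, implied_keys and prune_redundant are hypothetical names, and the foreign key graph is assumed to be acyclic. A key is treated as redundant when it can be reconstructed by following another key of the same resource through its reference's own keys.

```python
from typing import Dict, List, NamedTuple, Set, Tuple


class ForeignKey(NamedTuple):
    resource: str
    fields: Tuple[str, ...]
    reference: str
    reference_fields: Tuple[str, ...]


def implied_keys(key: ForeignKey, by_resource: Dict[str, List[ForeignKey]]) -> Set[ForeignKey]:
    """Return the keys of key.resource implied by chaining through key.reference."""
    implied: Set[ForeignKey] = set()
    stack = [key]
    while stack:
        current = stack.pop()
        # Map each reference field back to the local field that points at it.
        to_local = dict(zip(current.reference_fields, current.fields))
        for nxt in by_resource.get(current.reference, []):
            if all(field in to_local for field in nxt.fields):
                composed = ForeignKey(
                    key.resource,
                    tuple(to_local[field] for field in nxt.fields),
                    nxt.reference,
                    nxt.reference_fields,
                )
                if composed != key and composed not in implied:
                    implied.add(composed)
                    stack.append(composed)
    return implied


def prune_redundant(keys: List[ForeignKey]) -> List[ForeignKey]:
    """Drop every key that is implied by a chain starting at another key."""
    by_resource: Dict[str, List[ForeignKey]] = {}
    for key in keys:
        by_resource.setdefault(key.resource, []).append(key)
    redundant: Set[ForeignKey] = set()
    for key in keys:
        redundant |= implied_keys(key, by_resource)
    return [key for key in keys if key not in redundant]


candidates = [
    ForeignKey("generation_eia923", ("plant_id_eia", "generator_id"),
               "generators_entity_eia", ("plant_id_eia", "generator_id")),
    ForeignKey("generation_eia923", ("plant_id_eia",),
               "plants_entity_eia", ("plant_id_eia",)),
    ForeignKey("generators_entity_eia", ("plant_id_eia",),
               "plants_entity_eia", ("plant_id_eia",)),
]
pruned = prune_redundant(candidates)
```

With these three candidates, the direct generation_eia923 -> plants_entity_eia key is dropped and the other two are kept. The compromise of a helper that only reports missing keys could reuse the same functions and print the difference against the hand-written metadata instead of writing it.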