ezwelty closed 2 years ago
I almost certainly typed most of these back in the day.
state will need to be reorganized in some cases I think, and the enums may need to be customized by field? For the US political subdivisions that have a 2-letter abbreviation, maybe we call these columns something like us_state_territory? That could include DC, Guam, Virgin Islands, Puerto Rico, etc. For coalmine_eia923 I think we probably want to split it into two columns: one which is the 3-letter ISO country abbreviation, and another which is the US state or territory abbreviation. For the purposes of column value constraints, I suspect that we will be okay with allowing all of the 2-letter abbreviations of US political subdivisions. However, there are some iteration cases (e.g. in the CEMS) where we want to iterate over only a subset of those abbreviations, and that subset will need to be defined elsewhere. Oh, also I think there are some places where Canadian provinces have snuck in, and at this point I've probably replaced them all with CAN, but if we have state-or-territory and also country... maybe we need to think about this more...
ferc1 mostly all refer to the same thing, but they may be written differently on the different pages of the blank PDF, which is where these descriptions would have come from. We will need to review and make sure they actually mean the same thing between, say, the steam and small plants tables, but I think in most cases they will.
city really is referring to two different things: where a plant is vs. the city where a utility's offices / mailing address are. So these should maybe get renamed like... plant_city and utility_city or something like that?
fuel_type is fraught. There are many, many different categorizations. They should probably all indicate where they came from in the column name, e.g. fuel_type_eia or fuel_type_ferc1 or fuel_type_pudl.
generator_id really is all the same, but it should probably become generator_id_eia, and all the descriptions should be the same. They are hella not numbers. Definitely strings. But lots of them look like numbers, and this fucks up the data types all the time when you only read a subset of them.
line_id looks weird.
mine_id_pudl should definitely be the same. Probably should combine the two descriptions: yes, it's a surrogate key that we made, and also it identifies a mine.
plant_*_ferc1: there's some now-abandoned table cleanup that was being applied to the small plants table, which we may need to remediate.
plant_type has different valid values in different plant tables. Might need to make more specific names, or combine all of the valid values across the tables into a single enum and a single more general plant type description.
prime_mover_code is all the same I think. We have also talked about replacing many of the inscrutable short codes with longer snake_case descriptions that are actually readable, and using an ENUM to constrain them.
state descriptions are referring to different things. Probably need to rename some columns: utility/operator state vs. plant state.
street_address: ditto city and state.
utility_id_eia should all be the same, though only one of those descriptions would make sense in all contexts. There are cases in which there's an owner vs. operator distinction, and "utility" is kind of the generic name we've used when it doesn't matter.
zip_code also needs to differentiate plant vs. corporate entity.
After reading through all of these, I agree w/ a lot of what zane said. Outside of the enums, the columns in here that seem like they need to be broken out are:
I have apprehension about breaking out fuel_type into fuel_type_eia and fuel_type_ferc1 because we have conveniently squished them to be comparable/merge-able columns, but they are generated in very different ways.
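One way to keep the columns merge-able while still recording where each value came from might be to carry a harmonized fuel_type_pudl column alongside the source-specific one. The sketch below is only an illustration of that idea: the two mapping dicts and the add_pudl_fuel_type helper are hypothetical, and the handful of codes shown are placeholders rather than the real category lists.

```python
# Illustrative mappings only (hypothetical); the real category lists are
# longer and are not reproduced in this thread.
EIA_TO_PUDL = {"BIT": "coal", "NG": "gas"}
FERC1_TO_PUDL = {"coal": "coal", "natural gas": "gas"}


def add_pudl_fuel_type(record, source_column, mapping):
    """Return a copy of record with a harmonized fuel_type_pudl value.

    The source-specific column is left untouched, so provenance is kept.
    Values missing from the mapping become None instead of raising.
    """
    out = dict(record)
    out["fuel_type_pudl"] = mapping.get(record[source_column])
    return out


eia_row = add_pudl_fuel_type(
    {"plant": "A", "fuel_type_eia": "NG"}, "fuel_type_eia", EIA_TO_PUDL
)
ferc1_row = add_pudl_fuel_type(
    {"plant": "A", "fuel_type_ferc1": "natural gas"},
    "fuel_type_ferc1",
    FERC1_TO_PUDL,
)
```

Both rows end up with the same fuel_type_pudl value, so they can still be compared or merged on that column, while fuel_type_eia and fuel_type_ferc1 keep the original reported values.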
All the unresolved fields listed above have been added to pudl.metadata.fields with only a name and a type. You can fill in their description and constraints.enum (or rename them) as you decide how to resolve them. As needed, you can override the default field metadata for a particular resource in the resource's metadata in pudl.metadata.resources:
    'resource': {
        'schema': {
            'fields': ['default_field', {'name': 'custom_field', 'description': 'Custom description'}]
        }
    }
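To make the override behavior concrete, here is a minimal sketch of how a resource's field list could be expanded against the defaults. It is not the actual PUDL implementation: DEFAULT_FIELDS and resolve_fields are hypothetical names, and the sketch assumes a bare string means "use the default as-is" while a dict is layered over the default for that name.

```python
from copy import deepcopy

# Hypothetical stand-in for the default field metadata; the real
# contents of pudl.metadata.fields are not shown in this thread.
DEFAULT_FIELDS = {
    "default_field": {"type": "string", "description": "Default description"},
    "custom_field": {"type": "string", "description": "Default description"},
}


def resolve_fields(fields, defaults):
    """Expand a resource's field list into full field descriptors.

    A bare string looks up the default metadata for that name.
    A dict is layered over the default for its name, key by key.
    """
    resolved = []
    for entry in fields:
        if isinstance(entry, str):
            name, override = entry, {}
        else:
            name, override = entry["name"], entry
        # Copy so that overrides never mutate the shared defaults.
        field = deepcopy(defaults.get(name, {}))
        field.update(override)
        field["name"] = name
        resolved.append(field)
    return resolved


resource = {
    "schema": {
        "fields": [
            "default_field",
            {"name": "custom_field", "description": "Custom description"},
        ]
    }
}

resolved = resolve_fields(resource["schema"]["fields"], DEFAULT_FIELDS)
```

Note that the update here is shallow, so an override that supplies constraints would replace the default constraints wholesale rather than merging them. Whether that is the desired behavior is a design choice this sketch does not settle.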
A review of the field metadata in src/pudl/package_data/meta/datapkg/datapackage.json reveals the following inconsistencies in constraints.enum and description for fields of the same name across different resources.
Does additional field metadata exist elsewhere than src/pudl/package_data/meta/datapkg/datapackage.json?
Our task is to choose one standard attribute value where possible, rename the field in some resources as needed, and determine whether field metadata can be universal or needs to be customizable by resource (e.g. service_territory_eia861, respondent_id_ferc714).
Fields are removed from the tables below as they are resolved: