Hmm, @ezwelty I'm stuck on how to add `good_codes`, `fix_codes`, and `bad_codes` to the Pydantic metadata model. I added these fields in the `Resource` class definition:

```python
good_codes: List[String] = []
bad_codes: List[String] = []
fix_codes: Dict[String, String] = {}
```
And then also added these root validators to check that the relationships between those field values are as expected:

```python
@pydantic.root_validator
def _check_no_overlapping_codes(cls, values):  # noqa: N805
    """Verify there's no overlap between good, fixable, and bad codes."""
    if (
        set(values["good_codes"]).intersection(values["bad_codes"])
        or set(values["good_codes"]).intersection(values["fix_codes"])
        or set(values["fix_codes"]).intersection(values["bad_codes"])
    ):
        raise ValueError(
            "Overlap found between good, bad, and fixable codes in\n"
            f"{values['name']}. These must all be disjoint sets:\n"
            f"{values['good_codes']=} \n"
            f"{values['bad_codes']=} \n"
            f"fix_codes.keys()={list(values['fix_codes'].keys())} \n"
        )

@pydantic.root_validator
def _check_fixed_codes_are_good_codes(cls, values):  # noqa: N805
    """Check that every fixed code is also one of the good codes."""
    if set(values["fix_codes"].values()).difference(values["good_codes"]):
        raise ValueError(
            "Some fixed codes aren't in the list of good codes:\n"
            f"{values['good_codes']=} \n"
            f"fix_codes.values()={list(values['fix_codes'].values())} \n"
        )
```
But now I'm getting an error when I try to re-import pudl:
```
~/code/catalyst/pudl/src/pudl/metadata/classes.py in from_id(cls, x)
    918     def from_id(cls, x: str) -> "Resource":
    919         """Construct from PUDL identifier (`resource.name`)."""
--> 920         return cls(**cls.dict_from_id(x))
    921
    922     def to_sql(

~/miniconda3/envs/pudl-dev/lib/python3.9/site-packages/pydantic/main.cpython-39-x86_64-linux-gnu.so in pydantic.main.BaseModel.__init__()

ValidationError: 1 validation error for Resource
__root__
  'NoneType' object is not subscriptable (type=type_error)
```
It seems like the default values of `good_codes`, `fix_codes`, and `bad_codes` aren't being used when those elements aren't found in the resource metadata? Not sure what I'm doing wrong here. I tried assigning the default (empty) values inside the `dict_from_id()` method when no value was passed in, but it didn't seem to make any difference.

But anyway, my lack of understanding of exactly how to make this work in Pydantic aside, does this generally seem like a good approach @ezwelty?
@zaneselvans Pydantic validators need to return the value they are validating (and/or modifying). You didn't return a value, hence the `__root__` being None. I fixed this in a commit.
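For reference, a minimal sketch of the corrected pattern (field names are illustrative, not the actual `Resource` model):

```python
from typing import Dict, List

import pydantic


class Example(pydantic.BaseModel):
    good_codes: List[str] = []
    fix_codes: Dict[str, str] = {}

    @pydantic.root_validator
    def _check_fixed_codes_are_good_codes(cls, values):  # noqa: N805
        if set(values["fix_codes"].values()) - set(values["good_codes"]):
            raise ValueError("some fixed codes aren't in the list of good codes")
        # Without this return, pydantic ends up with None in place of the model's
        # values, which is what produces the "'NoneType' object is not
        # subscriptable" error above.
        return values
```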
A few thoughts:
- Consider `pd.DataFrame.convert_dtypes()` instead of `pd.DataFrame.astype(...)` if your goal is just to get column types that support `pd.NA`. Or better, their specific column types can be set as needed downstream based on their metadata description.
- Why have `bad_codes`? I presume this is so that you can choose which ones to ignore and address new bad codes as they crop up?
- Drop `good_codes` since this is simply `df['code']`, rename `fix_codes` to `fixes` (or `code_fixes`) since it is an attribute, not a function, and rename `bad_codes` to `ignored` to clarify how these values are handled ("bad" is a bit vague). This could be part of `Resource`, but since it has to do with transforming the values of a single column, I suspect it will fit better in `Field`. Then `Resource.dict_from_id()` would populate `fields[*].encoder` based on foreign keys or whatever logic we choose, and use e.g. `Resource.encode(df)` to apply all `fields[*].encoder.encode()`. The metadata for this could be stored in `RESOURCE_METADATA` under `resource.encoder` or even `resource.foreign_key_rules.encoder`. You could avoid imports of static dataframes in `RESOURCE_METADATA` and instead look them up as `resource.name.upper()` – or is that too much magic?

```python
class Encoder(Base):
    df: pd.DataFrame
    fixes: Dict[String, String] = {}
    ignored: List[String] = []

    # Check df (columns and dtypes)
    # Check uniqueness of df['code'], ignored
    # Check intersections of df['code'], fixes, ignored
    # ...

    @property
    def code_map(self) -> Dict[str, Union[str, pd.NA]]:
        # ...

    def encode(self, ds: pd.Series) -> pd.Series:
        # ...


class Field(Base):
    # Field whose values are to be encoded
    # ...
    encoder: Encoder = None


class Resource(Base):
    # Resource which represents a static dataframe of codes
    # ...
    encoder: Encoder = None
```

Note: using `pd.DataFrame` as a type in Pydantic may require setting `Base.Config.arbitrary_types_allowed = True`.
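A minimal sketch of that config setting (the real `Base` class in `pudl.metadata.classes` has more to it):

```python
import pandas as pd
import pydantic


class Base(pydantic.BaseModel):
    """Illustrative base model; the real PUDL Base defines more behavior."""

    class Config:
        arbitrary_types_allowed = True  # lets fields be typed as pd.DataFrame


class Encoder(Base):
    df: pd.DataFrame
```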
Oh duh. Return values. I thought it was probably something simple. I probably stayed up too late.

- I haven't used `df.convert_dtypes()` before though. I think usually we're just trying to get NA compatible non-object types wherever possible.
- We need `bad_codes` because we don't have control over the input data, and they revise it and the schemas applied to it without any warning or documentation. If they change the set of codes that's being used (e.g. going from `WA` to `WT` for water, or `CV` to `TC` for "tramways and conveyors", or they add a new `OP` code for "onsite production" -- all of which have happened) or a new unrecognized code that's actually fixable shows up in a new year of data, we want to be alerted to the change, and have the opportunity to update the metadata or develop a fix. At least in the EIA data, the number of bad codes has been very manageable so far, so this seems like a reasonable approach. Unfortunately FERC isn't so good.
- I tried adding a `static_df` field to the `Resource` model for the static resources, so that this data could be accessed through the classes rather than directly, but ran into the `arbitrary_types_allowed` issue and wasn't sure if this was really the right approach. Do you think it makes sense to store the static data directly inside the Resource so it can be accessed throughout the ETL process before we have a database to work with? All these static tables should be small.
- Having the `Encoder` be associated with the `Field` it encodes rather than the whole `Resource` makes sense. However, every static table has the same field names and many of them exist within the same (`eia`) data group. I'm not seeing a way to override them on a per-table basis right now so that each of the `code` columns can have its own distinct `Encoder`. I guess I can just replicate the pattern of the group-level overrides but at the resource level.
- I had imagined that the `Encoder` would be associated with the home `Field` that stores the codes, rather than having a separate `Encoder` associated with every field that needs to be encoded, but above it seems like you're imagining the encoder being attached to each field that might be encoded? I guess in the problem I've been trying to solve these codes are foreign key values, so the information you need to have to find the valid codes is only available at the resource level, where the schema defines the foreign key relationship. But the same process could be applied to an `ENUM` constrained field, if it also had a list of `code_fixes` and `ignored_codes`, and in that case you'd want it to be attached directly to the field that's going to get cleaned up. But in the FK case we only want to define this information in one place, and the only obvious unique place to do that is in the home table/column for the good codes. Hmm.

Well, I kind of got it working, but it seems messy. Would definitely appreciate your thoughts on the current version again @ezwelty.
- The `Encoder` model itself seems good.
- Adding resource-level overrides via `FIELD_METADATA_BY_RESOURCE` is okay. Does this seem clearly better than associating the encoder information with the resource definition?
- In the `Resource.encode()` method I feel like I must not understand how to access the data structures correctly. Needing to iterate over lists of fields or resources to find the one with a particular name, and converting these single-element lists into the standalone value feels harder to read than it needs to be. Is there a better way to access named elements that are stored in a list?
- Having to keep track of whether I've just got the dictionary/list/string version of an object, or the instantiated version of it feels kludgy.
- Would it make sense to associate an `Encoder` with every column that needs to be encoded, rather than just with the column that has all of the information about the encoding? That would move a lot of the logic that's in `Resource.encode()` into `Resource.dict_from_id()`. Would it be bad to have lots of different identical encoders floating around attached to different columns? Then using a resource to encode a dataframe could just be iterating over all of the columns that appear in both of them and calling the encoder if it's not None.
- How much checking should we do to make sure that the DataFrame that's been passed to get encoded really corresponds to the table that's being described by the Resource?
- Another issue that came up here is that currently the `fuel_transportation_modes_eia` table uses `label` as its primary key, and the values in the dataframes/database are the human readable labels, not the original short codes. While the `encoder` is attached to the `code` columns. So purely using the foreign key relationship to choose what reference column to look at might be too restrictive. The original codes are the things that need to be fixed. If we want to use longer human readable labels, then after they get fixed we need to map from the original codes to the labels, and the `label` column would be the primary key / foreign key column.

@zaneselvans There are two distinct concepts at play here:
1. Storing a static dataframe on a `Resource` – e.g. as a `df` attribute of `Resource`, or based on a naming convention in `Resource.from_id()`. But we don't need that for anything yet, and it seems secondary to the main design challenge here, which is:
2. Encoding the values of a `Field`. I would still have the `Encoder` be associated with the `Field` it encodes, rather than with the Resource or Field to which the foreign key points. That way the encoding becomes just one type of transformation/validation, and `Resource.encode()` (`Field.encode()`) can raise errors that are resource/field-aware. That said, the structure of the underlying metadata need not match the classes that perform the metadata-driven operation (e.g. like the metadata `{reference_resource}.foreign_key_rules` that becomes `Resource.Schema.foreign_keys`). In the metadata, I suspect that `code_fixes` and `ignored_codes` are most easily maintained alongside the static dataframe listing the codes and their descriptions.

So maybe `metadata.dfs` is too general, and we can focus on organizing the metadata for static code tables specifically. Lots of possible approaches, but one would be a `metadata.codes` with entries like:
```python
CONTRACT_TYPES_EIA = pd.DataFrame(...)
CONTRACT_TYPES_EIA_ENCODING = {
    'fixes': {...},
    'ignored': [...]
}

# or

CONTRACT_TYPES_EIA = {
    'df': pd.DataFrame(...),
    'fixes': {...},
    'ignored': [...]
}
```
And then in `metadata.classes`:

```python
class Encoder(Base):
    # ...
    def dict_from_id(cls, name):
        # Build the arguments to Encoder from the name of a PUDL code table.

class Package(Base):
    # ...
    def resources_from_id(cls, name):
        # ...
        # Build encoders for each resource that is a PUDL code table
        # Assign each encoder to the resource fields to be encoded
```
Something to keep in mind – are the encoding rules the same for all the columns pointing to a PUDL code table, or are they different and incompatible between resource groups, or between resources (e.g. `{'A': 'B'}` for `eia` columns but `{'A': 'C'}` for `glue` columns)? If so, we might need a more complex metadata design.
And we should keep in mind that there may be other transform steps that will become formalized as metadata in the future. Then `Encoder` would become one of several transformers that can be assigned to a Field (or a Resource, or a Package) ...
> - The `Encoder` model itself seems good.

A few suggestions for `Encoder`:

- Add `arbitrary_types_allowed = True` to `Base`.
- Add a `@pydantic.validator('df')` to `Encoder` to first check the columns of `df`.
- Change `_check_fixed_codes_are_good_codes()` to a `@pydantic.validator('code_fixes')` with arguments `(cls, value, values)`, where `values` gives you access to any already-validated attributes (here, `df`). The check should only run if `'df' in values` (true only if `df` was valid), otherwise the validator will fail for reasons other than an invalid `code_fixes`. See https://pydantic-docs.helpmanual.io/usage/validators. Consider this for `_check_no_overlapping_codes()`, which you may want to break up into individual validators: one for `code_fixes` (combine with above) and one for `ignored_codes`. The code is cleaner and results in clearer validation reports. (There's a rough sketch of this pattern after this list.)
- In `Encoder.encode()`, consider adding a `dtype=` argument and not type-casting by default, so that we can type-cast directly to `field.get_pandas_dtype()` from calling methods.
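A rough sketch of what those field-level validators could look like (a sketch only, assuming `df`, `code_fixes`, and `ignored_codes` attributes as discussed above; not the final implementation):

```python
from typing import Dict, List

import pandas as pd
import pydantic


class Encoder(pydantic.BaseModel):
    """Illustrative only; the real class would subclass PUDL's Base."""

    df: pd.DataFrame
    code_fixes: Dict[str, str] = {}
    ignored_codes: List[str] = []

    class Config:
        arbitrary_types_allowed = True

    @pydantic.validator("df")
    def _df_has_code_column(cls, df):  # noqa: N805
        if "code" not in df.columns:
            raise ValueError("encoder dataframe must have a 'code' column")
        return df

    @pydantic.validator("code_fixes")
    def _fixed_codes_are_good_codes(cls, code_fixes, values):  # noqa: N805
        # Only run the check if df itself validated; otherwise it won't be in values.
        if "df" in values:
            unknown = set(code_fixes.values()) - set(values["df"]["code"])
            if unknown:
                raise ValueError(f"some fixed codes aren't in df['code']: {unknown}")
        return code_fixes
```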
> - Adding resource-level overrides via `FIELD_METADATA_BY_RESOURCE` is okay. Does this seem clearly better than associating the encoder information with the resource definition?

My understanding was that the encoding logic (i.e. `code_fixes` and `ignored_codes`) is the same for all encoded fields, regardless of which resource they are in? If so, in the metadata, it definitely makes more sense to me to place these alongside the code table (see above).
> - In the `Resource.encode()` method I feel like I must not understand how to access the data structures correctly. Needing to iterate over lists of fields or resources to find the one with a particular name, and converting these single-element lists into the standalone value feels harder to read than it needs to be. Is there a better way to access named elements that are stored in a list?

If the encoder is assigned to the field, `Resource.encode()` reduces to ~`field.encoder.encode(df[field.name]) for field in resource.fields if field.encoder`. But to answer your question... `Resource.fields` and `Package.resources` could be converted from lists to dicts, but that means repeating the name as both the key and an attribute (`{'plants_eia': {'name': 'plants_eia', ...}, ...}`) and making the order less explicit (which maybe is okay). Or we can add simple helpers `Resource.get_field(name)` and `Package.get_resource(name)` to do what you seek.
```python
def get_field(self, name):
    names = [field.name for field in self.fields]
    return self.fields[names.index(name)]
```
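Presumably a `Package.get_resource(name)` helper would follow the same pattern:

```python
def get_resource(self, name):
    names = [resource.name for resource in self.resources]
    return self.resources[names.index(name)]
```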
> - Having to keep track of whether I've just got the dictionary/list/string version of an object, or the instantiated version of it feels kludgy.

Not sure I follow. All/most of the logic that bridges the metadata and the instantiated classes should be packed into the `*dict_from_id()` methods (and the `*.from_id()` methods that rely on them).
> - Would it be bad to have lots of different identical encoders floating around attached to different columns?

At best, they would be the same object when instantiated and assigned in `Package.from_resource_ids()` (see my example above). Arguably, anything related to foreign keys is best constructed there, and not in `Resource.dict_from_id()`, because foreign keys don't make sense unless the referenced resource is also present...
> - How much checking should we do to make sure that the DataFrame that's been passed to get encoded really corresponds to the table that's being described by the Resource?

Do you mean `Encoder.df`? Then see my comments about that class. Otherwise, it seems like all we need to worry about right now is whether the column to be encoded encodes without error?
> - Another issue that came up here is that currently the `fuel_transportation_modes_eia` table uses `label` as its primary key, and the values in the dataframes/database are the human readable labels, not the original short codes. While the `encoder` is attached to the `code` columns. So purely using the foreign key relationship to choose what reference column to look at might be too restrictive. The original codes are the things that need to be fixed. If we want to use longer human readable labels then after they get fixed we need to map from the original codes to the labels, and the `label` column would be the primary key / foreign key column.

Would that not be the case for all tables referencing a code table? My understanding was that `code` is the (perhaps cleaned-up) original codes and `label` (or `pudl_code`, etc.) is the more readable code actually used in the PUDL data. If `label` is not used in the data instead of `code`, what purpose does it serve?
All of the validations of codes depend on more than one attribute of the class. Why wouldn't we want to use a `root_validator` in that situation? If we skip the validation check when one of the other attributes hasn't been validated yet (and so isn't in the `values` dictionary yet), how do we know that the validation will ever happen?

It seemed like validation that depends on more than one attribute is what the root validator was being used for in their examples.
If it only makes sense to assign Encoders to the fields in the context of creating a whole package, what's the right way to use it in the context of a transform function? If we're inside the transform function for `generation_fuel_eia923`, then we know that that table is destined to exist in the database, and we know that the static tables defining the codes for some of its columns will exist (because they're static, and they always exist), but we don't necessarily know all of the other tables that will exist in the output DB. We can create a full Package that reflects all of the possible tables, but I wonder if the mismatch between that Package and the tables that are actually going through the ETL process will create unexpected problems.
Also what if we can't simultaneously encode all of the columns within a table -- like if we're using information from the `fuel_type_aer` column to fill in missing values in the `energy_source_eia` column, there might be a dependency between the columns -- some cleanup has to happen in one before cleanup can happen in another.
If we associate Encoders with columns when `Package.from_resource_ids()` is called, and we iterate over all of the fields (or foreign key relationships) for each resource, how do we identify which FK relations mean we need to add an encoder, and how do we construct that encoder? It seems like it needs to be connected to the metadata somehow so we can look up the actual data associated with the encoding table.
From looking at a FK / the metadata, I can see that `energy_source_1` is constrained by `energy_sources_eia.code`, but then I need to be able to look up the actual `df` and `code_fixes` and `ignored_codes` with which to create the Encoder that will be associated with the Field for later use.
Your last two points might be better worked through on a call. For the first:

> All of the validations of codes depend on more than one attribute of the class. Why wouldn't we want to use a root_validator in that situation? If we skip the validation check when one of the other attributes hasn't been validated yet (and so isn't in the values dictionary yet), how do we know that the validation will ever happen?

Well, the first Pydantic example uses a `@validator('password2')` to check that it matches `password`: https://pydantic-docs.helpmanual.io/usage/validators/#pre-and-per-item-validators. A `@root_validator` would also work here, but I find `@validator` clearer when an attribute is dependent on others that need to be validated first. Plus the error report is labeled with attribute names, rather than `__root__`.
```python
from pydantic import BaseModel, validator, root_validator, StrictStr

class Example(BaseModel):
    name: StrictStr
    name2: StrictStr
    name3: StrictStr

    @validator('name2')
    def names_match(cls, v, values):
        if v != values['name']:
            raise ValueError('does not match name')
        return v

    @validator('name3')
    def names_match_again(cls, v, values):
        if v != values['name']:
            raise ValueError('does not match name')
        return v

Example(name='ethan', name2='zane', name3='otto')
```

```
ValidationError: 2 validation errors for Example
name2
  does not match name (type=value_error)
name3
  does not match name (type=value_error)
```
```python
class Example(BaseModel):
    name: StrictStr
    name2: StrictStr
    name3: StrictStr

    @root_validator
    def names_match(cls, values):
        if values['name'] != values['name2']:
            raise ValueError('name 2 does not match name')
        return values

    @root_validator
    def names_match_again(cls, values):
        if values['name'] != values['name3']:
            raise ValueError('name 3 does not match name')
        return values

Example(name='ethan', name2='zane', name3='otto')
```

```
ValidationError: 2 validation errors for Example
__root__
  name 2 does not match name (type=value_error)
__root__
  name 3 does not match name (type=value_error)
```
What is definitely important is checking for an attribute in `values` before accessing it. Otherwise, `pydantic` crashes and you get nothing. So with either of my examples above, if `name` is not a `StrictStr`:

```python
Example(name=123, name2='zane', name3='otto')
```

```
KeyError: 'name'
```
Doh!
I ended up making it possible to encode either a whole dataframe or just a single column at a time, in the event that there are dependencies between columns that need to be cleaned up.
On trying to figure out how to look up the encoding table for a given column, I ended up associating Encoders with both Resource/Table level metadata and Field/Column level metadata. If there's an Encoder attached to a Resource, then it's a coding table. If it's attached to a Field/Column, then it's the Encoder that should be used to encode that field/column.
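Roughly, the intended usage is something like the sketch below (it uses names from the discussion above and assumes an existing dataframe `df`; the column name is illustrative and this isn't the final API):

```python
# Encode every column of a dataframe that has an encoder attached to its Field:
resource = Resource.from_id("generators_eia860")
df = resource.encode(df)

# Or, when columns have to be cleaned up in a particular order, encode one at a time:
field = resource.get_field("energy_source_1")
df["energy_source_1"] = field.encoder.encode(df["energy_source_1"])
```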
I'm still confused about the root validator vs. the dependency between two fields in a field-level validator. In the `password1` / `password2` example it feels like the dependency is clear -- the validity of `password2` depends on the value of `password1` but not vice-versa. Whereas with the good codes, ignored codes, and fixed/fixable codes, it feels like the dependency is mutual.
But more importantly -- and maybe there's magic going on that I'm not aware of in the background -- if the validation of one value is conditional on the existence of another value in `values`, then how do you know that all of the values are actually getting validated? Like what are all the scenarios in which you could have some of the values defined and not others? Bad input values? Optional parameters? Etc.
Quoting from the pydantic documentation (https://pydantic-docs.helpmanual.io/usage/validators):

> where validators rely on other values, you should be aware that:
>
> - Validation is done in the order fields are defined. E.g. in the example above, `password2` has access to `password1` (and `name`), but `password1` does not have access to `password2`. See Field Ordering for more information on how fields are ordered.
> - If validation fails on another field (or that field is missing) it will not be included in values, hence `if 'password1' in values and ...` in this example.
This seems like a reasonable default. If the first field is invalid (e.g. wrong type), there is no sense using it in further validations since that code will likely fail. For example, as written, the `Encoder` validators assume that `df` is in `values`, and thus that it is a `pd.DataFrame` with columns `code` and `description`. So these will crash if `df` is invalid (just like they would if an invalid `df` were included in `values`). All you need to do is `if 'df' in values` to know that you can proceed to access `values['df']['code']`...
Okay, I finally made these changes and added docstrings as requested by @bendnorman in #1272. Going to merge it into dev.
I found a bunch of foreign key violations while compiling and standardizing the labeling tables (see #1252), which isn't surprising, since we're linking up these coded columns with their home tables / canonical values for the first time and enforcing the relationship. The current arrangement is pretty ad-hoc, and I think we can take advantage of the new metadata setup to make it much, much better. Both more complete and easier to maintain and automatically document.
Current situation

- The code cleanup is scattered across `transform` functions which were largely developed in notebooks, and still haven't been generalized.
- The same kinds of coded columns show up in many different tables (`prime_mover_code` or `energy_source`, etc.) but there's nothing ensuring that they're all processed the same way, or that we've caught all of the columns that should be getting that treatment.
- There's a mix of `df[col].replace()` and `df[col].map()` operations happening. Replace is slow, but lets you change some values without clobbering others. Map is fast, but requires that every single existing value have a corresponding new value, or be set to NA. In some cases we've inadvertently wiped out entire columns like this.

A slight improvement
- The coding table dataframes tell us which values are allowed (for `enum` constrained columns), and also what values they ought to be mapped to (if we're translating a code to a readable label).
- This lets us use `map()` rather than `replace()` and still be sure that we aren't accidentally wiping out something we don't understand. This is what `pudl.helpers.standardize_codes()` does now.
- For a given set of codes (here `ENERGY_SOURCES_EIA`) and a given table (`generators_eia860`), we can look up all the columns that those codes should be applied to, and then iterate over those columns, cleaning them up. If we want to, we can then map the codes to labels (here putting the labels in new columns). A rough sketch follows this list.
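Roughly, the idea is something like this sketch (table contents, column names, and the helper are all illustrative; this is not the actual `pudl.helpers.standardize_codes()` implementation):

```python
import pandas as pd

# Illustrative stand-ins for a static coding table and its cleanup metadata:
codes = pd.DataFrame({
    "code": ["BIT", "SUB", "LIG"],
    "label": ["bituminous_coal", "subbituminous_coal", "lignite_coal"],
})
fix_codes = {"BT": "BIT"}   # known bad values and their corrections
bad_codes = ["", "0"]       # known junk values to set to NA


def standardize_column(col: pd.Series) -> pd.Series:
    """Map every value to a known code, a fixed code, or NA -- and fail loudly otherwise."""
    mapping = {c: c for c in codes["code"]}
    mapping.update(fix_codes)
    mapping.update({c: pd.NA for c in bad_codes})
    unknown = set(col.dropna()) - set(mapping)
    if unknown:
        raise ValueError(f"Unrecognized codes: {unknown}")
    return col.map(mapping)


df = pd.DataFrame({"energy_source_code_1": ["BIT", "BT", "", "LIG"]})
df["energy_source_code_1"] = standardize_column(df["energy_source_code_1"])
# Optionally map the cleaned codes to human readable labels in a new column:
df["energy_source_label_1"] = df["energy_source_code_1"].map(
    codes.set_index("code")["label"]
)
```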
A bigger improvement

Rather than applying this process to one set of coded columns at a time, it could be automatically applied to every column in the dataframe that has a FK relationship pointing to one of the static label/code tables, analogous to the `Resource.format_df()` method. For that to work we'd need:

- To store `fix_codes` and `bad_codes` in close association with the table/dataframe that contains the `good_codes`.
- To make them accessible from `RESOURCE_METADATA`, like the sketch below.

Then in theory I think we could set it up so that something as simple as the call at the end of that sketch would re-code all of the coded columns in a dataframe.
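For illustration only (the key names and structure here are guesses at what this metadata could look like, not a final design):

```python
RESOURCE_METADATA = {
    "energy_sources_eia": {
        # ... the existing resource / schema definition ...
        "encoder": {
            "df": ENERGY_SOURCES_EIA,   # static dataframe of canonical codes
            "code_fixes": {...},        # {bad_code: good_code}
            "ignored_codes": [...],     # codes to be set to NA
        },
    },
}

# Then re-coding every coded column in a dataframe could be as simple as:
df = Resource.from_id("generators_eia860").encode(df)
```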
In addition, we would then be explicitly storing more structured information, which could be exported into the documentation. We would also be notified any time a new, unknown code appears in the data, since the ETL would break.
Tasks:

- Store `code_fixes` and `ignored_codes` with the static dataframe(s) under `metadata`.
- Split `_check_no_overlapping_codes()` into two functions and turn them into validators.
- Add `Resource.get_field(name)` and `Package.get_resource(name)` helper functions.
- Construct and assign encoders in `Package.from_resource_ids()` so all foreign key relationships will exist.
- Implement `Resource.encode(df)`.
- Try out `Resource.encode(df)` for the `energy_sources_eia` coded columns.
- Decide whether we need `Field.encode(Series)` or if we can always encode all columns in a table simultaneously.