Open aesharpe opened 1 year ago
I think this type of fix can probably be generalized beyond the encoded columns to any column of categorical values which we expect to change rarely, if at all. The encoding process will null out bad values, but there are other ways that we already do, or could, ensure that unencoded categorical values are also set to NA (e.g. with the `pd.Series.map()` method).
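
As a rough sketch of the `pd.Series.map()` idea: mapping a column through a dictionary turns any value that isn't a key into NA. The list of valid codes and the helper name `null_unknown_codes` below are made up for illustration; the real valid values would come from wherever the codes are defined.

```python
import pandas as pd

# Illustrative list of valid codes (an assumption, not the real EIA code list).
VALID_STATUS_CODES = ["OP", "RE", "SB"]


def null_unknown_codes(codes: pd.Series, valid: list[str]) -> pd.Series:
    """Return a copy of ``codes`` where any value not in ``valid`` is NA."""
    identity = {code: code for code in valid}
    # Series.map() with a dict yields NaN for any value that is not a key.
    return codes.map(identity).astype("string")


raw = pd.Series(["OP", "CS", "RE", None])
cleaned = null_unknown_codes(raw, VALID_STATUS_CODES)
print(cleaned)
```

Here the unrecognized `CS` and the already-missing value both end up as NA, while `OP` and `RE` pass through unchanged.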
Retaining `report_date` and using it to order the records before filling is probably useful, since we probably want to prioritize continuity. For example, a generator technology type might actually change, and if there's an NA value that lies between the old technology type and the new one, it's much more ambiguous which value should be filled in than if you've got one missing technology type value in the middle of a long series of identical values.
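
One conservative way to act on that is to fill a gap only when the nearest earlier and later values for the same entity agree, and to leave it as NA otherwise. This is just a sketch under assumptions: the function name `fill_flanked_gaps`, the column names, and the data are all hypothetical.

```python
import numpy as np
import pandas as pd


def fill_flanked_gaps(
    df: pd.DataFrame, id_col: str, value_col: str, date_col: str = "report_date"
) -> pd.DataFrame:
    """Fill NA values only where the previous and next known values match."""
    out = df.sort_values([id_col, date_col]).copy()
    grouped = out.groupby(id_col)[value_col]
    prev_val = grouped.ffill()  # nearest earlier value within each ID
    next_val = grouped.bfill()  # nearest later value within each ID
    # Only fill when both neighbors exist and agree; ambiguous gaps stay NA.
    fillable = out[value_col].isna() & prev_val.notna() & (prev_val == next_val)
    out.loc[fillable, value_col] = prev_val[fillable]
    return out


# Made-up data: generator "g1" has a gap between identical values,
# generator "g2" has a gap between two different values.
example = pd.DataFrame(
    {
        "generator_id": ["g1", "g1", "g1", "g2", "g2", "g2"],
        "report_date": pd.to_datetime(
            ["2015-01-01", "2016-01-01", "2017-01-01"] * 2
        ),
        "technology_description": ["A", np.nan, "A", "A", np.nan, "B"],
    }
)
filled = fill_flanked_gaps(example, "generator_id", "technology_description")
print(filled)
```

The gap for `g1` gets filled with `A`, while the gap for `g2` is left as NA because the surrounding values disagree.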
If this kind of data repair happens after the harvesting step then it won't influence our evaluation of whether a value is consistent enough to be retained. It might be reasonable for the repair to happen before the harvesting step in all the tables where the value is originally reported, but that could have downstream effects.
Issue Context
This issue was born out of the discussion in #2602
The `emissions_control_equipment_eia860` table had a value (`CS`) for `operational_status_code` in 2016 that was not outlined in the EIA code documentation for that year (or others).
The solution is two-fold:
1) Null out the bad codes by specifying them in `ignored_codes` as part of the `CODE_METADATA` in `codes.py` (this task is accomplished by #2602).
2) Create a function to interpolate missing `operational_status_code` values and other `CODE_METADATA` that is missing.
Solution Breakdown
I was able to figure out that the rogue `CS` in `operational_status_code` was supposed to be `RE` based on the value for the same `particulate_control_id_eia` in the previous and following years.
We might be able to develop a function that will:
`report_date` to see if the encoded values are consistent in other years
I'm not exactly sure where this function will live or when it will be applied.
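
For the "consistent in other years" check, something like the following could work as a first pass: for each ID, count the distinct non-null codes and the missing values across all years, and flag the ID as a fill candidate only when it has gaps and exactly one observed code. The function name `summarize_code_consistency` and the data are made up, and this isn't tied to any particular place in the pipeline.

```python
import numpy as np
import pandas as pd


def summarize_code_consistency(
    df: pd.DataFrame, id_col: str, code_col: str
) -> pd.DataFrame:
    """Summarize, per ID, how consistent a coded column is across records."""
    summary = pd.DataFrame(
        {
            # nunique() ignores NA values by default.
            "n_distinct": df.groupby(id_col)[code_col].nunique(),
            "n_missing": df[code_col].isna().groupby(df[id_col]).sum(),
        }
    )
    # Candidate for filling: some values missing, and only one code ever observed.
    summary["fillable"] = (summary["n_distinct"] == 1) & (summary["n_missing"] > 0)
    return summary


# Made-up records: ID "A" mimics a nulled-out bad code surrounded by RE,
# ID "B" has a gap but its observed codes differ between years.
records = pd.DataFrame(
    {
        "particulate_control_id_eia": ["A", "A", "A", "B", "B", "B"],
        "report_date": pd.to_datetime(
            ["2015-01-01", "2016-01-01", "2017-01-01"] * 2
        ),
        "operational_status_code": ["RE", np.nan, "RE", "OP", np.nan, "RE"],
    }
)
summary = summarize_code_consistency(
    records, "particulate_control_id_eia", "operational_status_code"
)
print(summary)
```

A summary like this could also be used just for reporting, to see how many gaps are unambiguous before deciding whether any automatic filling is worth it.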