catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Add function to interpolate gaps in encoded columns #2616

Open aesharpe opened 1 year ago

aesharpe commented 1 year ago

Issue Context

This issue was born out of the discussion in #2602

The emissions_control_equipment_eia860 table had a value (CS) in operational_status_code in 2016 that was not outlined in the EIA code documentation for that year (or any other).

The solution is two-fold:

1) Null out the bad codes by specifying them in ignored_codes as part of the CODE_METADATA in codes.py (this task is accomplished by #2602)

2) Create a function to interpolate missing operational_status_code values, along with other missing coded values covered by CODE_METADATA.
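For step 1, the fix is roughly a fragment like the following. This is a hypothetical sketch: the entry name and exact key names are assumptions, and the real CODE_METADATA structure in codes.py may differ.

```python
# Hypothetical fragment modeled on a CODE_METADATA-style entry; the real
# structure and key names in pudl's codes.py may differ.
OPERATIONAL_STATUS_EIA = {
    "ignored_codes": ["CS"],  # undocumented value seen in 2016; null it out
}
```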

Solution Breakdown

I was able to figure out that the rogue CS in operational_status_code was supposed to be RE based on the value for the same particulate_control_id_eia in the previous and following years.

We might be able to develop a function that fills these gaps based on the surrounding years' values for the same ID column.

I'm not exactly sure where this function will live or when it will be applied.
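A hedged sketch of what such a function might look like, assuming pandas and using the column names from this table as stand-ins (the function name and signature are my own invention). It fills a missing code only when the previous and next observed values for the same ID agree, mirroring the CS → RE inference above:

```python
import pandas as pd

def fill_code_gaps(df, id_col, date_col, code_col):
    """Fill NA codes when the surrounding values for the same ID agree.

    Hypothetical sketch: for each entity, sort by report date and fill a
    missing code only when the codes before and after the gap match.
    """
    df = df.sort_values([id_col, date_col]).copy()
    grouped = df.groupby(id_col)[code_col]
    ffilled = grouped.ffill()
    bfilled = grouped.bfill()
    # Only fill where both directions agree, so a gap sitting between two
    # different codes (a possible real transition) is left as NA.
    agree = ffilled == bfilled
    df.loc[df[code_col].isna() & agree, code_col] = ffilled
    return df

# Toy demo: entity "a" has an unambiguous gap (RE ... RE); entity "b" has
# an ambiguous gap between two different codes (OP ... RS).
example = pd.DataFrame({
    "particulate_control_id_eia": ["a", "a", "a", "b", "b", "b"],
    "report_year": [2015, 2016, 2017, 2015, 2016, 2017],
    "operational_status_code": ["RE", None, "RE", "OP", None, "RS"],
})
out = fill_code_gaps(
    example,
    "particulate_control_id_eia",
    "report_year",
    "operational_status_code",
)
```

With this logic, the 2016 gap for "a" is filled with RE, while the gap for "b" stays NA because its neighbors disagree.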

zaneselvans commented 1 year ago

I think this type of fix can probably be generalized beyond the encoded columns to any column of categorical values which we expect to change rarely if at all. The encoding process will null out bad values, but there are other ways that we already do, or could, ensure that unencoded categorical values are also set to NA (e.g. with the pd.Series.map() method).
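To illustrate the pd.Series.map() approach (the code values here are just placeholders): map() sends any value absent from the mapping dict to NaN, so unrecognized categorical codes get nulled in a single pass.

```python
import pandas as pd

# Hypothetical illustration: values missing from the dict map to NaN,
# so the undocumented "CS" code is nulled out automatically.
valid_codes = {"OP": "OP", "RE": "RE", "RS": "RS"}
statuses = pd.Series(["OP", "CS", "RE"])
cleaned = statuses.map(valid_codes)
```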

Retaining report_date and using it to order the records before filling is probably useful, since we probably want to prioritize continuity -- e.g. a generator technology type might actually change, and if there's an NA value that lies between the old technology type and the new technology type, it's much more ambiguous as to what value should be filled than if you've got one missing technology type value in the middle of a long series of identical values.
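A small illustration of why ordering matters (a made-up technology-type series, not real data): a plain forward fill after sorting by date assumes continuity everywhere, which is fine inside a run of identical values but silently resolves the ambiguous gap at a transition.

```python
import pandas as pd

# Hypothetical series ordered by report_date. The 2016 gap sits inside a
# run of "solar" and is unambiguous; the 2018 gap sits at the transition
# to "wind" and is not.
tech = pd.Series(
    ["solar", None, "solar", None, "wind"],
    index=pd.to_datetime(["2015", "2016", "2017", "2018", "2019"]),
)
# ffill() fills both gaps with "solar", even though the 2018 value may
# actually have been "wind" already.
filled = tech.ffill()
```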

If this kind of data repair happens after the harvesting step then it won't influence our evaluation of whether a value is consistent enough to be retained. It might be reasonable for the repair to happen before the harvesting step in all the tables where the value is originally reported, but that could have downstream effects.