catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

clean operational_status_code from eia860 #1588

Closed cmgosnell closed 2 years ago

cmgosnell commented 2 years ago

The operational_status_code column... is a bit of a mess. Most of the mess comes from the eia860m data (see below). Let's clean it up! operational_status_code should probably end up being an enum in our metadata, and we should be able to map some of these bad codes to good codes. I imagine we could use some regex magic to pull the code out of the 2021 values, which tend to look like "(CODE) Some long description".
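A rough sketch of the regex idea, not a tested PUDL transform. The sample values and the extract_status_code helper are made up for illustration, and the assumption that codes are one or two letters would need checking against the real data.

```python
import pandas as pd

# Hypothetical sample values, chosen to mimic the shapes described above.
raw = pd.Series(["OP", "(OP) Operating", "(SB) Standby/Backup", "op", None, " RE "])


def extract_status_code(codes: pd.Series) -> pd.Series:
    """Reduce values like "(CODE) Some long description" to just "CODE"."""
    cleaned = codes.astype("string").str.strip()
    # Capture the code inside leading parentheses; assumes 1-2 letter codes.
    extracted = cleaned.str.extract(r"^\((?P<code>[A-Za-z]{1,2})\)", expand=False)
    # Values that were already bare codes do not match, so fall back to them.
    return extracted.fillna(cleaned).str.upper()


clean = extract_status_code(raw)
```

Nulls stay null here, so they can still be counted separately once the column is cleaned.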

We are using these codes to generate the operational_status column here. The desired end state of this issue is an operational_status_code column clean enough that every remaining code can be mapped to an operational_status value.
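One way to check that end state is to map the cleaned codes and list whatever is left unmapped. This is only a sketch: STATUS_BY_CODE is a small illustrative dictionary, not the mapping PUDL actually uses, and map_operational_status is a hypothetical helper name.

```python
import pandas as pd

# Assumed, partial mapping for illustration only.
STATUS_BY_CODE = {
    "OP": "existing",
    "SB": "existing",
    "RE": "retired",
    "P": "proposed",
}


def map_operational_status(codes: pd.Series) -> tuple[pd.Series, list[str]]:
    """Map codes to statuses and report the non-null codes with no mapping."""
    status = codes.map(STATUS_BY_CODE)
    # A code is unmapped if it is present but produced no status.
    unmapped = sorted(codes[status.isna() & codes.notna()].unique())
    return status, unmapped


codes = pd.Series(["OP", "RE", "XX", None])
status, unmapped = map_operational_status(codes)
```

If unmapped comes back empty on the full generators table, every end-state code has a status, which is the goal described above.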

Demonstration of the problem

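# pudl_out is an existing PUDL output object (its setup is not shown in this issue).
gens = pudl_out.gens_eia860()
# Count generator records per operational_status_code across all years, keeping nulls.
(
    gens.assign(count=1)
    .groupby(["operational_status_code"], dropna=False)[["count"]]
    .count()
    .sort_values(["count"], ascending=False)
)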
[image: code counts for all report years]
# Same counts, excluding report year 2021.
(
    gens[gens.report_date.dt.year != 2021].assign(count=1)
    .groupby(["operational_status_code"], dropna=False)[["count"]]
    .count()
    .sort_values(["count"], ascending=False)
)
[image: code counts excluding report year 2021]

annnnd

# Same counts, report year 2021 only.
(
    gens[gens.report_date.dt.year == 2021].assign(count=1)
    .groupby(["operational_status_code"], dropna=False)[["count"]]
    .count()
    .sort_values(["count"], ascending=False)
)
[image: code counts for report year 2021 only]

cmgosnell commented 2 years ago

expected codes from eia860:

[two images of the expected codes]