catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
466 stars, 107 forks

Ensure categorical columns use appropriate NA values #1210

Closed: zaneselvans closed this issue 2 years ago

zaneselvans commented 2 years ago

In some of our categorical columns, the values at the end of the transform step are the enumerated categories plus the empty string. These columns are nullable in the database, so the empty strings should really be pd.NA. Currently, though, some of our enumerations include the empty string just to avoid failing constraint checks. The empty string should be removed from all of those enumerations, and the empty strings in the corresponding columns should be replaced with pd.NA before loading. Affected columns include:

FERC 1

EIA

EPA CEMS
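For illustration, the cleanup described above could look roughly like the following. This is only a minimal sketch, not PUDL's actual transform code: the helper name `drop_empty_string_category` and the `fuel_type` column are hypothetical, and it assumes the affected columns can be cast to pandas' `category` dtype. It relies on `Series.cat.remove_categories`, which sets any values belonging to a removed category to missing.

```python
import pandas as pd


def drop_empty_string_category(df: pd.DataFrame, cols: list[str]) -> pd.DataFrame:
    """Hypothetical helper: turn "" placeholders into missing values.

    Returns a copy of ``df`` in which each listed column is categorical
    and the empty string is no longer one of its categories.
    """
    out = df.copy()
    for col in cols:
        cat = out[col].astype("category")
        if "" in cat.cat.categories:
            # Values in a removed category become missing (NaN / pd.NA).
            cat = cat.cat.remove_categories([""])
        out[col] = cat
    return out


# Example with a made-up column; real column names would come from the
# FERC 1, EIA, and EPA CEMS tables listed above.
example = pd.DataFrame({"fuel_type": ["gas", "", "coal", ""]})
cleaned = drop_empty_string_category(example, ["fuel_type"])
```

After a step like this, the empty string could also be dropped from the corresponding enumeration in the metadata, since the nullable database column would receive NULL rather than `""`.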

ezwelty commented 2 years ago

In https://github.com/catalyst-cooperative/pudl/pull/806#issue-509727837 I listed additional placeholder values that need to be replaced with null in the data and removed from the metadata, namely: