Clean up CEMS handling of datatypes

catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.

MIT License

Is your feature request related to a problem? Please describe.

Right now, CEMS datatypes are not handled the way FERC and EIA datatypes are. They are not defined in codes.py or fields.py; instead, only a few constraints are imposed using apply_pudl_dtypes. As of #3187, we apply some dtypes when reading in raw CEMS data to reduce memory usage, but we neither enforce a fixed set of categories for categorical columns nor define metadata for these fields.

Describe the solution you'd like

Reconfigure CEMS column dtype handling so that it more closely matches the other datasets. CEMS data is stored only as Parquet, so column-level descriptions may not be required. Categoricals in particular should be defined in codes.py.
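To make the request concrete, here is a minimal sketch of what enforcing a fixed category set could look like, using only pandas. The names `EXAMPLE_CODES`, `enforce_categories`, and `example_measurement_code`, along with the code values, are hypothetical placeholders; they are not the real CEMS vocabulary or anything that currently exists in codes.py.

```python
import pandas as pd

# Hypothetical code table: the column name and allowed values are
# illustrative only, not the actual CEMS codes.
EXAMPLE_CODES: dict[str, list[str]] = {
    "example_measurement_code": ["measured", "calculated", "substitute"],
}


def enforce_categories(
    df: pd.DataFrame, codes: dict[str, list[str]]
) -> pd.DataFrame:
    """Cast columns to categoricals with a fixed, predefined category set.

    Raises ValueError if a column holds a value outside its allowed set,
    rather than silently turning that value into NA.
    """
    out = df.copy()
    for col, allowed in codes.items():
        if col not in out.columns:
            continue
        unexpected = set(out[col].dropna().unique()) - set(allowed)
        if unexpected:
            raise ValueError(
                f"Unexpected values in {col}: {sorted(unexpected)}"
            )
        out[col] = out[col].astype(pd.CategoricalDtype(categories=allowed))
    return out


raw = pd.DataFrame(
    {"example_measurement_code": ["measured", "calculated", None]}
)
clean = enforce_categories(raw, EXAMPLE_CODES)
```

With a fixed `CategoricalDtype`, every partition of the output carries the same category set, including categories that happen not to appear in a given chunk of data.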

Describe alternatives you've considered

Currently, columns are assigned dtypes on read-in in pudl.extract.epacems, using a dictionary. See #3187.
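For illustration, the dictionary-on-read-in pattern looks roughly like the following. This is a sketch and not the code from #3187: the column names, the dtypes, and the use of CSV input are all assumptions made for the example.

```python
import io

import pandas as pd

# Hypothetical dtype dictionary; the real one lives in pudl.extract.epacems
# and uses different column names.
EXAMPLE_READ_DTYPES = {
    "example_plant_id": "Int32",
    "example_load_mw": "float32",
    "example_measurement_code": "category",
}

# Tiny in-memory stand-in for a raw file (assumed to be CSV here).
raw_csv = io.StringIO(
    "example_plant_id,example_load_mw,example_measurement_code\n"
    "3,120.5,measured\n"
    "3,,calculated\n"
)

df = pd.read_csv(raw_csv, dtype=EXAMPLE_READ_DTYPES)

# With dtype="category", the categories are inferred from whatever values
# appear in this particular file; nothing constrains them to a known set.
inferred = list(df["example_measurement_code"].cat.categories)
```

This reduces memory use, but because the categories are inferred per file, two files can end up with different category sets for the same column.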

Clean up CEMS handling of datatypes #3221