catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Clean up CEMS handling of datatypes #3221

Open e-belfer opened 5 months ago

e-belfer commented 5 months ago

Is your feature request related to a problem? Please describe. Right now, CEMS datatypes are not handled the same way as FERC or EIA data. They are not defined using codes.py or fields.py; instead, a few constraints are imposed using apply_pudl_dtypes. As of #3187, we apply some dtypes when we read in raw CEMS data to reduce memory usage, but we don't enforce a set of categories for categorical columns or define metadata for these fields.

Describe the solution you'd like Reconfigure CEMS column dtype handling to be more similar to other datasets. CEMS data is stored only as Parquet, so column-level descriptions may not be required. In particular, categoricals should be defined in codes.py.

Describe alternatives you've considered Currently columns are assigned dtypes in pudl.extract.epacems on read-in, using a dictionary. See #3187.
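
To illustrate the gap described above, here is a minimal sketch of a dtype dictionary that also pins down the allowed categories. It is not the code in pudl.extract.epacems: the column names, the STATE_CODES list, and the CEMS_DTYPES dictionary are all hypothetical, and in practice the category list would presumably come from codes.py.

```python
import pandas as pd

# Hypothetical category list, standing in for something defined in codes.py.
STATE_CODES = ["CA", "NY", "TX"]

# Hypothetical column names and dtypes, for illustration only.
CEMS_DTYPES = {
    "state": pd.CategoricalDtype(categories=STATE_CODES),
    "gross_load_mw": "float32",
    "operating_time_hours": "float32",
}

# A tiny stand-in for raw CEMS records; "ZZ" is not an allowed category.
raw = pd.DataFrame(
    {
        "state": ["CA", "TX", "ZZ"],
        "gross_load_mw": [100.5, 200.0, 50.0],
        "operating_time_hours": [1.0, 0.5, 1.0],
    }
)

# Casting to a CategoricalDtype with fixed categories turns any value
# outside that list into a missing value, which makes bad codes visible.
df = raw.astype(CEMS_DTYPES)
unexpected = raw.loc[df["state"].isna(), "state"].tolist()
```

With an unconstrained "category" dtype, the "ZZ" value would simply become a new category. Fixing the categories up front is what makes it possible to detect or reject it.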

zaneselvans commented 5 months ago

There's a complication here in that we have a standardized way of applying PUDL dtypes to tables when we write to or read from SQLite, but IIRC we aren't currently using an IO Manager for the EPA CEMS Parquet outputs, and that's where we would need to apply these dtypes.

Also, because we're writing to Parquet, we probably want to implement these dtypes through the Resource.to_pyarrow() method, which would read in whatever metadata (e.g. ENUM constraints, nullability) has been associated with the fields and table, and translate it into a valid PyArrow schema. This is also something that we need to do for #3102.