Open e-belfer opened 10 months ago
There's a complication here in that we have a standardized way of applying PUDL dtypes to tables when we write to or read from SQLite, but IIRC we aren't currently using an IO Manager for the EPA CEMS Parquet outputs, and that's where we would need to apply these dtypes.
Also, because we're writing to Parquet, we probably want to implement these dtypes through the `Resource.to_pyarrow()` method, which would read in whatever metadata (e.g. ENUM constraints, nullability) has been associated with the fields and table, and translate it into a valid PyArrow schema. This is also something that we need to do for #3102.
**Is your feature request related to a problem? Please describe.**
Right now, CEMS datatypes are not handled like FERC or EIA data. Datatypes are not defined using `codes.py` or `fields.py`; instead, a few constraints are imposed using `apply_pudl_dtypes`. As of #3187, we now apply some dtypes when we read in raw CEMS data to reduce memory usage, but we don't enforce a set of categories for categorical columns or define metadata for these fields.

**Describe the solution you'd like**
Reconfigure CEMS column dtype handling to be more similar to other datasets. CEMS data is stored only as Parquet, so column-level descriptions may not be required. In particular, categoricals should be defined in `codes.py`.

**Describe alternatives you've considered**
Currently, columns are assigned dtypes in `pudl.extract.epacems` on read-in, using a dictionary. See #3187.
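For context, the current dictionary-based approach looks roughly like the following sketch. The column names, dtypes, and category values are illustrative assumptions, not the actual `pudl.extract.epacems` definitions:

```python
import pandas as pd

# Hypothetical dtype dictionary applied at read-in time. Nothing here ties
# the categorical values back to an ENUM defined in codes.py; the allowed
# categories live only in this extraction-time dictionary.
EPACEMS_DTYPES = {
    "plant_id_eia": "int32",
    "gross_load_mw": "float32",
    "measurement_code": pd.CategoricalDtype(["LME", "Measured", "Other"]),
}

df = pd.DataFrame(
    {
        "plant_id_eia": [3, 3],
        "gross_load_mw": [4.0, 5.5],
        "measurement_code": ["Measured", "LME"],
    }
).astype(EPACEMS_DTYPES)
```

Moving the category definitions into `codes.py` would let the same metadata drive both the pandas dtypes and the PyArrow schema, instead of keeping them in an extraction-module dictionary.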