TrentonBush opened this issue 3 years ago
If we're messing with that column anyway, I think we should consider renaming it to be more informative and to conform to the naming conventions we're using elsewhere, e.g. `emissions_unit_id_epa` or `smokestack_unit_id_epa`. At least I think that's what this column is referring to? We should really make it clear exactly what `facility_id`, `unit_id_epa`, and `unitid` each mean, and I don't think it's clear now.
Another thing to be aware of with both Dask and pandas dataframes: if you `groupby()` on a categorical column, make sure you pass `observed=True`, otherwise memory use explodes, since pandas creates a group for every possible category (the full Cartesian product when grouping by multiple keys), even categories that never appear in the data.
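For example, with a made-up frame standing in for the CEMS data (all names here are hypothetical):

```python
import pandas as pd

# A categorical key with ~100k possible categories, only two of which
# actually appear in the data.
units = pd.Categorical(
    ["a", "b"],
    categories=[f"u{i}" for i in range(100_000)] + ["a", "b"],
)
df = pd.DataFrame({"unitid": units, "gross_load_mw": [1.0, 2.0]})

# observed=False (the longtime default) materializes a row for every
# possible category, even the ~100k that never occur:
print(len(df.groupby("unitid", observed=False).sum()))  # 100002

# observed=True only creates groups for categories actually present:
print(len(df.groupby("unitid", observed=True).sum()))   # 2
```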
I would hesitate to change column names because they are basically part of the API, right? Or is this quite new and won't impact many downstream users?
Yeah, it's not great. But we also haven't really promised a stable API at this point. And it's v0.4 so... hopefully people aren't expecting that nothing will change. I don't think a ton of people are working with the CEMS data. With the integration of the EPA crosswalk that connects this table to the EIA data it seems like an appropriate time to rationalize the names and make it obvious what they refer to. And sooner will be better than later... There are some other columns elsewhere in the DB that will need to be renovated. We've talked about this a bit in the context of the entity resolution / harvesting process changes.
Another issue that switching to categoricals for the unit IDs may bring up is that these IDs are stored as strings in the EIA database tables, and if they're strings in one table and categoricals in another, I suspect merging on those columns will require resetting the types.
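If that does come up, the workaround is a one-line cast before merging. A sketch, with made-up frames and contents:

```python
import pandas as pd

# Hypothetical frames: CEMS with a categorical ID, EIA with plain strings.
cems = pd.DataFrame({
    "unitid": pd.Categorical(["1", "2", "2"]),
    "so2_mass": [0.1, 0.2, 0.3],
})
eia = pd.DataFrame({"unitid": ["1", "2"], "plant_name": ["Alpha", "Beta"]})

# How a categorical key merges against a string key has varied across
# pandas versions, so align the dtypes explicitly before merging:
merged = cems.assign(unitid=cems["unitid"].astype(str)).merge(eia, on="unitid")
print(merged)
```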
That makes sense. I figured there would be knock-on effects to consider, and I haven't yet explored how all the datasets are connected. But I wanted to start a discussion/exploration with this issue.
EPA CEMS is bigger than laptop memory; there is no getting around that. But after loading, fully 50% of memory is taken by one column, `unitid`. This column is a string dtype, but could probably be changed to categorical, saving 2GB of memory per year of data and about 50GB across the full dataset.
This can be done by users after loading the data, but I think it would be better to change the dtype in the ETL pipeline, before the data is written to parquet, rather than after loading. Changing dtypes after loading requires first reading the whole string column into memory, which can exceed machine memory, crash the process, and prevent the dataset from loading at all.
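A minimal sketch of what that could look like in the ETL step, assuming pyarrow as the parquet engine (the function name and path are made up):

```python
import pandas as pd

def write_epacems(df: pd.DataFrame, path: str) -> None:
    """Cast the high-cardinality string ID to categorical before writing.

    pyarrow stores pandas categoricals as dictionary-encoded columns,
    so the dtype survives the parquet round trip and readers never pay
    the full string-column memory cost.
    """
    df = df.assign(unitid=df["unitid"].astype("category"))
    df.to_parquet(path, engine="pyarrow")

# write_epacems(cems_df, "epacems-2019.parquet")  # hypothetical usage
```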
Alternatively, the function that loads epacems could read the data in chunks, change the dtype of each chunk, and concatenate them (see the sketch at the end of this post).
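The measurement itself is simple. A minimal sketch, assuming `df` holds one year of CEMS data already loaded into pandas:

```python
# Per-column memory use in MB; deep=True counts the actual string
# payloads rather than just the object pointers.
mem_mb = df.memory_usage(deep=True) / 1e6
print(mem_mb.sort_values(ascending=False))
print(f"total: {mem_mb.sum():.0f} MB")

# What the same column would cost as a categorical, and its cardinality:
as_cat = df["unitid"].astype("category")
print(f"unitid as category: {as_cat.memory_usage(deep=True) / 1e6:.0f} MB")
print(f"unique unitid values: {df['unitid'].nunique()}")
```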
Running that on one year of data shows that `unitid` takes 2069 of 4179 total MB, but when cast to categorical it takes only 69MB, a savings of 2GB. The cardinality is well within range for the categorical dtype, with only 1472 unique values. There will doubtless be more categories as more years are added, but only slightly more.
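As for the chunked alternative, a sketch using pyarrow's batch reader (the path, batch size, and single-column handling are assumptions, not the actual pudl loader):

```python
import pandas as pd
import pyarrow.parquet as pq
from pandas.api.types import union_categoricals

def load_epacems_chunked(path: str, batch_size: int = 1_000_000) -> pd.DataFrame:
    """Read parquet in batches, casting unitid per batch, so the full
    string column never has to sit in memory all at once."""
    chunks = []
    for batch in pq.ParquetFile(path).iter_batches(batch_size=batch_size):
        chunk = batch.to_pandas()
        chunk["unitid"] = chunk["unitid"].astype("category")
        chunks.append(chunk)

    # Each chunk sees different unitid values, so unify the category sets
    # before concatenating; otherwise pandas falls back to object dtype.
    all_cats = union_categoricals([c["unitid"] for c in chunks]).categories
    for chunk in chunks:
        chunk["unitid"] = chunk["unitid"].cat.set_categories(all_cats)
    return pd.concat(chunks, ignore_index=True)

# df = load_epacems_chunked("epacems-2019.parquet")  # hypothetical usage
```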