TrentonBush opened this issue 3 years ago
If we're messing with that column anyway, I think we should consider renaming it to be more informative and to conform to the naming conventions we're using elsewhere, e.g. `emissions_unit_id_epa` or `smokestack_unit_id_epa`. At least I think that's what this column is referring to? We should really make it clear exactly what `facility_id`, `unit_id_epa`, and `unitid` each mean, and I don't think it's clear now.
Another thing to be aware of with both Dask and pandas dataframes: if you `groupby()` on a categorical column, make sure you pass `observed=True`, otherwise memory use explodes, since pandas creates a group for every possible category (the full Cartesian product when grouping by multiple keys), even categories that never appear in the data.
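For example, with a made-up frame standing in for the CEMS data (all names here are hypothetical):

```python
import pandas as pd

# A categorical key with ~100k possible categories, only two of which
# actually appear in the data.
units = pd.Categorical(
    ["a", "b"],
    categories=[f"u{i}" for i in range(100_000)] + ["a", "b"],
)
df = pd.DataFrame({"unitid": units, "gross_load_mw": [1.0, 2.0]})

# observed=False (the longtime default) materializes a row for every
# possible category, even the ~100k that never occur:
print(len(df.groupby("unitid", observed=False).sum()))  # 100002

# observed=True only creates groups for categories actually present:
print(len(df.groupby("unitid", observed=True).sum()))   # 2
```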
I would hesitate to change column names because they are basically part of the API, right? Or is this quite new and won't impact many downstream users?
Yeah, it's not great. But we also haven't really promised a stable API at this point. And it's v0.4 so... hopefully people aren't expecting that nothing will change. I don't think a ton of people are working with the CEMS data. With the integration of the EPA crosswalk that connects this table to the EIA data it seems like an appropriate time to rationalize the names and make it obvious what they refer to. And sooner will be better than later... There are some other columns elsewhere in the DB that will need to be renovated. We've talked about this a bit in the context of the entity resolution / harvesting process changes.
Another issue that switching to categoricals for the unit IDs may bring up is that these IDs are stored as strings in the EIA database tables, and if they're strings in one table and categoricals in another, I suspect merging on those columns will require resetting the types.
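If that does come up, the workaround is a one-line cast before merging. A sketch, with made-up frames and contents:

```python
import pandas as pd

# Hypothetical frames: CEMS with a categorical ID, EIA with plain strings.
cems = pd.DataFrame({
    "unitid": pd.Categorical(["1", "2", "2"]),
    "so2_mass": [0.1, 0.2, 0.3],
})
eia = pd.DataFrame({"unitid": ["1", "2"], "plant_name": ["Alpha", "Beta"]})

# How a categorical key merges against a string key has varied across
# pandas versions, so align the dtypes explicitly before merging:
merged = cems.assign(unitid=cems["unitid"].astype(str)).merge(eia, on="unitid")
print(merged)
```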
That makes sense. I figured there would be knock-on effects to consider, and I haven't yet explored how all the datasets are connected. But I wanted to start a discussion/exploration with this issue.
EPA CEMS is bigger than laptop memory; there is no getting around that. But after loading, fully 50% of memory is taken by one column, `unitid`. This column is a string dtype, but could probably be changed to categorical, saving 2GB of memory per year of data and about 50GB across the full dataset.
This can be done by users after loading the data, but I think it would be better to change the dtype in the ETL pipeline, before the data is written to parquet, rather than after loading. Changing dtypes after loading requires first reading the whole string column into memory, which can exceed machine memory, crash the process, and prevent the dataset from loading at all.
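A minimal sketch of what that could look like in the ETL step, assuming pyarrow as the parquet engine (the function name and path are made up):

```python
import pandas as pd

def write_epacems(df: pd.DataFrame, path: str) -> None:
    """Cast the high-cardinality string ID to categorical before writing.

    pyarrow stores pandas categoricals as dictionary-encoded columns,
    so the dtype survives the parquet round trip and readers never pay
    the full string-column memory cost.
    """
    df = df.assign(unitid=df["unitid"].astype("category"))
    df.to_parquet(path, engine="pyarrow")

# write_epacems(cems_df, "epacems-2019.parquet")  # hypothetical usage
```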
Alternatively, the function that loads epacems could read the data in chunks, change the dtype of each chunk, and concatenate them (see the sketch at the end of this post).
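The measurement itself is simple. A minimal sketch, assuming `df` holds one year of CEMS data already loaded into pandas:

```python
# Per-column memory use in MB; deep=True counts the actual string
# payloads rather than just the object pointers.
mem_mb = df.memory_usage(deep=True) / 1e6
print(mem_mb.sort_values(ascending=False))
print(f"total: {mem_mb.sum():.0f} MB")

# What the same column would cost as a categorical, and its cardinality:
as_cat = df["unitid"].astype("category")
print(f"unitid as category: {as_cat.memory_usage(deep=True) / 1e6:.0f} MB")
print(f"unique unitid values: {df['unitid'].nunique()}")
```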
Running that on one year of data shows that `unitid` takes 2069 of 4179 total MB, but when cast to categorical it takes only 69MB, a savings of 2GB. The cardinality is well within range for the categorical dtype, with only 1472 unique values. There will doubtless be more categories as more years are added, but only slightly more.
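As for the chunked alternative, a sketch using pyarrow's batch reader (the path, batch size, and single-column handling are assumptions, not the actual pudl loader):

```python
import pandas as pd
import pyarrow.parquet as pq
from pandas.api.types import union_categoricals

def load_epacems_chunked(path: str, batch_size: int = 1_000_000) -> pd.DataFrame:
    """Read parquet in batches, casting unitid per batch, so the full
    string column never has to sit in memory all at once."""
    chunks = []
    for batch in pq.ParquetFile(path).iter_batches(batch_size=batch_size):
        chunk = batch.to_pandas()
        chunk["unitid"] = chunk["unitid"].astype("category")
        chunks.append(chunk)

    # Each chunk sees different unitid values, so unify the category sets
    # before concatenating; otherwise pandas falls back to object dtype.
    all_cats = union_categoricals([c["unitid"] for c in chunks]).categories
    for chunk in chunks:
        chunk["unitid"] = chunk["unitid"].cat.set_categories(all_cats)
    return pd.concat(chunks, ignore_index=True)

# df = load_epacems_chunked("epacems-2019.parquet")  # hypothetical usage
```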