catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Reduce memory usage and manage RMI/PPL data dependencies #1457

Closed: zaneselvans closed this issue 1 year ago

zaneselvans commented 2 years ago

In the course of setting up continuous integration in the rmi-ferc1-eia repository, we discovered that the current plant part list generation process uses a huge amount of memory -- something like 12 GB between the PPL itself, the MCOE, and the dataframes that feed into creating the PPL. This is bad in a variety of ways, and also prevents the CI from actually running since the GitHub runners only have 7 GB of memory.

Our current homebrew caching and dependency management system, which we've been relying on in the PudlTabl class and now the RMI Output class, is pretty messy and complex, and it makes no pretense of caring about how much memory is being used.

Somehow we need to manage the dependencies between different outputs, and how they are cached, more effectively. Some options:

  • Delete the gens_mega_eia and true_grans_eia dataframes that feed into the generation of the plant_parts_eia dataframe after they've been used.
  • Drop columns that aren't being used from these big dataframes.
  • Convert object type columns to string where appropriate.
  • Use Dagster to refactor the RMI output object, persist dataframes to disk using their system, and manage the dependencies between the different dataframes using the Dagster graph abstraction.
  • Switch to using database views for simple output tables, allowing that data to remain in the DB unless we're actually using it -- it's the pandas merges that make pulling that data super slow. Doing the same joins in SQL is almost instantaneous.

bendnorman commented 2 years ago

I'm rerunning the tests and profiling the code to better understand where the memory is blowing up.

> Delete the gens_mega_eia and true_grans_eia dataframes that feed into the generation of the plant_parts_eia dataframe after they've been used.

> Drop columns that aren't being used from these big dataframes.

These ideas could work but do not seem sustainable if the outputs require more dataframes or columns.
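
As a minimal sketch of those two ideas, something like the following could be done in the PPL generation code; the assembly logic and the column list below are placeholders, not the actual rmi-ferc1-eia code:

```python
import gc

import pandas as pd


def trimmed_plant_parts(
    gens_mega_eia: pd.DataFrame,
    true_grans_eia: pd.DataFrame,
    keep_cols: list[str],
) -> pd.DataFrame:
    """Build the PPL, free its big inputs, and keep only the needed columns."""
    # Placeholder for the real PPL assembly logic in rmi-ferc1-eia.
    plant_parts_eia = pd.concat([gens_mega_eia, true_grans_eia], axis="columns")

    # Idea 1: drop the local references to the big inputs once the PPL exists.
    # (This only helps if the caller doesn't keep its own references to them.)
    del gens_mega_eia, true_grans_eia
    gc.collect()

    # Idea 2: keep only the columns downstream code actually uses.
    return plant_parts_eia[keep_cols]
```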

> Convert object type columns to string where appropriate.

I'm on a quest to eradicate all object column types so I'm in favor of this idea. What kind of memory savings will this give us? How much cleaning will this require?
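
To put rough numbers on the savings question, here is a toy comparison on made-up data with roughly the PPL's row count. The big win for low-cardinality columns comes from category; the default python-backed string dtype stores the same Python strings that object does, so its deep memory usage is similar:

```python
import pandas as pd

# Toy column: ~3M rows, 2 distinct values.
s = pd.Series(["owned", "total"] * 1_500_000, dtype="object")
for dtype in ["object", "string", "category"]:
    mb = s.astype(dtype).memory_usage(deep=True) / 2**20
    print(f"{dtype:>8}: {mb:,.1f} MB")
```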

> Use Dagster to refactor the RMI output object, persist dataframes to disk using their system, and manage the dependencies between the different dataframes using the Dagster graph abstraction.

Dagster provides a standard way to persist interim data frames to disk and hopefully save memory. It could also bring some clarity to the plant parts logic.

I don't think persisting dataframes to disk will solve our memory issues if we have one or two large dataframes hogging all of the memory. We might need to use dask for the processing.
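
For concreteness, a rough sketch of what the Dagster approach could look like with software-defined assets, where each asset's output is handed to an IO manager that persists it to disk. The asset bodies here are placeholders, not the actual refactor:

```python
import pandas as pd
from dagster import asset, materialize

# Each asset's return value gets handed to an IO manager (the default one
# pickles it to disk), so upstream dataframes don't have to be kept around
# in a cache attribute once downstream assets have been computed.

@asset
def gens_mega_eia() -> pd.DataFrame:
    """Placeholder: build the generators 'mega' table from PUDL outputs."""
    ...

@asset
def true_grans_eia(gens_mega_eia: pd.DataFrame) -> pd.DataFrame:
    """Placeholder: flag the true granularities."""
    ...

@asset
def plant_parts_eia(gens_mega_eia: pd.DataFrame, true_grans_eia: pd.DataFrame) -> pd.DataFrame:
    """Placeholder: assemble the plant part list from its inputs."""
    ...

if __name__ == "__main__":
    # Dagster infers the dependency graph from the asset function signatures.
    materialize([gens_mega_eia, true_grans_eia, plant_parts_eia])
```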

> Switch to using database views for simple output tables, allowing that data to remain in the DB unless we're actually using it -- it's the pandas merges that make pulling that data super slow. Doing the same joins in SQL is almost instantaneous.

This seems like a great solution if we are only doing joins between large datasets.
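
A sketch of the database-view idea against the PUDL SQLite DB; the view definition and the table/column names are illustrative rather than the real schema:

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect("pudl.sqlite")
# The join happens inside SQLite when the view is queried, not in pandas.
conn.execute(
    """
    CREATE VIEW IF NOT EXISTS generators_denorm AS
    SELECT gens.*, plants.plant_name_eia, plants.utility_id_eia
    FROM generators_eia860 AS gens
    JOIN plants_eia860 AS plants USING (plant_id_eia, report_date)
    """
)
# Only the rows and columns we actually ask for get pulled into a dataframe,
# and only at the moment we need them.
df = pd.read_sql(
    "SELECT * FROM generators_denorm WHERE report_date >= '2019-01-01'", conn
)
```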

zaneselvans commented 2 years ago

Given that the entire PUDL database is only like 500 MB on disk, it seems like in most cases we shouldn't need to have multiple multi-GB dataframes flowing around. I suspect that if we aren't inadvertently loading several copies of the same dataframe or keeping a bunch of columns we don't need for the work at hand, persisting them to disk along the way will be enough, at least until we've got the whole DAG running on our own infrastructure.

Looking at the saved PPL dataframe, it seems like a bunch of object columns that could be strings or categoricals make up a large majority of the memory usage, and I suspect none of the ints or floats need to be 64-bit.

```python
import pandas as pd

ppl_eia = pd.read_pickle("../../projects/rmi-ferc1-eia/outputs/plant_parts_eia.pkl.gz")
# Per-column memory usage (MB, counting string contents), dtype, and cardinality:
dude = pd.DataFrame({
    "memory_usage_mb": round(ppl_eia.memory_usage(deep=True) / 2**20, ndigits=2),
    "dtype": ppl_eia.dtypes,
    "nunique": ppl_eia.nunique(),
}).sort_values("memory_usage_mb", ascending=False)
dude
```
| column | memory_usage_mb | dtype | nunique |
|---|---:|---|---:|
| Index | 307.72 | | |
| plant_part_id_eia | 292.17 | object | 469178 |
| appro_record_id_eia | 278.05 | object | 1368717 |
| plant_name_new | 243.81 | string | 112474 |
| technology_description | 242.86 | object | 28 |
| plant_name_eia | 224.84 | string | 13901 |
| plant_part | 215.36 | object | 7 |
| plant_id_report_year | 204.93 | object | 167546 |
| operational_status_pudl | 204.32 | object | 4 |
| appro_part_label | 199.92 | object | 7 |
| ownership | 192.75 | object | 2 |
| ferc_acct_name | 191.59 | object | 4 |
| fuel_type_code_pudl | 188.39 | string | 9 |
| energy_source_code_1 | 184.38 | string | 39 |
| prime_mover_code | 178.82 | object | 19 |
| generator_id | 163.44 | string | 9489 |
| unit_id_pudl | 27.98 | Int64 | 10 |
| utility_id_eia | 27.98 | Int64 | 10457 |
| plant_id_pudl | 27.98 | Int64 | 13997 |
| plant_id_eia | 27.98 | Int64 | 14143 |
| utility_id_pudl | 27.98 | Int64 | 9430 |
| installation_year | 27.98 | Int64 | 127 |
| report_date | 24.87 | datetime64[ns] | 21 |
| report_year | 24.87 | int64 | 21 |
| capacity_eoy_mw | 24.87 | float64 | 9908 |
| capacity_factor | 24.87 | float64 | 227035 |
| total_mmbtu | 24.87 | float64 | 60893 |
| total_fuel_cost | 24.87 | float64 | 26336 |
| capacity_mw | 24.87 | float64 | 10752 |
| retirement_date | 24.87 | datetime64[ns] | 317 |
| planned_retirement_date | 24.87 | datetime64[ns] | 256 |
| record_count | 24.87 | float64 | 40 |
| fraction_owned | 24.87 | float64 | 3889 |
| fuel_cost_per_mmbtu | 24.87 | float64 | 8866 |
| fuel_cost_per_mwh | 24.87 | float64 | 16841 |
| heat_rate_mmbtu_mwh | 24.87 | float64 | 32386 |
| net_generation_mwh | 24.87 | float64 | 289650 |
| operational_status | 3.11 | category | 3 |
| true_gran | 3.11 | bool | 2 |
| ownership_dupe | 3.11 | bool | 2 |

zaneselvans commented 2 years ago

Turning the object columns into string or category columns, and converting some of the low-cardinality string columns to category, cuts the total memory usage down to about 2.2 GB:

```python
# record_id_eia is the index, so reset_index() first so its dtype can be set too.
ppl_eia_small_types = ppl_eia.reset_index().astype({
    "record_id_eia": "string",
    "plant_part_id_eia": "string",
    "appro_record_id_eia": "string",
    "technology_description": "category",
    "energy_source_code_1": "category",
    "plant_part": "category",
    "plant_id_report_year": "string",
    "operational_status_pudl": "category",
    "appro_part_label": "category",
    "ownership": "category",
    "ferc_acct_name": "category",
    "fuel_type_code_pudl": "category",
    "prime_mover_code": "category",
})
```
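
Building on the snippet above, a quick way to sanity-check the before/after footprint, plus float downcasting in the spirit of the "nothing needs to be 64-bit" observation (a sketch, not what was actually run here):

```python
def total_mb(df):
    """Total deep memory usage of a dataframe, in MB."""
    return df.memory_usage(deep=True).sum() / 2**20

print(f"before: {total_mb(ppl_eia):,.0f} MB  after: {total_mb(ppl_eia_small_types):,.0f} MB")

# Going further: downcast float64 columns to float32 where the precision loss
# is acceptable for these analysis outputs.
float_cols = ppl_eia_small_types.select_dtypes("float64").columns
ppl_eia_small_types[float_cols] = ppl_eia_small_types[float_cols].astype("float32")
```
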
bendnorman commented 2 years ago

That is good news. Maybe we should start by converting objects to strings and dropping extra columns. Hopefully that is enough to get the tests running on the GitHub runners.

katie-lamb commented 2 years ago

I'm picking up this issue and think I'll tackle it in this order, referring to Zane's original list of avenues to explore for managing dependencies between outputs and caching effectively:

After generating the PPL from the refactored code and updating the column data types, the PPL itself is now only about 1 GB, but generating it still uses a little over 7 GB. I think in the long run persisting dataframes to disk with Dagster is the way to go. To make the CI pass in the short term, I think dropping columns from the intermediate input dataframes should work.

Memory usage of the big hitter cached dataframes from the PPL process:

(Screenshot, 2022-07-22: memory usage of the cached dataframes from the PPL process; image not reproduced here.)
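
For anyone reproducing this kind of profiling, a helper along these lines tallies the dataframes cached as attributes on an output object (the attribute-scanning approach is illustrative; PudlTabl and the RMI output object cache things in their own way):

```python
import pandas as pd

def cached_df_memory_mb(obj) -> pd.Series:
    """Report the deep memory usage (MB) of every dataframe attribute on obj."""
    sizes = {
        name: df.memory_usage(deep=True).sum() / 2**20
        for name, df in vars(obj).items()
        if isinstance(df, pd.DataFrame)
    }
    return pd.Series(sizes, dtype="float64").sort_values(ascending=False)
```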

Next steps would be:

  • Use Dagster to refactor the RMI output object, persist dataframes to disk using their system, and manage the dependencies between the different dataframes using the Dagster graph abstraction.
  • Switch to using database views for simple output tables, allowing that data to remain in the DB unless we're actually using it -- it's the pandas merges that make pulling that data super slow. Doing the same joins in SQL is almost instantaneous.

Running Questions: