catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Reduce memory usage and manage RMI/PPL data dependencies #1457

Closed: zaneselvans closed this issue 1 year ago

zaneselvans commented 2 years ago

In the course of setting up continuous integration in the rmi-ferc1-eia repository, we discovered that the current plant part list generation process uses a huge amount of memory -- something like 12 GB between the PPL itself, the MCOE, and the dataframes that feed into creating the PPL. This is bad in a variety of ways, and also prevents the CI from actually running since the GitHub runners only have 7 GB of memory.

Our current homebrew caching and dependency management system, which we've been relying on in the PudlTabl class and now the RMI Output class, is pretty messy and complex, and it makes no pretense of caring about how much memory is being used.

Somehow we need to manage the dependencies between different outputs, and how they are cached, more effectively. Some options:

  • Delete the gens_mega_eia and true_grans_eia dataframes that feed into the generation of the plant_parts_eia dataframe after they've been used.
  • Drop columns that aren't being used from these big dataframes.
  • Convert object type columns to string where appropriate.
  • Use Dagster to refactor the RMI output object, persist dataframes to disk using their system, and manage the dependencies between the different dataframes using the Dagster graph abstraction.
  • Switch to using database views for simple output tables, allowing that data to remain in the DB unless we're actually using it -- it's the pandas merges that make pulling that data super slow. Doing the same joins in SQL is almost instantaneous.

bendnorman commented 2 years ago

I'm rerunning the tests and profiling the code to better understand where the memory is blowing up.

> Delete the gens_mega_eia and true_grans_eia dataframes that feed into the generation of the plant_parts_eia dataframe after they've been used.

> Drop columns that aren't being used from these big dataframes.

These ideas could work but do not seem sustainable if the outputs require more dataframes or columns.
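
As a minimal sketch of those two ideas, something like the following could be done in the PPL generation code; the assembly logic and the column list below are placeholders, not the actual rmi-ferc1-eia code:

```python
import gc

import pandas as pd


def trimmed_plant_parts(
    gens_mega_eia: pd.DataFrame,
    true_grans_eia: pd.DataFrame,
    keep_cols: list[str],
) -> pd.DataFrame:
    """Build the PPL, free its big inputs, and keep only the needed columns."""
    # Placeholder for the real PPL assembly logic in rmi-ferc1-eia.
    plant_parts_eia = pd.concat([gens_mega_eia, true_grans_eia], axis="columns")

    # Idea 1: drop the local references to the big inputs once the PPL exists.
    # (This only helps if the caller doesn't keep its own references to them.)
    del gens_mega_eia, true_grans_eia
    gc.collect()

    # Idea 2: keep only the columns downstream code actually uses.
    return plant_parts_eia[keep_cols]
```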

> Convert object type columns to string where appropriate.

I'm on a quest to eradicate all object column types so I'm in favor of this idea. What kind of memory savings will this give us? How much cleaning will this require?
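
To put rough numbers on the savings question, here is a toy comparison on made-up data with roughly the PPL's row count. The big win for low-cardinality columns comes from category; the default python-backed string dtype stores the same Python strings that object does, so its deep memory usage is similar:

```python
import pandas as pd

# Toy column: ~3M rows, 2 distinct values.
s = pd.Series(["owned", "total"] * 1_500_000, dtype="object")
for dtype in ["object", "string", "category"]:
    mb = s.astype(dtype).memory_usage(deep=True) / 2**20
    print(f"{dtype:>8}: {mb:,.1f} MB")
```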

> Use Dagster to refactor the RMI output object, persist dataframes to disk using their system, and manage the dependencies between the different dataframes using the Dagster graph abstraction.

Dagster provides a standard way to persist interim data frames to disk and hopefully save memory. It could also bring some clarity to the plant parts logic.

I don't think persisting dataframes to disk will solve our memory issues if we have one or two large dataframes hogging all of the memory. We might need to use dask for the processing.
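
For concreteness, a rough sketch of what the Dagster approach could look like with software-defined assets, where each asset's output is handed to an IO manager that persists it to disk. The asset bodies here are placeholders, not the actual refactor:

```python
import pandas as pd
from dagster import asset, materialize

# Each asset's return value gets handed to an IO manager (the default one
# pickles it to disk), so upstream dataframes don't have to be kept around
# in a cache attribute once downstream assets have been computed.

@asset
def gens_mega_eia() -> pd.DataFrame:
    """Placeholder: build the generators 'mega' table from PUDL outputs."""
    ...

@asset
def true_grans_eia(gens_mega_eia: pd.DataFrame) -> pd.DataFrame:
    """Placeholder: flag the true granularities."""
    ...

@asset
def plant_parts_eia(gens_mega_eia: pd.DataFrame, true_grans_eia: pd.DataFrame) -> pd.DataFrame:
    """Placeholder: assemble the plant part list from its inputs."""
    ...

if __name__ == "__main__":
    # Dagster infers the dependency graph from the asset function signatures.
    materialize([gens_mega_eia, true_grans_eia, plant_parts_eia])
```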

> Switch to using database views for simple output tables, allowing that data to remain in the DB unless we're actually using it -- it's the pandas merges that make pulling that data super slow. Doing the same joins in SQL is almost instantaneous.

This seems like a great solution if we are only doing joins between large datasets.
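
A sketch of the database-view idea against the PUDL SQLite DB; the view definition and the table/column names are illustrative rather than the real schema:

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect("pudl.sqlite")
# The join happens inside SQLite when the view is queried, not in pandas.
conn.execute(
    """
    CREATE VIEW IF NOT EXISTS generators_denorm AS
    SELECT gens.*, plants.plant_name_eia, plants.utility_id_eia
    FROM generators_eia860 AS gens
    JOIN plants_eia860 AS plants USING (plant_id_eia, report_date)
    """
)
# Only the rows and columns we actually ask for get pulled into a dataframe,
# and only at the moment we need them.
df = pd.read_sql(
    "SELECT * FROM generators_denorm WHERE report_date >= '2019-01-01'", conn
)
```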

zaneselvans commented 2 years ago

Given that the entire PUDL database is only like 500 MB on disk, it seems like in most cases we shouldn't need to have multiple multi-GB dataframes flowing around. I suspect that if we aren't inadvertently loading several copies of the same dataframe or keeping a bunch of columns we don't need for the work at hand, persisting them to disk along the way will be enough, at least until we've got the whole DAG running on our own infrastructure.

Looking at the saved PPL dataframe, it seems like a bunch of object columns that could be strings or categoricals make up a large majority of the memory usage, and I suspect none of the ints or floats need to be 64-bit.

```python
import pandas as pd

ppl_eia = pd.read_pickle("../../projects/rmi-ferc1-eia/outputs/plant_parts_eia.pkl.gz")
# Per-column memory usage (MB, counting string contents), dtype, and cardinality:
dude = pd.DataFrame({
    "memory_usage_mb": round(ppl_eia.memory_usage(deep=True) / 2**20, ndigits=2),
    "dtype": ppl_eia.dtypes,
    "nunique": ppl_eia.nunique(),
}).sort_values("memory_usage_mb", ascending=False)
dude
```
| column | memory_usage_mb | dtype | nunique |
|---|---:|---|---:|
| Index | 307.72 | | |
| plant_part_id_eia | 292.17 | object | 469178 |
| appro_record_id_eia | 278.05 | object | 1368717 |
| plant_name_new | 243.81 | string | 112474 |
| technology_description | 242.86 | object | 28 |
| plant_name_eia | 224.84 | string | 13901 |
| plant_part | 215.36 | object | 7 |
| plant_id_report_year | 204.93 | object | 167546 |
| operational_status_pudl | 204.32 | object | 4 |
| appro_part_label | 199.92 | object | 7 |
| ownership | 192.75 | object | 2 |
| ferc_acct_name | 191.59 | object | 4 |
| fuel_type_code_pudl | 188.39 | string | 9 |
| energy_source_code_1 | 184.38 | string | 39 |
| prime_mover_code | 178.82 | object | 19 |
| generator_id | 163.44 | string | 9489 |
| unit_id_pudl | 27.98 | Int64 | 10 |
| utility_id_eia | 27.98 | Int64 | 10457 |
| plant_id_pudl | 27.98 | Int64 | 13997 |
| plant_id_eia | 27.98 | Int64 | 14143 |
| utility_id_pudl | 27.98 | Int64 | 9430 |
| installation_year | 27.98 | Int64 | 127 |
| report_date | 24.87 | datetime64[ns] | 21 |
| report_year | 24.87 | int64 | 21 |
| capacity_eoy_mw | 24.87 | float64 | 9908 |
| capacity_factor | 24.87 | float64 | 227035 |
| total_mmbtu | 24.87 | float64 | 60893 |
| total_fuel_cost | 24.87 | float64 | 26336 |
| capacity_mw | 24.87 | float64 | 10752 |
| retirement_date | 24.87 | datetime64[ns] | 317 |
| planned_retirement_date | 24.87 | datetime64[ns] | 256 |
| record_count | 24.87 | float64 | 40 |
| fraction_owned | 24.87 | float64 | 3889 |
| fuel_cost_per_mmbtu | 24.87 | float64 | 8866 |
| fuel_cost_per_mwh | 24.87 | float64 | 16841 |
| heat_rate_mmbtu_mwh | 24.87 | float64 | 32386 |
| net_generation_mwh | 24.87 | float64 | 289650 |
| operational_status | 3.11 | category | 3 |
| true_gran | 3.11 | bool | 2 |
| ownership_dupe | 3.11 | bool | 2 |

zaneselvans commented 2 years ago

Turning the object columns into string or category columns, and converting some of the low-cardinality string columns to category, cuts the total memory usage down to about 2.2 GB:

```python
# record_id_eia is the index, so reset_index() first so its dtype can be set too.
ppl_eia_small_types = ppl_eia.reset_index().astype({
    "record_id_eia": "string",
    "plant_part_id_eia": "string",
    "appro_record_id_eia": "string",
    "technology_description": "category",
    "energy_source_code_1": "category",
    "plant_part": "category",
    "plant_id_report_year": "string",
    "operational_status_pudl": "category",
    "appro_part_label": "category",
    "ownership": "category",
    "ferc_acct_name": "category",
    "fuel_type_code_pudl": "category",
    "prime_mover_code": "category",
})
```
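
Building on the snippet above, a quick way to sanity-check the before/after footprint, plus float downcasting in the spirit of the "nothing needs to be 64-bit" observation (a sketch, not what was actually run here):

```python
def total_mb(df):
    """Total deep memory usage of a dataframe, in MB."""
    return df.memory_usage(deep=True).sum() / 2**20

print(f"before: {total_mb(ppl_eia):,.0f} MB  after: {total_mb(ppl_eia_small_types):,.0f} MB")

# Going further: downcast float64 columns to float32 where the precision loss
# is acceptable for these analysis outputs.
float_cols = ppl_eia_small_types.select_dtypes("float64").columns
ppl_eia_small_types[float_cols] = ppl_eia_small_types[float_cols].astype("float32")
```
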
bendnorman commented 2 years ago

That is good news. Maybe we should start by converting objects to strings and dropping extra columns. Hopefully that is enough to get the tests running on the GitHub runners.

katie-lamb commented 2 years ago

I'm picking up this issue and think I'll tackle it in this order, referring to Zane's original list of avenues to explore for managing dependencies between outputs and caching effectively:

After generating the PPL from the refactored code and updating the column data types, the PPL itself is now only about 1 GB, but generating it still uses a little over 7 GB. I think in the long run persisting dataframes to disk with Dagster is the way to go. To make the CI pass in the short term, I think dropping columns from the intermediate input dataframes should work.

Memory usage of the big hitter cached dataframes from the PPL process:

(Screenshot, 2022-07-22: memory usage of the cached dataframes from the PPL process; image not reproduced here.)
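
For anyone reproducing this kind of profiling, a helper along these lines tallies the dataframes cached as attributes on an output object (the attribute-scanning approach is illustrative; PudlTabl and the RMI output object cache things in their own way):

```python
import pandas as pd

def cached_df_memory_mb(obj) -> pd.Series:
    """Report the deep memory usage (MB) of every dataframe attribute on obj."""
    sizes = {
        name: df.memory_usage(deep=True).sum() / 2**20
        for name, df in vars(obj).items()
        if isinstance(df, pd.DataFrame)
    }
    return pd.Series(sizes, dtype="float64").sort_values(ascending=False)
```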

Next steps would be:

  • Use Dagster to refactor the RMI output object, persist dataframes to disk using their system, and manage the dependencies between the different dataframes using the Dagster graph abstraction.
  • Switch to using database views for simple output tables, allowing that data to remain in the DB unless we're actually using it -- it's the pandas merges that make pulling that data super slow. Doing the same joins in SQL is almost instantaneous.

Running Questions: