Closed: zaneselvans closed this issue 1 year ago
I'm rerunning the tests and profiling the code to better understand where the memory is blowing up.
> Delete the `gens_mega_eia` and `true_grans_eia` dataframes that feed into the generation of the `plant_parts_eia` dataframe after they've been used. Drop columns that aren't being used from these big dataframes.

These ideas could work but do not seem sustainable if the outputs require more dataframes or columns.
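A minimal sketch of what "delete after use" could look like, with toy stand-ins for the real dataframes:

```python
import gc

import numpy as np
import pandas as pd

# Toy stand-ins for gens_mega_eia / true_grans_eia; the real frames are much larger.
gens_mega_eia = pd.DataFrame({"plant_id_eia": np.arange(1_000_000), "capacity_mw": 1.0})
true_grans_eia = pd.DataFrame({"plant_id_eia": np.arange(1_000_000), "true_gran": True})

# Use the inputs, then drop the only references to them so the memory can be reclaimed.
plant_parts_eia = gens_mega_eia.merge(true_grans_eia, on="plant_id_eia")
del gens_mega_eia, true_grans_eia
gc.collect()  # nudge Python to release the freed blocks sooner
```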
> Convert `object` type columns to `string` where appropriate.

I'm on a quest to eradicate all `object` column types, so I'm in favor of this idea. What kind of memory savings will this give us? How much cleaning will this require?
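One way to answer the memory-savings question empirically is to compare the same column under the different dtypes. A sketch with a toy low-cardinality column (the real PPL columns should behave similarly):

```python
import pandas as pd

# Toy low-cardinality column, roughly the shape of plant_part / ownership etc.
s = pd.Series(["plant", "plant_unit", "plant_gen"] * 100_000)

for dtype in ["object", "string", "category"]:
    mb = s.astype(dtype).memory_usage(deep=True) / 2**20
    print(f"{dtype:>8}: {mb:.1f} MB")
```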
> Use Dagster to refactor the RMI output object, persist dataframes to disk using their system, and manage the dependencies between the different dataframes using the Dagster `graph` abstraction.
Dagster provides a standard way to persist interim data frames to disk and hopefully save memory. It could also bring some clarity to the plant parts logic.
I don't think persisting dataframes to disk will solve our memory issues if we have one or two large dataframes hogging all of the memory. We might need to use dask for the processing.
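If we do end up needing dask, a rough sketch of the shape it might take, assuming the intermediate tables were persisted as parquet (the file path and column choices are illustrative):

```python
import dask.dataframe as dd

# Hypothetical: the big intermediate table persisted as parquet instead of held in memory.
gens = dd.read_parquet("gens_mega_eia.parquet")  # lazy, read partition by partition

capacity = gens.groupby(["plant_id_eia", "report_year"])["capacity_mw"].sum()
result = capacity.compute()  # only now do the partitions stream through memory
```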
> Switch to using database views for simple output tables, allowing that data to remain in the DB unless we're actually using it -- it's the pandas merges that make pulling that data super slow. Doing the same joins in SQL is almost instantaneous.
This seems like a great solution if we are only doing joins between large datasets.
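A hedged sketch of the database-view idea against the PUDL SQLite file; the table and column names here are illustrative rather than a confirmed schema:

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect("pudl.sqlite")  # adjust the path to the real PUDL DB

# The view keeps the join definition in the database; nothing is materialized
# until we actually select from it.
conn.execute("""
    CREATE VIEW IF NOT EXISTS gens_with_plants AS
    SELECT g.*, p.plant_name_eia
    FROM generators_eia860 AS g
    JOIN plants_eia860 AS p USING (plant_id_eia)
""")

# Only the rows and columns we ask for ever leave the database.
df = pd.read_sql_query(
    "SELECT plant_id_eia, generator_id, plant_name_eia FROM gens_with_plants",
    conn,
)
```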
Given that the entire PUDL database is only about 500 MB on disk, it seems like in most cases we shouldn't need multiple multi-GB dataframes flowing around. I suspect that if we avoid inadvertently loading several copies of the same dataframe or keeping a bunch of columns we don't need for the work at hand, persisting dataframes to disk along the way will be enough, at least until we've got the whole DAG running on our own infrastructure.
Looking at the saved PPL dataframe, it seems like there are a bunch of `object` columns, which could be strings or categoricals, that make up the large majority of the memory usage, and I suspect none of the ints or floats need to be 64-bit.
```python
import pandas as pd

ppl_eia = pd.read_pickle("../../projects/rmi-ferc1-eia/outputs/plant_parts_eia.pkl.gz")
dude = pd.DataFrame({
    "memory_usage_mb": round(ppl_eia.memory_usage(deep=True) / 2**20, ndigits=2),
    "dtype": ppl_eia.dtypes,
    "nunique": ppl_eia.nunique(),
}).sort_values("memory_usage_mb", ascending=False)
dude
```
| column | memory_usage_mb | dtype | nunique |
|---|---|---|---|
| Index | 307.72 | nan | nan |
| plant_part_id_eia | 292.17 | object | 469178 |
| appro_record_id_eia | 278.05 | object | 1368717 |
| plant_name_new | 243.81 | string | 112474 |
| technology_description | 242.86 | object | 28 |
| plant_name_eia | 224.84 | string | 13901 |
| plant_part | 215.36 | object | 7 |
| plant_id_report_year | 204.93 | object | 167546 |
| operational_status_pudl | 204.32 | object | 4 |
| appro_part_label | 199.92 | object | 7 |
| ownership | 192.75 | object | 2 |
| ferc_acct_name | 191.59 | object | 4 |
| fuel_type_code_pudl | 188.39 | string | 9 |
| energy_source_code_1 | 184.38 | string | 39 |
| prime_mover_code | 178.82 | object | 19 |
| generator_id | 163.44 | string | 9489 |
| unit_id_pudl | 27.98 | Int64 | 10 |
| utility_id_eia | 27.98 | Int64 | 10457 |
| plant_id_pudl | 27.98 | Int64 | 13997 |
| plant_id_eia | 27.98 | Int64 | 14143 |
| utility_id_pudl | 27.98 | Int64 | 9430 |
| installation_year | 27.98 | Int64 | 127 |
| report_date | 24.87 | datetime64[ns] | 21 |
| report_year | 24.87 | int64 | 21 |
| capacity_eoy_mw | 24.87 | float64 | 9908 |
| capacity_factor | 24.87 | float64 | 227035 |
| total_mmbtu | 24.87 | float64 | 60893 |
| total_fuel_cost | 24.87 | float64 | 26336 |
| capacity_mw | 24.87 | float64 | 10752 |
| retirement_date | 24.87 | datetime64[ns] | 317 |
| planned_retirement_date | 24.87 | datetime64[ns] | 256 |
| record_count | 24.87 | float64 | 40 |
| fraction_owned | 24.87 | float64 | 3889 |
| fuel_cost_per_mmbtu | 24.87 | float64 | 8866 |
| fuel_cost_per_mwh | 24.87 | float64 | 16841 |
| heat_rate_mmbtu_mwh | 24.87 | float64 | 32386 |
| net_generation_mwh | 24.87 | float64 | 289650 |
| operational_status | 3.11 | category | 3 |
| true_gran | 3.11 | bool | 2 |
| ownership_dupe | 3.11 | bool | 2 |
Turning the `object` columns into `string` or `category` columns, and converting some of the low-cardinality `string` columns to `category` as well, cuts the total memory usage down to about 2.2 GB:
```python
ppl_eia_small_types = ppl_eia.reset_index().astype({
    "record_id_eia": "string",
    "plant_part_id_eia": "string",
    "appro_record_id_eia": "string",
    "technology_description": "category",
    "energy_source_code_1": "category",
    "plant_part": "category",
    "plant_id_report_year": "string",
    "operational_status_pudl": "category",
    "appro_part_label": "category",
    "ownership": "category",
    "ferc_acct_name": "category",
    "fuel_type_code_pudl": "category",
    "prime_mover_code": "category",
})
```
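Following up on the note above that none of the ints or floats likely need 64 bits, a further sketch (column choices assumed from the memory table, continuing from the block above) that downcasts some of the numeric columns as well:

```python
# Assumed continuation of the astype() call above; column names come from the memory table.
ppl_eia_small_types = ppl_eia_small_types.astype({
    "report_year": "int32",
    "unit_id_pudl": "Int32",
    "utility_id_eia": "Int32",
    "plant_id_eia": "Int32",
    "plant_id_pudl": "Int32",
    "utility_id_pudl": "Int32",
    "installation_year": "Int32",
    "capacity_mw": "float32",
    "capacity_eoy_mw": "float32",
    "capacity_factor": "float32",
    "net_generation_mwh": "float32",
})
```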
That is good news. Maybe we start by converting `object` columns to `string` and dropping extra columns. Hopefully that is enough to get the tests running on GitHub runners.
I'm picking up this issue and think I'll tackle it in this order, referring to Zane's original list of avenues to explore for managing dependencies between outputs and caching more effectively:

1. Use the `dev` branch as the PUDL installation on the RMI branch so the `plant_parts_eia` creation uses the more efficient code that was reworked for the CCAI grant. This is more memory efficient. In the rework, `true_grans_eia` is no longer a thing and a bunch of columns got dropped with the new `date_merge` default parameter values. The `gens_mega_eia` output should still be deleted after use.
2. Convert `object` type columns to `string` where appropriate.
3. Drop columns that aren't being used from the big input dataframes. `gens_eia860` is a good candidate. Is it used for the creation of `plants_eia860`?
4. Delete dataframes cached in `pudl_out` after they've been used.
5. Change `frame.append` to `concat` (see the sketch after this list).
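A quick illustration of the `append` to `concat` change mentioned in the last item, with toy pieces standing in for the real intermediate frames:

```python
import pandas as pd

pieces = [
    pd.DataFrame({"plant_id_eia": [i], "capacity_mw": [float(i)]}) for i in range(3)
]

# Growing a frame with repeated .append() calls copies everything accumulated so far
# on every call (and .append() is deprecated); gather the pieces and concatenate once.
out = pd.concat(pieces, ignore_index=True)
```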
After generating the PPL from the refactored code and updating column data types, the PPL is now only 1 GB, but generating it still uses a little over 7 GB. I think in the long run persisting dataframes to disk with Dagster is the way to go. To make the CI pass in the short term, I think dropping columns from the intermediary input dataframes should work.
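A minimal sketch of what the Dagster direction might look like, assuming software-defined assets; the asset bodies here are placeholders, and the point is that the default IO manager pickles each asset to disk between steps instead of pinning everything in memory:

```python
import pandas as pd
from dagster import asset, materialize


@asset
def gens_mega_eia() -> pd.DataFrame:
    # Placeholder for the real gens_mega_eia construction.
    return pd.DataFrame({"plant_id_eia": [1, 1, 2], "capacity_mw": [10.0, 5.0, 20.0]})


@asset
def plant_parts_eia(gens_mega_eia: pd.DataFrame) -> pd.DataFrame:
    # Dagster loads the upstream asset from disk here rather than keeping it
    # cached in memory the way the PudlTabl / RMI output objects do.
    return gens_mega_eia.groupby("plant_id_eia", as_index=False)["capacity_mw"].sum()


if __name__ == "__main__":
    materialize([gens_mega_eia, plant_parts_eia])
```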
Memory usage of the big hitter cached dataframes from the PPL process:
Next steps would be:

- Use Dagster to refactor the RMI output object, persist dataframes to disk using their system, and manage the dependencies between the different dataframes using the Dagster `graph` abstraction.
- Switch to using database views for simple output tables, allowing that data to remain in the DB unless we're actually using it -- it's the pandas merges that make pulling that data super slow. Doing the same joins in SQL is almost instantaneous.
Running Questions:

- `record_id_eia` = `299_2020_plant_total_14354_retired` is not part of the plant parts list (at least when the PPL is being generated from `dev` and `object` columns are converted to strings). Is this okay? The PPL does have the records `299_2020_plant_total_14354` and `299_2020_plant_total_14354_proposed`.
- `connect_ferc1_to_eia.prep_train_connections`
- Does `pudl_out._dfs.clear()` actually free memory?
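One way to answer that last question empirically is to watch the process's resident memory around the `clear()` call. The sketch below uses a toy dict in place of `pudl_out._dfs` and assumes `psutil` is available; note the allocator may not hand freed pages straight back to the OS, so RSS is only a rough signal:

```python
import gc

import numpy as np
import pandas as pd
import psutil


def rss_mb() -> float:
    """Resident memory of the current process, in MB."""
    return psutil.Process().memory_info().rss / 2**20


# Toy stand-in for pudl_out._dfs: a dict of big cached dataframes.
_dfs = {"mcoe": pd.DataFrame(np.random.rand(2_000_000, 10))}

before = rss_mb()
_dfs.clear()   # drop the only references to the cached frames
gc.collect()
print(f"RSS before: {before:.0f} MB, after: {rss_mb():.0f} MB")
```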
In the course of setting up continuous integration in the `rmi-ferc1-eia` repository, we discovered that the current plant part list generation process uses a huge amount of memory -- something like 12 GB between the PPL itself, the MCOE, and the dataframes that feed into creating the PPL. This is bad in a variety of ways, and also prevents the CI from actually running since the GitHub runners only have 7 GB of memory.

Our current homebrew caching and dependency management system, which we've been relying on in the `PudlTabl` class and now the `RMIOutput` class, is pretty messy and complex and makes no pretense of caring about how much memory is being used.

Somehow we need to manage the dependencies between different outputs and how they are cached more effectively. Some options:

- Delete the `gens_mega_eia` and `true_grans_eia` dataframes that feed into the generation of the `plant_parts_eia` dataframe after they've been used.
- Drop columns that aren't being used from these big dataframes.
- Convert `object` type columns to `string` where appropriate.
- Use Dagster to refactor the RMI output object, persist dataframes to disk using their system, and manage the dependencies between the different dataframes using the Dagster `graph` abstraction.
- Switch to using database views for simple output tables, allowing that data to remain in the DB unless we're actually using it -- it's the pandas merges that make pulling that data super slow. Doing the same joins in SQL is almost instantaneous.