Closed by zaneselvans 1 year ago
Attempting to profile with `py-spy` on MacOS using pandas 2.1.1:

```sh
sudo -E py-spy top -s -- \
  /Users/zane/mambaforge/envs/pudl-dev/bin/python src/pudl/ferc_to_sqlite/cli.py \
  --clobber \
  --gcs-cache-path gs://zenodo-cache.catalyst.coop \
  src/pudl/package_data/settings/etl_full.yml
```
It seems like the process is hung up in `ferc_xbrl_extractor.xbrl._dedupe_newer_report_wins()`, and `apply()` is notoriously inefficient, so that might be the culprit. Not sure why it would have gotten so much worse between versions of pandas, though.
```
 0.00% 100.00%  0.160s    4311s  extract (ferc_xbrl_extractor/xbrl.py)
 0.00% 100.00%  0.000s    4295s  table_data_from_instances (ferc_xbrl_extractor/xbrl.py)
 0.00% 100.00%  0.120s    4290s  _dedupe_newer_report_wins (ferc_xbrl_extractor/xbrl.py)
 0.00% 100.00%  0.240s    4288s  apply (pandas/core/groupby/groupby.py)
 0.00% 100.00%  1.71s     4283s  apply_groupwise (pandas/core/groupby/ops.py)
 0.00% 100.00%  4.83s     4266s  <lambda> (ferc_xbrl_extractor/xbrl.py)
 0.00%  99.00%  3.88s     4064s  sort_values (pandas/core/frame.py)
 0.00%  99.00%  2.18s     4035s  copy (pandas/core/generic.py)
 0.00%  99.00%  3.58s     4029s  copy (pandas/core/internals/managers.py)
 0.00%  99.00% 12.06s     3965s  <listcomp> (pandas/core/internals/managers.py)
 0.00%  99.00%  4.36s     3945s  _view (pandas/core/indexes/base.py)
99.00%  99.00%  3944s     3945s  _simple_new (pandas/core/indexes/base.py)
 0.00%  99.00%  0.510s    3945s  copy_func (pandas/core/internals/managers.py)
 0.00%  99.00%  0.820s    3941s  view (pandas/core/indexes/base.py)
```
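The `apply()` overhead is easy to reproduce in isolation. A rough sketch on toy data (not the FERC tables): a Python lambda is invoked once per group, while a built-in aggregation is dispatched to compiled code over all groups at once.

```python
import numpy as np
import pandas as pd

# 1,000 groups of 5 rows each.
df = pd.DataFrame({"g": np.repeat(np.arange(1_000), 5), "x": np.arange(5_000)})

# .apply() drops back into Python once per group...
slow = df.groupby("g")["x"].apply(lambda s: s.max())

# ...while the built-in aggregation runs in compiled code.
fast = df.groupby("g")["x"].max()

assert slow.equals(fast)
```

On data this small both finish instantly, but the gap grows with the number of groups, which is why a groupby over every primary key in `_dedupe_newer_report_wins()` hurts so much.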
In `_dedupe_newer_report_wins()`, switching from:

```python
return (
    df.reset_index()
    .groupby(unique_cols)
    .apply(lambda df: df.sort_values("report_date").ffill().iloc[-1])
    .drop("report_date", axis="columns")
    .set_index(old_index)
)
```

to:

```python
return (
    df.reset_index()
    .sort_values("report_date")
    .groupby(unique_cols)
    .last()
    .drop("report_date", axis="columns")
    .reset_index()
    .set_index(old_index)
)
```
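The two versions agree because `groupby().last()` keeps the last non-null value per column within each group, which matches a per-group forward-fill followed by taking the final row. A toy sketch with hypothetical `key`/`val` columns (not the real FERC schema):

```python
import numpy as np
import pandas as pd

# Two filings for key "a": the newer one leaves val1 blank.
df = pd.DataFrame({
    "key": ["a", "a", "b"],
    "report_date": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-01-01"]),
    "val1": [1.0, np.nan, 3.0],
    "val2": [10.0, 20.0, 30.0],
})

# Old approach: per-group sort, forward-fill, keep the final row.
old = (
    df.groupby("key")
    .apply(lambda g: g.sort_values("report_date").ffill().iloc[-1])
    .drop(columns=["key", "report_date"], errors="ignore")
)

# New approach: one global sort, then last(), which keeps the last
# non-null value per column in each group.
new = (
    df.sort_values("report_date")
    .groupby("key")
    .last()
    .drop(columns="report_date")
)

# Both recover val1 from the older filing and val2 from the newer one.
pd.testing.assert_frame_equal(old.astype(float), new.astype(float))
```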
drops the time for `tox -re integration` significantly:

```
pandas[performance]>=1.5,<2.2
pandas[performance]>=1.5,<2.1
pandas>=1.5,<2.1
```
The `py-spy` profiling command above (extracting all of the FERC forms using `ferc_to_sqlite`) completed in 10 minutes on my laptop, whereas last night it took more than 7 hours.
I tested that the new function results in exactly the same output as the old function using the following code:
```python
import random
import string
from typing import Callable

import numpy as np
import pandas as pd


def make_df(n_rows: int) -> pd.DataFrame:
    start_date = "1976-01-30"
    end_date = "2023-09-23"
    dates = pd.date_range(start=start_date, end=end_date, freq="D")
    df = pd.DataFrame(
        data={
            "pk1": np.random.randint(0, 100, n_rows),
            "pk2": np.random.choice(list(string.ascii_uppercase[:10]), size=n_rows),
            "pk3": np.random.choice(list(string.ascii_lowercase[:10]), size=n_rows),
            "idx1": np.random.randint(0, 10, n_rows),
            "idx2": np.random.randint(0, 10, n_rows),
            "report_date": np.random.choice(dates, n_rows),
            "data1": np.random.rand(n_rows),
            "data2": np.random.rand(n_rows),
            "data3": np.random.rand(n_rows),
            "data4": np.random.rand(n_rows),
        }
    ).set_index(["pk1", "pk2", "pk3", "idx1", "idx2"])
    df["data1"] = df["data1"].astype(str)
    data_cols = df.filter(like="data").columns
    print(f"{data_cols=}")
    df.loc[:, data_cols] = df[data_cols].mask(
        np.random.random((len(df), len(data_cols))) < 0.5
    )
    return df


def old_dedupe(df: pd.DataFrame, pk: list[str]) -> pd.DataFrame:
    old_index = df.index.names
    return (
        df.reset_index()
        .groupby(pk)
        .apply(lambda df: df.sort_values("report_date").ffill().iloc[-1])
        .drop("report_date", axis="columns")
        .set_index(old_index)
    )


def new_dedupe(df: pd.DataFrame, pk: list[str]) -> pd.DataFrame:
    old_index = df.index.names
    return (
        df.reset_index()
        .sort_values("report_date")
        .groupby(pk)
        .last()
        .drop("report_date", axis="columns")
        .reset_index()
        .set_index(old_index)
    )


def compile_filings(
    dfs: list[pd.DataFrame],
    pk: list[str],
    dedupe_func: Callable,
) -> pd.DataFrame:
    return pd.concat(dfs).pipe(dedupe_func, pk)


n_dfs = 10
n_rows = 10000
dfs = [make_df(n_rows) for x in range(n_dfs)]
pk = [col for col in dfs[0].index.names if col.startswith("pk")]

%timeit old_out = compile_filings(dfs=dfs, pk=pk, dedupe_func=old_dedupe)
old_out = compile_filings(dfs=dfs, pk=pk, dedupe_func=old_dedupe)
%timeit new_out = compile_filings(dfs=dfs, pk=pk, dedupe_func=new_dedupe)
new_out = compile_filings(dfs=dfs, pk=pk, dedupe_func=new_dedupe)
pd.testing.assert_frame_equal(new_out, old_out)
```
Originally I used only data columns of type `float`, and the old and new functions behaved identically. However, if I cast the floats to strings first (which is apparently how the data comes out of XBRL), there's a subtle difference between the two methods: the old one produces nulls that are `np.nan`, while the new one produces `None`, which seems to affect the inferred datatype when the data is later written into or read out of SQLite. As a result I ended up needing to use `pd.to_numeric()` on a few string columns in the FERC 1 ETL, which I addressed in https://github.com/catalyst-cooperative/pudl/pull/2854.
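The workaround boils down to coercing those string columns back to numbers: `pd.to_numeric()` treats `None` and `np.nan` the same way, so the difference washes out. A minimal sketch with a hypothetical column, not the actual PR code:

```python
import numpy as np
import pandas as pd

# A string column as it comes out of XBRL, with a missing value that could
# surface as either None or np.nan depending on the dedupe path taken.
s = pd.Series(["1.5", None, "2.5"], dtype="object")

# pd.to_numeric() parses the strings and normalizes the null to NaN,
# yielding a proper float64 column before it hits SQLite.
out = pd.to_numeric(s)
```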
Running the XBRL extractions using pandas 2.1.1 is much slower than 2.0.3. `tox -e integration` takes about 300 seconds with pandas 2.0.3, but 677 seconds with pandas 2.1.1. The difference in runtime for pandas 2.1.1 seems to increase more than linearly with the size of the data being extracted: the full extraction of all XBRL data takes 13 minutes using pandas 2.0.3 and more than 7 hours with pandas 2.1.1. See this comment thread.
| Running | Takes |
| --- | --- |
| `pandas>=1.5,<2.2` | |
| `pandas[performance]>=1.5,<2.2` | |
| `pandas>=1.5,<2.1` | |
I tried changing the string inference default to use PyArrow, as suggested in the pandas 2.1 release notes, in all of the modules that `import pandas as pd`, in case the problem was slow `object`-dtype treatment of strings, but it had no impact on overall runtime. Not sure if that's because it didn't actually have an effect on the treatment of strings, or if it did and that wasn't the problem.

There's one recent open performance regression issue for pandas 2.1: https://github.com/pandas-dev/pandas/issues/55256