catalyst-cooperative / ferc-xbrl-extractor

A tool for converting FERC filings published in XBRL into SQLite databases
MIT License

XBRL extraction is much slower with pandas 2.1.1 than pandas 2.0.3 #143

Closed: zaneselvans closed this issue 1 year ago

zaneselvans commented 1 year ago

Running the XBRL extractions with pandas 2.1.1 is much slower than with pandas 2.0.3: tox -e integration takes about 300 seconds with pandas 2.0.3, but 677 seconds with pandas 2.1.1.

The difference in runtime for pandas 2.1.1 seems to increase more than linearly with the size of the data being extracted, with the full extraction of all XBRL data taking 13 minutes using pandas 2.0.3 and more than 7 hours with pandas 2.1.1. See this comment thread.


I tried enabling the PyArrow-backed string inference default, as suggested in the pandas 2.1 release notes, in all of the modules that import pandas as pd, in case the problem was slow object-dtype handling of strings, but it had no impact on overall runtime. I'm not sure whether that's because it didn't actually change how the strings were treated, or whether it did and strings weren't the problem:

pd.options.future.infer_string = True
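
As a sanity check (not part of the extractor; assumes pandas >= 2.1 with pyarrow installed), the effect of the option can be verified by looking at the dtype that plain string data gets after enabling it:

    import pandas as pd

    pd.options.future.infer_string = True
    s = pd.Series(["a", "b", "c"])
    # With the option enabled this should report a PyArrow-backed string dtype
    # rather than plain "object"; if it still reports "object", the option was
    # set too late or pyarrow isn't available.
    print(s.dtype)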

There's one recent open performance regression issue for pandas 2.1: https://github.com/pandas-dev/pandas/issues/55256

zaneselvans commented 1 year ago

Attempting to profile with py-spy on macOS using pandas 2.1.1:

sudo -E py-spy top -s -- \
    /Users/zane/mambaforge/envs/pudl-dev/bin/python src/pudl/ferc_to_sqlite/cli.py \
    --clobber \
    --gcs-cache-path gs://zenodo-cache.catalyst.coop \
    src/pudl/package_data/settings/etl_full.yml

It seems like the process is hung up in ferc_xbrl_extractor.xbrl._dedupe_newer_report_wins(), and apply() is notoriously inefficient, so that might be the culprit. I'm not sure why it would have gotten so much worse between pandas versions, though.

  %Own   %Total    OwnTime   TotalTime   Function (file)
  0.00% 100.00%   0.160s     4311s   extract (ferc_xbrl_extractor/xbrl.py)
  0.00% 100.00%   0.000s     4295s   table_data_from_instances (ferc_xbrl_extractor/xbrl.py)
  0.00% 100.00%   0.120s     4290s   _dedupe_newer_report_wins (ferc_xbrl_extractor/xbrl.py)
  0.00% 100.00%   0.240s     4288s   apply (pandas/core/groupby/groupby.py)
  0.00% 100.00%    1.71s     4283s   apply_groupwise (pandas/core/groupby/ops.py)
  0.00% 100.00%    4.83s     4266s   <lambda> (ferc_xbrl_extractor/xbrl.py)
  0.00%  99.00%    3.88s     4064s   sort_values (pandas/core/frame.py)
  0.00%  99.00%    2.18s     4035s   copy (pandas/core/generic.py)
  0.00%  99.00%    3.58s     4029s   copy (pandas/core/internals/managers.py)
  0.00%  99.00%   12.06s     3965s   <listcomp> (pandas/core/internals/managers.py)
  0.00%  99.00%    4.36s     3945s   _view (pandas/core/indexes/base.py)
 99.00%  99.00%    3944s     3945s   _simple_new (pandas/core/indexes/base.py)
  0.00%  99.00%   0.510s     3945s   copy_func (pandas/core/internals/managers.py)
  0.00%  99.00%   0.820s     3941s   view (pandas/core/indexes/base.py)
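
For context, here is a minimal synthetic sketch (invented data, not the real FERC tables) of the pattern the profile points at: groupby().apply() calls the lambda once per group in Python, and each per-group sort_values() builds a new frame and index, which is plausibly why Index._simple_new dominates the runtime above:

    import numpy as np
    import pandas as pd

    n = 500_000
    df = pd.DataFrame({
        "key": np.random.randint(0, 50_000, size=n),          # many small groups
        "report_date": np.random.randint(0, 1_000, size=n),   # stand-in for dates
        "value": np.random.rand(n),
    })

    # One Python-level call, one sort, and one frame/index copy per group:
    per_group = df.groupby("key").apply(
        lambda g: g.sort_values("report_date").iloc[-1]
    )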

zaneselvans commented 1 year ago

In _dedupe_newer_report_wins(), switching from:

    return (
        df.reset_index()
        .groupby(unique_cols)
        .apply(lambda df: df.sort_values("report_date").ffill().iloc[-1])
        .drop("report_date", axis="columns")
        .set_index(old_index)
    )

to:


    return (
        df.reset_index()
        .sort_values("report_date")
        .groupby(unique_cols)
        .last()
        .drop("report_date", axis="columns")
        .reset_index()
        .set_index(old_index)
    )

drops the time for tox -re integration significantly.

The py-spy profiling command above (extracting all of the FERC forms using ferc_to_sqlite) completed in 10 minutes on my laptop, whereas last night it took more than 7 hours. The speedup presumably comes from replacing the per-group Python lambda, which did a sort_values() and a frame copy for every group, with a single global sort followed by a vectorized groupby().last().

zaneselvans commented 1 year ago

I tested that the new function results in exactly the same output as the old function using the following code:

import string
from typing import Callable

import numpy as np
import pandas as pd

def make_df(n_rows: int) -> pd.DataFrame:
    start_date = "1976-01-30"
    end_date = "2023-09-23"
    dates = pd.date_range(start=start_date, end=end_date, freq='D')

    df = pd.DataFrame(
        data={
            "pk1": np.random.randint(0, 100, n_rows),
            "pk2": np.random.choice(list(string.ascii_uppercase[:10]), size=n_rows),
            "pk3": np.random.choice(list(string.ascii_lowercase[:10]), size=n_rows),
            "idx1": np.random.randint(0, 10, n_rows),
            "idx2": np.random.randint(0, 10, n_rows),
            "report_date": np.random.choice(dates, n_rows),
            "data1": np.random.rand(n_rows),
            "data2": np.random.rand(n_rows),
            "data3": np.random.rand(n_rows),
            "data4": np.random.rand(n_rows),
        }
    ).set_index(["pk1", "pk2", "pk3", "idx1", "idx2"])
    df["data1"] = df["data1"].astype(str)

    data_cols = df.filter(like="data").columns
    print(f"{data_cols=}")
    df.loc[:, data_cols] = df[data_cols].mask(np.random.random((len(df), len(data_cols))) < 0.5)
    return df

def old_dedupe(df: pd.DataFrame, pk: list[str]) -> pd.DataFrame:
    old_index = df.index.names
    return (
        df.reset_index()
        .groupby(pk)
        .apply(lambda df: df.sort_values("report_date").ffill().iloc[-1])
        .drop("report_date", axis="columns")
        .set_index(old_index)
    )

def new_dedupe(df: pd.DataFrame, pk: list[str]) -> pd.DataFrame:
    old_index = df.index.names
    return (
        df.reset_index()
        .sort_values("report_date")
        .groupby(pk).last()
        .drop("report_date", axis="columns")
        .reset_index()
        .set_index(old_index)
    )

def compile_filings(
    dfs: list[pd.DataFrame],
    pk: list[str],
    dedupe_func: Callable
) -> pd.DataFrame:
    return pd.concat(dfs).pipe(dedupe_func, pk)

n_dfs = 10
n_rows = 10000

dfs = [make_df(n_rows) for x in range(n_dfs)]

pk = [col for col in dfs[0].index.names if col.startswith("pk")]

%timeit old_out = compile_filings(dfs=dfs, pk=pk, dedupe_func=old_dedupe)
old_out = compile_filings(dfs=dfs, pk=pk, dedupe_func=old_dedupe)

%timeit new_out = compile_filings(dfs=dfs, pk=pk, dedupe_func=new_dedupe)
new_out = compile_filings(dfs=dfs, pk=pk, dedupe_func=new_dedupe)

pd.testing.assert_frame_equal(new_out, old_out)

Originally I used only data columns of type float, and the old and new functions behaved identically. However, if I cast the floats to strings first (which is apparently how the data comes out of XBRL), there's a subtle difference between the two methods: the old one produces nulls that are np.nan, while the new one produces None, which seems to affect the inferred datatype when the data is later written into or read out of SQLite.

As a result I ended up needing to use pd.to_numeric() on a few string columns in the FERC 1 ETL, which I addressed in https://github.com/catalyst-cooperative/pudl/pull/2854.
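
For illustration, here is a hedged sketch (not the actual PUDL code) of the kind of cleanup pd.to_numeric() provides: numeric strings in an object column are coerced to floats, and both null flavors collapse to NaN, so the dtype seen by SQLite is consistent either way:

    import numpy as np
    import pandas as pd

    # Numeric values parsed from XBRL arrive as strings; depending on the
    # dedupe path, missing entries may show up as np.nan or as None.
    col = pd.Series(["1.5", None, "2.75", np.nan], dtype="object")

    # Both null flavors become NaN in a float64 column.
    numeric = pd.to_numeric(col, errors="coerce")
    print(numeric.dtype)  # float64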