Closed jvlmdr closed 4 years ago
I noticed that iteratively selecting rows from the dataframe was a serious bottleneck.
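For illustration, here is a minimal sketch of the kind of change involved. The actual column names and extraction code from this PR are not shown here, so the names below (`label`, `extract_counts_*`) are hypothetical:

```python
import pandas as pd

# Hypothetical slow version: iterating row-by-row pays pandas' per-access
# overhead (.iloc) on every element, which dominates for large frames.
def extract_counts_slow(df):
    counts = {}
    for i in range(len(df)):
        key = df.iloc[i]["label"]
        counts[key] = counts.get(key, 0) + 1
    return counts

# Vectorized equivalent: push the loop down into pandas/NumPy.
def extract_counts_fast(df):
    return df["label"].value_counts().to_dict()

df = pd.DataFrame({"label": ["a", "b", "a", "c", "a"]})
assert extract_counts_slow(df) == extract_counts_fast(df)
```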
It looks like someone was already investigating this. I removed the use of the cached analysis and the lines that computed timings.
I isolated the code for extracting counts and added a benchmark (and a dependency on `pytest-benchmark`).
Before:
```
----------------------------------------------------------- benchmark: 1 tests -----------------------------------------------------------
Name (time in s)                                 Min      Max     Mean  StdDev   Median     IQR  Outliers     OPS  Rounds  Iterations
------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_extract_counts_from_df_map    15.4156  16.1166  15.6762  0.3507  15.4331  0.6114       1;0  0.0638       5           1
------------------------------------------------------------------------------------------------------------------------------------------
```
After (time in ms, not s):
```
----------------------------------------------------------- benchmark: 1 tests -----------------------------------------------------------
Name (time in ms)                                Min       Max      Mean   StdDev    Median      IQR  Outliers     OPS  Rounds  Iterations
------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_extract_counts_from_df_map   146.5993  209.5080  175.8946  22.3510  174.9131  17.7610       2;0  5.6852       5           1
------------------------------------------------------------------------------------------------------------------------------------------
```
Merged the non-rebased one; I guess this is obsolete then?
Yep! Thanks