alejandrohagan / 1br

1 Billion Row challenge with R

Possible error in setup #5

Closed eddelbuettel closed 8 months ago

eddelbuettel commented 9 months ago

Thanks for picking up the task and trying this. As the outcome 'does not pass the smell test' I had a really brief look in which I

  1. Generated a smaller data set with your helper script: Rscript generate_data.R 1e6
  2. Created a script 'allin.R' (below) to compare data.table (an outlier in your results), collapse and dplyr.

My results are closer to what is expected:

edd@rob:~/git/1br(feature/simplication)$ Rscript allin.R
Unit: milliseconds
      expr     min      lq    mean  median      uq      max neval cld
 datatable 11.7441 14.2740 16.8943 15.4098 17.9822  47.4669   100  a 
  collapse 16.9737 18.9491 23.4794 20.6923 23.0194  84.8558   100  ab
     dplyr 18.8856 20.9170 26.8166 22.3052 24.4867 369.0298   100   b
edd@rob:~/git/1br(feature/simplication)$

We can also plot this thanks to ggplot2::autoplot():

(plot omitted)

I am also not sure if going via tidypolars is fair to rpolars. It would be better to do this directly, and maybe add tidypolars as an alternative.

The (very short) script follows.

D <- data.table::fread("measurements.csv")

res <- microbenchmark::microbenchmark(datatable = D[ ,.(mean=mean(measurement),min=min(measurement),max=max(measurement)),by=state],
                                      collapse =  D |> collapse::fgroup_by(state) |> collapse::fsummarise(min=min(measurement), max=max(measurement), mean=mean(measurement)) |> collapse::fungroup(),
                                      dplyr = D |> dplyr::group_by(state) |> dplyr::summarise(min=min(measurement), max=max(measurement), mean=mean(measurement)) |> dplyr::ungroup())

print(res)
ggplot2::autoplot(res)
eddelbuettel commented 9 months ago

For kicks, times=10 when using 1e8 rows:

(plot omitted)

Unit: seconds
      expr     min      lq    mean  median      uq     max neval cld
 datatable 1.48158 1.55624 1.66194 1.61175 1.86034 1.94413    10 a  
  collapse 4.01718 4.41219 4.62006 4.71179 4.82578 5.14549    10  b 
     dplyr 2.47618 2.64897 2.83732 2.76756 2.98372 3.48487    10   c
> 
alejandrohagan commented 8 months ago

hi! First thanks for your comment!

I hear all the comments on the data.table result and I am surprised myself.

I agree: when I run a smaller dataset (e.g. 1e6) I get results similar to yours, but for some reason when I run it with more than 500M rows the data.table performance drops.

However others who have tried the 1e9 dataset have not been able to reproduce my findings with specific regard to data.table.

I need to understand why I am getting my data.table results -- I tried to create a clean environment for each test with system2() but it still isn't working well.

I'm going to update my post on twitter / readme to clarify the data.table results are in dispute as I don't want to spread misinformation.

I'll update the polars with the code you suggested and I'll keep working on the data.table angle of this.

eddelbuettel commented 8 months ago

I saw the tweet to you by @vincentarelbundock, and I also had a chance to hop onto a larger machine to validate my more traditional and standard use of benchmarking (ie directly write the expressions to measure in the call, let the benchmarking package and function do the housekeeping) -- which complements Vincent's re-run of your script.

In this large case of 1e9 rows, the difference between data.table and dplyr narrows, but both gain on collapse. So in short I am with Vincent here, and reiterate my concern from the top of this issue ticket. You likely want to build this up more carefully.

Code

res <- microbenchmark::microbenchmark(datatable = D[ ,.(mean=mean(measurement),min=min(measurement),max=max(measurement)),by=state],
                                      collapse =  D |> collapse::fgroup_by(state) |> collapse::fsummarise(min=min(measurement), max=max(measurement), mean=mean(measurement)) |> collapse::fungroup(),
                                      dplyr = D |> dplyr::group_by(state) |> dplyr::summarise(min=min(measurement), max=max(measurement), mean=mean(measurement)) |> dplyr::ungroup(),
                                      times = 10)

print(res)
ggplot2::autoplot(res)

Results Table

> res <- microbenchmark::microbenchmark(datatable = D[ ,.(mean=mean(measurement),min=min(measurement),max=max(measurement)),by=state],
+                                       collapse =  D |> collapse::fgroup_by(state) |> collapse::fsummarise(min=min(measurement), max=max(measurement), mean=mean(measurement)) |> collapse::fungroup(),
+                                       dplyr = D |> dplyr::group_by(state) |> dplyr::summarise(min=min(measurement), max=max(measurement), mean=mean(measurement)) |> dplyr::ungroup(),
+                                       times = 10)
> res
Unit: seconds
      expr      min       lq     mean   median       uq      max neval
 datatable 26.47838 28.13759 30.19073 29.81849 31.90281 35.00091    10
  collapse 50.19573 50.93365 53.09633 52.23915 56.23458 57.41016    10
     dplyr 26.84754 27.93581 30.34774 30.00457 30.68157 35.96263    10
> 

Results Plot

(plot omitted)

alejandrohagan commented 8 months ago

Hi! thanks for taking the time to run the analysis and I really value yours and @vincentarelbundock feedback -- I am working on how to better run the analysis to avoid set up issues.

To clarify, when looking at your results, your DT and dplyr results are actually consistent with mine (eg. both are around 30 seconds).

For your collapse result -- if you run the analysis with fmean, fmax, and fmin rather than the regular mean, min, and max, do you still see the same outcomes?

If you update with the collapse versions of the aggregation functions, I suspect you may get similar metrics to what I originally posted -- are you able to confirm?

In my latest sourcing, I converted each benchmark to a bash file and ran it separately with system2(), because when running it directly as a call in microbenchmark I was getting feedback that each simulation may be altering the current session, which I wanted to avoid.
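
For illustration, a minimal sketch of that per-script approach (the script names are hypothetical placeholders): each candidate lives in its own file and is run in a fresh R session via Rscript, timed from the parent session.

# Minimal sketch of the per-script approach described above.
# The script names below are hypothetical placeholders.
scripts <- c("bench_datatable.R", "bench_collapse.R", "bench_dplyr.R")
timings <- sapply(scripts, function(f) {
    # each call starts a fresh R session so the tests cannot affect each other
    system.time(system2("Rscript", f))[["elapsed"]]
})
print(timings)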

eddelbuettel commented 8 months ago

Good catch on me having dropped the f for the three collapse aggregators. But re-running yields essentially unchanged results:

> res <- microbenchmark::microbenchmark(datatable = D[ ,.(mean=mean(measurement),min=min(measurement),max=max(measurement)),by=state],
+                                       collapse =  D |> collapse::fgroup_by(state) |> collapse::fsummarise(min=collapse::fmin(measurement), max=collapse::fmax(measurement), mean=collapse::fmean(measurement)) |> collapse::fungroup(),
+                                       dplyr = D |> dplyr::group_by(state) |> dplyr::summarise(min=min(measurement), max=max(measurement), mean=mean(measurement)) |> dplyr::ungroup(),
+                                       times = 10)
> res
Unit: seconds
      expr      min       lq     mean   median       uq      max neval
 datatable 29.06584 29.25456 30.59570 30.37740 31.55003 33.28734    10
  collapse 51.01506 51.94234 53.87729 53.78601 54.91867 59.91626    10
     dplyr 27.72498 29.45540 31.21465 30.65568 33.63867 35.84741    10
> 

(plot omitted)

As for using system() or system2(): there are now decades' worth of careful use of benchmark, microbenchmark, bench ... all measuring from a single expression. It is common, and it is the correct thing to do. If you have side effects, rather remove the side effects? You get into new cans of worms by measuring new processes via system() or system2(). At least consider validating your setup with test runs on 1e5 or 1e6 rows.
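
As a concrete version of that last suggestion, here is a small sketch (assuming a measurements file from the helper script) that checks the competing expressions return the same grouped summaries before timing anything large:

# Sketch: validate on a small data set before timing anything large.
D <- data.table::fread("measurements.csv", nrows = 1e5)

dt_res <- D[, .(mean = mean(measurement), min = min(measurement), max = max(measurement)), by = state]
dp_res <- D |>
    dplyr::group_by(state) |>
    dplyr::summarise(mean = mean(measurement), min = min(measurement), max = max(measurement)) |>
    dplyr::ungroup()

# results should agree (up to class and row order) before any benchmark is trusted
all.equal(as.data.frame(dt_res[order(state)]),
          as.data.frame(dp_res[order(dp_res$state), ]),
          check.attributes = FALSE)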

I also used duckdb with a direct SQL expression. It is way faster than the others by an order of magnitude on this (very large) machine, but it may also use more cores. One can then do the same easily with RSQLite, and it is way slower (!!). I am not convinced that preloading the data into a SQL database in memory is a fair comparison, so I dropped it. And I was unable to write a polars expression using the column measurement more than once, but that may be on me. You should, I think, measure each at its strength and core use and not via a 'tidy' redirect (which may be a second, more clearly labelled, benchmark set).

eitsupi commented 8 months ago

I am not convinced that preloading the data into a SQL database in memory is a fair comparison, so I dropped it.

DuckDB registers an R data.frame as a virtual table in the database and should not actually load the data, so there should be little cost to registering the data as a table. https://github.com/alejandrohagan/1br/blob/83615d337d6a0684857e7f526b4bf1505bc8f40e/duckdb_and_arrow.R#L20

Of course, SQLite does not have such a feature, so copying data will always take time in SQLite.
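
A minimal sketch of the registration path described above (assuming D already holds the measurements): duckdb_register() exposes the data.frame as a virtual table without a copy, whereas SQLite would need an actual copy via dbWriteTable().

# Sketch: register D as a virtual DuckDB table (no copy), then query it.
con <- DBI::dbConnect(duckdb::duckdb())
duckdb::duckdb_register(con, "measurements", D)
DBI::dbGetQuery(con, "SELECT state, min(measurement), max(measurement), avg(measurement)
                      FROM measurements GROUP BY state")
duckdb::duckdb_unregister(con, "measurements")
DBI::dbDisconnect(con, shutdown = TRUE)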

eddelbuettel commented 8 months ago

You may have misread / misinterpreted what I wrote and meant, and I did not share my code. In my exploratory tests I had an explicit DBI::dbWriteTable(dcon, "dtable", D, overwrite=TRUE) for both (here 'd' for duckdb, same with 's' for sqlite). Given that representation, OLAP (ie duckdb) is seen as blazingly fast (and credit to duckdb for being very good at what it does), whereas OLTP (ie sqlite) is not. But I don't think it is an entirely fair example (though someone posted a complete DuckDB and SQL example at the 1brc site, including the data read).

My objection, really, is about measuring the comparables, ie direct by-group summaries in data.table, dplyr, collapse and (where I failed) (r)polars. I am much less interested in tidy translators, though that could be a second investigation.

eitsupi commented 8 months ago

I see.

By the way, as a complete aside, Polars has SQL support, so such a query can be handled completely in SQL. https://rpolars.github.io/man/SQLContext_class/

If you use the polarssql package (which I wrote) you should be able to operate via the DBI package. https://github.com/rpolars/r-polarssql

eddelbuettel commented 8 months ago

Interesting. We may be starting to misuse this thread but when I try that (with the Jammy build of polars via r-universe) I get the same issue I had before when I tried to craft this by hand: the column in question only wants to be used once. Which is ... a little odd.

> lf <- polars::pl$LazyFrame(D)
> res <- polars::pl$SQLContext(frame = lf)$execute("select min(measurement), max(measurement), avg(measurement) from frame group by state")
Error: Execution halted with the following contexts
   0: In R: in $schema():
   0: During function call [.rs.describeObject(<environment>, "res", TRUE)]
   1: Encountered the following error in Rust-Polars:
        duplicate: column with name 'measurement' has more than one occurrences

      Error originated just after this operation:
      DF ["measurement", "state"]; PROJECT */2 COLUMNS; SELECTION: "None"

      Error originated just after this operation:
      ErrorStateSync(AlreadyEncountered(duplicate: column with name 'measurement' has more than one occurrences

      Error originated just after this operation:
      DF ["measurement", "state"]; PROJECT */2 COLUMNS; SELECTION: "None"))
      DF ["measurement", "state"]; PROJECT */2 COLUMNS; SELECTION: "None"
> 
eitsupi commented 8 months ago

My guess is that unlike most RDBMSs, Polars uses col for the column name of the result of something like func(col), so the column name of the result is duplicated. Using as and naming it differently would solve the error, wouldn't it? Like select min(measurement) as min_m, max(measurement) as max_m, avg(measurement) as avg_m from frame group by state

eddelbuettel commented 8 months ago

Oh, for completeness, I am running the 1e9 examples on a large server on campus with 512 GB RAM (!!) and 48 (slow-ish) Xeon cores at 2.1 GHz. If memory serves, data.table will default to 50% of the cores, but I do not know right now what dplyr and collapse do with all this.
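
For what it is worth, data.table's thread use can be inspected and pinned explicitly (a sketch; the other packages have their own controls, not shown here):

# Sketch: inspect and pin data.table's thread usage for comparability.
data.table::getDTthreads(verbose = TRUE)  # reports detected cores and the percentage default
data.table::setDTthreads(threads = 24)    # pin to a fixed number before benchmarking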

eddelbuettel commented 8 months ago

Most excellent:

> res <- polars::pl$SQLContext(frame = lf)$execute("select min(measurement) as min_m, max(measurement) as max_m, avg(measurement) as mean_m from frame group by state")
> res
[1] "polars LazyFrame naive plan: (run ldf$describe_optimized_plan() to see the optimized plan)"
 SELECT [col("min_m"), col("max_m"), col("mean_m")] FROM
  AGGREGATE
    [col("measurement").min().alias("min_m"), col("measurement").max().alias("max_m"), col("measurement").mean().alias("mean_m")] BY [col("state")] FROM
    DF ["measurement", "state"]; PROJECT */2 COLUMNS; SELECTION: "None"
> 

and wicked fast (same as duckdb) when starting from the already-prepared data structure:

> system.time(res$collect())
   user  system elapsed 
 70.505  19.169   2.140 
> 

And still very respectable and even best in class when we do it all at once:

> system.time({lf <- polars::pl$LazyFrame(D); res <- polars::pl$SQLContext(frame = lf)$execute("select min(measurement) as min_m, max(measurement) as max_m, avg(measurement) as mean_m from frame group by state")$collect()})
   user  system elapsed 
 82.638  29.134  22.918 
> 

I will stick this into the benchmark proper. Big thank you, as always, @eitsupi !

eddelbuettel commented 8 months ago

We have a new champion:

(plot omitted)

> res <- microbenchmark::microbenchmark(datatable = D[ ,.(mean=mean(measurement),min=min(measurement),max=max(measurement)),by=state],
+                                       collapse =  D |> collapse::fgroup_by(state) |> collapse::fsummarise(min=collapse::fmin(measurement), max=collapse::fmax(measurement), mean=collapse::fmean(measurement)) |> collapse::fungroup(),
+                                       dplyr = D |> dplyr::group_by(state) |> dplyr::summarise(min=min(measurement), max=max(measurement), mean=mean(measurement)) |> dplyr::ungroup(),
+                                       polars = {{lf <- polars::pl$LazyFrame(D); polars::pl$SQLContext(frame = lf)$execute("select min(measurement) as min_m, max(measurement) as max_m, avg(measurement) as mean_m from frame group by state")$collect()}},
+                                       times = 10)
> res
Unit: seconds
      expr      min       lq     mean   median       uq      max neval
 datatable 27.04999 27.57103 28.90611 29.30950 29.95487 30.76303    10
  collapse 51.02113 51.96048 53.08097 52.93228 53.61110 56.55661    10
     dplyr 28.85432 29.22231 31.12528 30.57937 33.38361 34.94457    10
    polars 23.64507 23.92012 25.25210 24.77925 26.39392 28.33606    10
> ggplot2::autoplot(res)
> 
eddelbuettel commented 8 months ago

A very impressive polars showing, with and without SQL. Even if I force the data.frame-to-polars conversion on each run, it wins:

> res <- microbenchmark::microbenchmark(datatable = D[ ,.(mean=mean(measurement),min=min(measurement),max=max(measurement)),by=state],
+                                       collapse =  D |> collapse::fgroup_by(state) |> collapse::fsummarise(min=collapse::fmin(measurement), max=collapse::fmax(measurement), mean=collapse::fmean(measurement)) |> collapse::fungroup(),
+                                       dplyr = D |> dplyr::group_by(state) |> dplyr::summarise(min=min(measurement), max=max(measurement), mean=mean(measurement)) |> dplyr::ungroup(),
+                                       polars_sql = {{lf <- polars::pl$LazyFrame(D); polars::pl$SQLContext(frame = lf)$execute("select min(measurement) as min_m, max(measurement) as max_m, avg(measurement) as mean_m from frame group by state")$collect()}},
+                                       polars = polars::pl$DataFrame(D)$group_by("state")$agg(polars::pl$col("measurement")$min()$alias("min_m"),polars::pl$col("measurement")$max()$alias("max_m"),polars::pl$col("measurement")$mean()$alias("mean_m")),
+                                       times = 10)
> res
Unit: seconds
       expr      min       lq     mean   median       uq      max neval
  datatable 26.40483 27.36088 28.79502 28.61824 30.08184 32.24413    10
   collapse 47.62153 49.16795 50.97285 50.07759 51.46359 60.20597    10
      dplyr 28.40911 29.34691 30.85193 30.55275 31.97368 34.53324    10
 polars_sql 22.37270 22.51855 24.28000 23.36114 26.76574 27.04358    10
     polars 22.41094 22.66564 23.44019 22.94000 23.88544 25.63082    10
> ggplot2::autoplot(res)
> 

(plot omitted)

eddelbuettel commented 8 months ago

And as the old 'it all depends' still holds, here is what I get back on my (slightly aged) computer with 1e6 and 1e7:

1e6

> D <- data.table::fread("measurements.1e6.csv")
+ sqltxt <- "select min(measurement) as min_m, max(measurement) as max_m, avg(measurement) as mean_m from frame group by state"
+ res <- microbenchmark::microbenchmark(datatable = D[ ,.(mean=mean(measurement),
+                                                         min=min(measurement),
+                                                         max=max(measurement)),by=state],
+                                       collapse =  D |> collapse::fgroup_by(state) |> collapse::fsummarise(min=collapse::fmin(measurement),
+                                                                                                           max=collapse::fmax(measurement),
+                                                                                                           mean=collapse::fmean(measurement)) |> collapse::fungroup(),
+                                       dplyr = D |> dplyr::group_by(state) |> dplyr::summarise(min=min(measurement),
+                                                                                               max=max(measurement),
+                                                                                               mean=mean(measurement)) |> dplyr::ungroup(),
+                                       polars_sql = {lf <- polars::pl$LazyFrame(D); polars::pl$SQLContext(frame = lf)$execute(sqltxt)$collect()},
+                                       polars = polars::pl$DataFrame(D)$group_by("state")$agg(polars::pl$col("measurement")$min()$alias("min_m"),
+                                                                                              polars::pl$col("measurement")$max()$alias("max_m"),
+                                                                                              polars::pl$col("measurement")$mean()$alias("mean_m")),
+                                       times = 10)
+ 
+ print(res)
Unit: milliseconds
       expr     min      lq    mean  median      uq     max neval cld
  datatable 13.3638 13.7731 15.0158 14.9566 15.8077 17.4406    10 a  
   collapse 17.3992 18.0630 25.3196 23.7119 30.3520 42.5892    10  b 
      dplyr 20.3400 22.9261 25.4593 26.2902 26.8203 31.4192    10  b 
 polars_sql 32.3368 33.0572 34.8059 34.9782 35.3903 39.2898    10   c
     polars 33.0885 33.2309 35.2959 35.1821 36.4092 38.9956    10   c
> 

1e7

> D <- data.table::fread("measurements.1e7.csv")
+ sqltxt <- "select min(measurement) as min_m, max(measurement) as max_m, avg(measurement) as mean_m from frame group by state"
+ res <- microbenchmark::microbenchmark(datatable = D[ ,.(mean=mean(measurement),
+                                                         min=min(measurement),
+                                                         max=max(measurement)),by=state],
+                                       collapse =  D |> collapse::fgroup_by(state) |> collapse::fsummarise(min=collapse::fmin(measurement),
+                                                                                                           max=collapse::fmax(measurement),
+                                                                                                           mean=collapse::fmean(measurement)) |> collapse::fungroup(),
+                                       dplyr = D |> dplyr::group_by(state) |> dplyr::summarise(min=min(measurement),
+                                                                                               max=max(measurement),
+                                                                                               mean=mean(measurement)) |> dplyr::ungroup(),
+                                       polars_sql = {lf <- polars::pl$LazyFrame(D); polars::pl$SQLContext(frame = lf)$execute(sqltxt)$collect()},
+                                       polars = polars::pl$DataFrame(D)$group_by("state")$agg(polars::pl$col("measurement")$min()$alias("min_m"),
+                                                                                              polars::pl$col("measurement")$max()$alias("max_m"),
+                                                                                              polars::pl$col("measurement")$mean()$alias("mean_m")),
+                                       times = 10)
+ 
+ print(res)
+ p1e7 <- ggplot2::autoplot(res) + ggplot2::ggtitle("N = 1e7") + tinythemes::theme_ipsum_rc()
Unit: milliseconds
       expr     min      lq    mean  median      uq     max neval cld
  datatable 138.280 153.186 179.441 163.553 172.607 325.932    10 a  
   collapse 387.709 401.260 437.516 426.304 475.471 526.429    10  b 
      dplyr 226.344 239.860 268.107 268.591 281.579 326.171    10   c
 polars_sql 287.727 291.987 301.697 297.225 299.797 346.353    10   c
     polars 291.714 292.332 298.029 297.448 301.144 307.805    10   c
> 

With patchwork side by side:

(plot omitted)

TimTaylor commented 8 months ago

We can be a little cute and use range rather than min and max directly. On 1e9 this gives me quite a reasonable saving with data.table (and I would presume dplyr).

library(data.table)

fun <- function(x) {
    range <- range(x)
    list(mean = mean(x), min = range[1], max = range[2])
}

DT <- fread("measurements.csv")

(res <- microbenchmark::microbenchmark(
    dt = DT[ ,.(mean=mean(measurement), min=min(measurement), max=max(measurement)), by=state],
    dt_range = DT[, fun(measurement), by = state],
    times = 1,
    check = "equal"
))
#> Unit: seconds
#>      expr      min       lq     mean   median       uq      max neval
#>        dt 17.52600 17.52600 17.52600 17.52600 17.52600 17.52600     1
#>  dt_range 10.23352 10.23352 10.23352 10.23352 10.23352 10.23352     1
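
A sketch of the dplyr analogue presumed above, reusing the same single-pass range() idea (object and column names follow the data.table example; not timed here):

# Sketch: same range() trick via a data-frame-returning summary in dplyr.
sum_stats <- function(x) {
    rg <- range(x)
    tibble::tibble(mean = mean(x), min = rg[1], max = rg[2])
}

DT |>
    dplyr::group_by(state) |>
    dplyr::summarise(sum_stats(measurement)) |>
    dplyr::ungroup()
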
eddelbuettel commented 8 months ago

That's good. Likely nixes out one pass over the data.

eddelbuettel commented 8 months ago

For completeness, when I posted this on social media, @schochastics made two good points (not reproduced here):

I am also starting to think that these should really be presented / aggregated over 1e6, 1e7, 1e8, and 1e9 rows. The latter is interesting (if only for the comparison with 1brc) but that size seems to flatten some differences, as the core time seems to be spent running over the data vector to compute the statistics (three times, as @TimTaylor observed -- which one can improve upon too).
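
One way to do that aggregation over sizes (a sketch, assuming file names as in the 1e6/1e7 runs above):

# Sketch: run the same benchmark for several sizes and stack the results.
sizes <- c("1e6", "1e7", "1e8")
all_res <- do.call(rbind, lapply(sizes, function(n) {
    D <- data.table::fread(sprintf("measurements.%s.csv", n))
    res <- microbenchmark::microbenchmark(
        datatable = D[, .(mean = mean(measurement), min = min(measurement), max = max(measurement)), by = state],
        dplyr = D |> dplyr::group_by(state) |>
            dplyr::summarise(mean = mean(measurement), min = min(measurement), max = max(measurement)) |>
            dplyr::ungroup(),
        times = 10)
    cbind(as.data.frame(res), rows = n)   # keep the size for later plotting/aggregation
}))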

schochastics commented 8 months ago

Here is my base R benchmark excluding polars (and two versions of tapply, because I was wondering if do.call is faster)

D <- data.table::fread("measurements1e6.csv", stringsAsFactors = TRUE)

sum_stats_dplyr <- function(x) {
    rg <- range(x)
    tibble::tibble(mean = mean(x), min = rg[1], max = rg[2])
}

sum_stats_list <- function(x) {
    rg <- range(x)
    list(mean = mean(x), min = rg[1], max = rg[2])
}

sum_stats_vec <- function(x) {
    rg <- range(x)
    c(mean = mean(x), min = rg[1], max = rg[2])
}

microbenchmark::microbenchmark(
    dplyr = D |> dplyr::group_by(state) |> dplyr::summarise(sum_stats_dplyr(measurement)) |> dplyr::ungroup(),
    datatable = D[, .(sum_stats_list(measurement)), by = state],
    collapse = D |> collapse::fgroup_by(state) |> collapse::fsummarise(sum_stats_dplyr(measurement)) |> collapse::fungroup(),
    aggregate = aggregate(measurement ~ state, data = D, FUN = sum_stats_vec),
    tapply1 = do.call("rbind", tapply(D$measurement, D$state, sum_stats_vec)),
    tapply2 = sapply(tapply(D$measurement, D$state, sum_stats_vec), rbind),
    lapply = {
        results <- lapply(split(D, D$state), function(x) {
            rg <- range(x$measurement)
            data.frame(
                state = unique(x$state),
                min = rg[1],
                max = rg[2],
                mean = mean(x$measurement)
            )
        })
        do.call(rbind, results)
    },
    by = by(D$measurement, D$state, sum_stats_vec), times = 10
)

(plot omitted)

detailed for N=1e8

Unit: seconds
      expr     min      lq     mean  median       uq      max neval
     dplyr 3.66663 4.04299  4.41022 4.34344  4.48238  5.57671    10
 datatable 3.50274 3.51561  3.90109 3.71339  4.03352  5.27914    10
  collapse 5.15419 5.27137  5.70125 5.39071  5.91769  7.32761    10
   tapply1 4.39228 4.51080  5.16972 5.07856  5.53077  6.79544    10
   tapply2 3.60726 4.35811  4.91496 4.88236  5.18694  6.52182    10
    lapply 9.28849 9.45204 10.11139 9.84460 10.78338 11.72989    10
        by 7.08288 7.38480  8.12217 7.97933  8.42832 10.36398    10
schochastics commented 8 months ago

ok there is an interesting contender:

    reduce = {
        state_list <- split(D$measurement, D$state)
        Reduce(function(x, y) {
            res <- sum_stats_vec(state_list[[y]])
            rbind(x, c(state = y, mean = res[1], min = res[2], max = res[3]))
        }, names(state_list), init = NULL)
    }

for N=1e8

Unit: seconds
      expr     min      lq    mean  median      uq     max neval
     dplyr 2.85810 3.10782 3.75953 3.30874 4.31709 6.26115    10
 datatable 2.50896 2.61430 2.91622 2.76897 3.22380 3.63510    10
  collapse 4.04716 4.13675 4.94366 4.54078 5.32954 6.82324    10
   tapply1 3.17849 3.50470 4.00139 3.77609 4.01608 5.82055    10
   tapply2 3.19185 3.41511 3.53030 3.46444 3.76154 3.89066    10
    lapply 6.36581 7.12786 7.62443 7.37824 7.96529 9.47024    10
        by 5.25023 5.42275 6.49624 6.27840 7.54884 8.45265    10
    reduce 2.86560 3.23987 4.18529 3.60106 4.89074 6.89234    10
eddelbuettel commented 8 months ago

BTW I found sorting by, say, median time helps. I found an open issue at microbenchmark (for the plot function) but it really is just the older one-liner to recreate levels in order. I quickly wrapped that into a simpler helper in a 'misc goodies' package of mine: https://github.com/eddelbuettel/dang/blob/master/R/reorderMicrobenchmarkResults.R
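
The underlying one-liner is essentially this (a sketch, operating on a microbenchmark result res):

# Sketch: reorder the expression levels by median time, then replot.
res$expr <- factor(res$expr, levels = names(sort(tapply(res$time, res$expr, median))))
ggplot2::autoplot(res)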

schochastics commented 8 months ago

sorry for spamming this issue so much. But I think this is the last for me: I remembered that Rfast exists.

rfast = {
        lev_int <- as.numeric(D$state)
        minmax <- Rfast::group(D$measurement, lev_int, method = "min.max")
        data.frame(
            state = levels(D$state),
            mean = Rfast::group(D$measurement, lev_int, method = "mean"),
            min = minmax[1, ],
            max = minmax[2, ]
        )
    }

again, N=1e8 (ordered by mean, thanks @eddelbuettel)

Unit: seconds
      expr     min      lq    mean  median      uq     max neval
     rfast 1.25014 1.41937 1.62158 1.67204 1.81019 1.97656    10
 datatable 2.24556 2.47969 2.68018 2.68014 2.91236 2.99353    10
   tapply2 2.56247 2.66205 3.07919 2.88239 3.71769 3.82766    10
    reduce 2.54882 2.70055 2.96301 2.91298 3.02862 3.86465    10
   tapply1 2.56365 2.65276 2.99696 2.97633 3.06698 4.03936    10
     dplyr 2.98505 3.22951 3.40789 3.39959 3.57909 3.83317    10
  collapse 4.02262 4.16275 4.55015 4.29150 4.66520 6.63107    10
        by 4.81229 4.98914 5.18943 5.01704 5.48522 5.98817    10
    lapply 5.85601 6.23911 6.57519 6.51415 6.61873 8.13780    10

Unfortunately I am failing to install polars on my old machine to check how it compares

lvalnegri commented 8 months ago

you can actually gain a few seconds on 1e9 if you apply an additional split-apply-combine (SAC) step to DT:

library(data.table)
library(datasets)

n <- 1e9
set.seed(2024)

y <- data.table(  
        m = stats::rnorm(n),
        s = base::sample(datasets::state.abb, size = n, replace = TRUE)
)

yr <- microbenchmark::microbenchmark(
        'dt ' = y[ , .(mean(m), min(m), max(m)), s],
        'dt-sac' = rbindlist(lapply( unique(y$s), \(x) data.table( s = x, y[s == x , .(mean(m), min(m), max(m))] ) )),
        times = 10
)
yr
ggplot2::autoplot(yr)

Unit: seconds
   expr      min       lq     mean   median       uq      max neval cld
    dt  25.08501 25.61196 25.82276 25.92505 26.02473 26.43565    10  a 
 dt-sac 22.35620 22.87455 23.41769 22.96819 23.25624 26.92710    10   b

(plot omitted)

eddelbuettel commented 8 months ago

For completeness, and as @TimTaylor had pointed out to me in DM and at the data.table issue linked above, the development version of data.table is quite a bit faster:

Results

> res
Unit: seconds
       expr      min       lq     mean   median       uq      max neval
  datatable 11.15210 12.11368 13.50438 13.47416 14.62594 16.71623    10
     polars 22.73818 22.84256 24.19483 23.97759 24.61460 26.98668    10
 polars_sql 22.65298 22.89407 24.85488 24.19280 26.52703 28.26696    10
      dplyr 27.50917 30.36789 30.73162 31.01001 31.30612 33.00118    10
   collapse 48.79312 50.55301 53.45232 52.58460 54.01983 61.88502    10
     tapply 55.46752 56.28768 57.49995 57.24879 58.14358 61.21602    10
     lapply 71.10101 73.53757 75.22146 74.20503 77.62213 80.32287    10
         by 71.32586 72.49985 74.28661 74.38054 75.79082 78.18177    10
> 

Plot

(plot omitted)

This uses the current development branch of data.table; the respective changes have been in that branch for a while, as far as I know.

TimTaylor commented 8 months ago

Awesome - I hadn't compared with polars so that's an impressive surprise.

grantmcdermott commented 8 months ago

@eddelbuettel I hear you on your reasons for not including DuckDB. But since Polars uses the same kind of OLAP engine, albeit here embedded directly in an R "dialect", I feel it's only fair that DuckDB gets a showing too. (Also to show R users, specifically, just how freaking fast it is.)

As @eitsupi says, registering a DuckDB table in memory from R is pretty quick and painless. No real difference IMO to setting/scanning a lazy Polars frame.

Otherwise, very impressive speed improvements in the latest version of data.table. I believe an official 1.15.0 release is just around the corner...

lvalnegri commented 8 months ago

duckdb memory vs duckdb disk vs data.table (dev): @eddelbuettel I really don't know how they do it, but using the disk is faster than memory...

library(data.table)
library(datasets)
library(duckdb)

n <- 1e9
set.seed(2024)

y <- data.table(  
        m = stats::rnorm(n),
        s = base::sample(datasets::state.abb, size = n, replace = TRUE)
)

cnm <- dbConnect(duckdb())
dbWriteTable(cnm, 'obr', y, row.names = FALSE)

cnf <- dbConnect(duckdb('~/temp/obr.dkdb')) # 9GB file
dbWriteTable(cnf, 'obr', y, row.names = FALSE)

yr <- microbenchmark::microbenchmark(
        'dt' = y[ , .(mean(m), min(m), max(m)), s],
        'dkm' = dbGetQuery(cnm, 'SELECT s, min(m), max(m), avg(m) from obr group by s'),
        'dkf' = dbGetQuery(cnf, 'SELECT s, min(m), max(m), avg(m) from obr group by s'),
        times = 10
)
yr
# Unit: seconds
#  expr       min        lq      mean    median        uq       max neval cld
#    dt 11.139752 11.691545 12.244513 12.529754 12.578781 12.665636    10 a  
#   dkm  1.634721  1.655189  1.664793  1.667811  1.676581  1.683979    10  b 
#   dkf  1.151143  1.172245  1.177155  1.177535  1.185760  1.188577    10   c
ggplot2::autoplot(yr)

dbDisconnect(cnm, shutdown = TRUE)
dbDisconnect(cnf, shutdown = TRUE)

(plot omitted)

eddelbuettel commented 8 months ago

I had these too (using a memory db), but hesitated showing them. Of course duckdb is blazingly fast here, and that is why we all love it. But the original one billion row benchmark started from a csv file. So starting from a data.frame (or, rather, a data.table inheriting from data.frame) seems like a fair comparison.

Not accounting for creating the duckdb internal data structure is, to me, not fair. So I didn't include it. Once you have such a table, of course grouped summaries are fast -- that's what the engine is there for. duckdb has other strengths too; I just have not found a good apples-to-apples comparison. We're doing apples-to-peaches. Makes for a nice cobbler but ain't fair ...

lvalnegri commented 8 months ago

on another note, having a big machine does not seem to be such a big deal with data.table (unless I got something wrong)

setDTT <- \(x) { setDTthreads(x); y[ , .(mean(m), min(m), max(m)), s] }
yr <- microbenchmark::microbenchmark(
    'dt06' = setDTT( 6),
    'dt12' = setDTT(12),
    'dt18' = setDTT(18),
    'dt24' = setDTT(24),
    'dt30' = setDTT(30),
    times = 10
)
yr
# Unit: seconds
#  expr      min       lq     mean   median       uq      max neval cld
#  dt06 11.74759 12.38181 12.69686 12.72297 13.02676 13.68092    10 a  
#  dt12 11.03751 11.44846 12.24313 12.24350 12.94132 13.41429    10 ab 
#  dt18 10.17322 10.31603 10.86685 10.86211 11.32207 11.55271    10   c
#  dt24 10.62439 10.77381 11.47472 11.37302 11.96992 13.27930    10  bc
#  dt30 10.12770 10.32557 11.25893 11.30042 11.73645 12.94521    10   c
ggplot2::autoplot(yr)

(plot omitted)

TimTaylor commented 8 months ago

@eddelbuettel - I think the closest comparison (if you wanted to go that route) would be using the duckdb_register() function and including that as part of the timing (I'm not a duckdb user but @eitsupi highlighted it earlier).
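
A sketch of what that could look like, with the registration counted inside the timed expression (using the y table from lvalnegri's setup; connection setup kept outside):

# Sketch: include duckdb_register() in the measured work.
con <- DBI::dbConnect(duckdb::duckdb())
res <- microbenchmark::microbenchmark(
    duckdb = {
        duckdb::duckdb_register(con, "obr", y)
        out <- DBI::dbGetQuery(con, "SELECT s, min(m), max(m), avg(m) FROM obr GROUP BY s")
        duckdb::duckdb_unregister(con, "obr")
        out
    },
    times = 10)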

That said - constraints are good... For my own timings I've made use of {inline} and a simple compiled loop ... 'tis all still R code but doesn't fit with the spirit so 🤷‍♂️

lvalnegri commented 8 months ago

Not accounting for creating the duckdb internal data structure is, to me, not fair.

Setting up an index beforehand on the state column gets data.table nearer the duckdb result, and the differences across core counts are now also apparent to some extent:

setkey(y, 's')
yr <- microbenchmark::microbenchmark(
    'dt06' = setDTT( 6),
    'dt12' = setDTT(12),
    'dt18' = setDTT(18),
    'dt24' = setDTT(24),
    'dt30' = setDTT(30),
    times = 10
)
yr
# Unit: seconds
#  expr      min       lq     mean   median       uq      max neval cld
#  dt06 8.124833 8.681474 8.782995 8.877845 8.980673 9.372818    10 a    
#  dt12 7.016027 7.191650 7.522454 7.543799 7.796921 8.077547    10  b   
#  dt18 5.549116 5.945838 6.280376 6.316387 6.623486 6.734678    10   c  
#  dt24 4.905373 4.976062 5.426632 5.450692 5.854341 5.948980    10    d 
#  dt30 4.421263 4.525206 4.723312 4.593708 4.632048 5.478117    10     e
ggplot2::autoplot(yr)

(plot omitted)

eddelbuettel commented 8 months ago

using the disk is faster than memory...

I think what you see here is the Linux kernel helping you by turning disk into memory through caching ...

eddelbuettel commented 8 months ago

The threading example is good. As is using an index -- we'd get that via keyby=state instead of by=state too methinks.

jangorecki commented 8 months ago

We can be a little cute and use range rather than min and max directly. On 1e9 this gives me quite a reasonable saving with data.table

It generally should not, because range is not GForce optimized. Are you by any chance on an older version of DT? Using verbose=TRUE is quite useful in such cases.
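
For example (a sketch, using the DT object from earlier), the verbose output will state whether the grouped expression was GForce-optimized:

# Sketch: check for GForce optimization in the verbose output.
DT[, .(mean = mean(measurement), min = min(measurement), max = max(measurement)),
   by = state, verbose = TRUE]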

TimTaylor commented 8 months ago

It generally should not, because range is not GForce optimized. Are you by any chance on an older version of DT? Using verbose=TRUE is quite useful in such cases.

Yes, initially it was 1.14.10, but I have now switched to master and the direct (min/max) approach is much faster.

ben-schwen commented 8 months ago

Maybe a bit late to the party but here we go:

Setup is from lvalnegri; the sample here has fifty different groups.

library(datasets)
library(data.table)
n <- 1e8
set.seed(2024)
y <- data.table(  
        m = stats::rnorm(n),
        s = base::sample(datasets::state.abb, size = n, replace = TRUE)
)

setkey(y, 's')
setDTT <- \(x) { setDTthreads(x); y[ , .(mean(m), min(m), max(m)), s] }
yr <- microbenchmark::microbenchmark(
    'dt02' = setDTT(2),
    'dt04' = setDTT(4),
    'dt06' = setDTT(6),
    'dt08' = setDTT(8),
    'dt10' = setDTT(10),
    times = 10
)
yr

# 1.14.10
# Unit: seconds
#  expr      min       lq     mean   median       uq      max neval cld
#  dt02 3.003172 3.074757 3.148601 3.138230 3.240320 3.311348    10 a  
#  dt04 2.751539 2.840537 2.906491 2.877485 2.963776 3.119828    10  b 
#  dt06 2.699096 2.802055 2.850967 2.848263 2.914460 3.045719    10  b 
#  dt08 2.663448 2.682338 2.805176 2.787870 2.860383 3.063182    10  bc
#  dt10 2.621527 2.682646 2.703065 2.705433 2.739840 2.802580    10   c

# 1.14.99 aka 1.15.0
# Unit: milliseconds
#  expr      min       lq     mean   median       uq      max neval  cld
#  dt02 1401.196 1515.518 1546.238 1542.442 1600.611 1670.684    10 a   
#  dt04 1106.861 1290.204 1318.007 1303.570 1365.536 1450.535    10  b  
#  dt06 1074.459 1087.036 1143.726 1128.395 1223.271 1235.099    10   c 
#  dt08 1036.661 1042.637 1116.963 1127.266 1176.382 1195.730    10   cd
#  dt10  967.258 1021.693 1041.920 1026.675 1043.143 1143.767    10    d

# gmin parallelized per group
# Unit: milliseconds
#  expr       min        lq      mean    median        uq       max neval cld
#  dt02 1221.1177 1264.3856 1316.9177 1311.3068 1392.7166 1416.7932    10  a 
#  dt04  844.3419  910.7867 1144.4065  941.2959 1035.7934 2937.8885    10  ab
#  dt06  767.6377  787.4808  833.9695  812.1181  842.2414 1011.9838    10   b
#  dt08  785.2025  802.7681  847.5161  814.6925  848.3238 1024.4026    10   b
#  dt10  697.5232  753.0673  809.0413  799.7735  883.6270  907.5998    10   b

With more groups or more cores, the speedup of the per-group parallelized gmin over 1.15.0 should be even higher.

alejandrohagan commented 8 months ago

hey everyone -- thanks especially to @eddelbuettel, @eitsupi, @TimTaylor and @schochastics for all the discussion!

I had business travel to Thailand and couldn't work on this, and quite frankly the conversation is way above my skill level.

What I have basically concluded is that it is a bit meaningless to call something a "benchmark" unless there are consistent and clear standards for that benchmark.

Seeing the various results, I noticed there can be considerable variability across runs, which suggests that in addition to R environment / package issues, the underlying machine and its configuration can add significant variability to the results.

The lesson for me is that different packages can perform differently depending on how you set them up, the dataset's structure, and your machine's configuration -- but trying to declare a "winner" is futile unless the benchmark standards are clear.

Thank you everyone for spending time and effort in helping to understand the issue. I do eventually want to do a write-up on the learnings here, but I won't have time any time soon.

jangorecki commented 8 months ago

btw, this is a good read: https://rdatatable.gitlab.io/data.table/articles/datatable-benchmarking.html -- mostly basics but still important.

jrosell commented 7 months ago

I changed the approach of this benchmark by including the read operation in the timing. I have also added multiple duckdb options. If you can and want to, you can try 1e9 as detailed in my repo: github.com/jrosell/1br

This is what I get for the 1e8 benchmark: (plot omitted)

Feedback is welcome.