grantmcdermott opened this issue 1 year ago
Hi Grant, thank you for the suggestion!

I currently don't have a lot of bandwidth to add a whole new solution to the benchmark, but if you want to open a PR that adds the necessary `setup-collapse.sh`, `ver-collapse.sh`, `upg-collapse.sh`, `groupby-collapse.R`, and `join-collapse.R`, then I'd be happy to review. A good place to start would be copying the files in the dplyr folder of the benchmark and just changing the imported libraries. That will probably get you more than halfway.
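To make the "copy the dplyr files" suggestion concrete, here is a minimal sketch of what one question in a hypothetical `groupby-collapse.R` might look like. The input path, the q1 definition, and the omission of the benchmark's logging helpers are all assumptions for illustration, not the actual script:

```r
# groupby-collapse.R -- sketch of a single group-by question, modeled loosely
# on the dplyr solution. The real script would also source the benchmark's
# helper/logging code; the file path below is a hypothetical example.
library(collapse)
library(data.table) # for fread(), as in the other R solutions

x <- fread("data/G1_1e7_1e2_0_0.csv") # hypothetical input file

# q1: sum v1 by id1, written with collapse's fast grouped verbs
ans <- x |>
  fgroup_by(id1) |>
  fsummarise(v1 = fsum(v1))
print(dim(ans))
```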
See `repro.sh` for steps to run the benchmark either locally or on an AWS instance. If no errors are thrown for the 0.5 GB and 5 GB datasets, I'd be happy to merge your PR and re-run the benchmark to include results for collapse.
As for the reshaping benchmarks, I think it's a great idea! It would take a while to include those queries, however, as I would first like to do a re-work of the report generation code: it was hard to track down bugs while re-running the benchmark. As mentioned in https://github.com/h2oai/db-benchmark/issues/175, I would be happy to review or collaborate on any PRs that help maintain and improve the benchmark!
With join (and reshape) capabilities coming in the impending collapse 2.0 release, it looks like a good time to get a PR ready for this issue. @SebKrantz, would you like me to take a stab at this and ping you when it's ready? Or would you rather handle it yourself? Two thoughts / questions:
collapse author here. Thanks @grantmcdermott and @vincentarelbundock for the initiative! I'm happy with adding collapse to the benchmarks, and also happy for any suggested code, but would like to wait for the pending v2.0 release (which includes implementations of table joins and reshaping). I will also ensure the benchmarking code is equivalent to the other solutions (collapse has some unfavorable defaults, e.g. `sort = TRUE`, `na.rm = TRUE`, `nthreads = 1`). I expect v2.0 to be released within a month, and will then get back to this and submit a comprehensive PR, integrating what was suggested here.
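Since the defaults were flagged, here is a small sketch of what aligning them might look like. `set_collapse()` exists in recent collapse versions, but the `sort` option and the `join()`/`pivot()` calls target the then-pending v2.0 API, so treat this as illustrative rather than final:

```r
library(collapse)

# Align global defaults with the other benchmark entries (sketch against the
# pending v2.0 API; option availability may differ by version).
set_collapse(
  nthreads = 4,     # default is 1; other multithreaded entries use all cores
  na.rm    = TRUE,  # explicit: the dplyr/data.table scripts also skip NAs
  sort     = FALSE  # default is TRUE; unsorted grouping matches data.table
)

# v2.0 additions relevant to the new join/reshape questions (toy data):
df1 <- data.frame(id = 1:3, v = c(10, 20, 30))
df2 <- data.frame(id = 2:4, w = c("a", "b", "c"))
join(df1, df2, on = "id", how = "left")   # table join
pivot(df1, ids = "id", how = "longer")    # wide -> long reshape
```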
Sounds good, @SebKrantz. You may want to use my PR as a starting point, since most of the setup and group-by code is close to done.
FYI, the `dplyr` and `data.table` benchmarks use `na.rm = TRUE`, but you are right that the `sort` and `nthreads` arguments may need to be adjusted.
Stoked to see this back up and running!
(As an aside, the relentless performance gains of DuckDB are truly impressive.)
Two suggestions:
Please consider the collapse R package (link). In my own set of benchmarks, collapse is typically at or near the top for various group-by operations on datasets in the 0.5-5 GB range. (I haven't tested larger than that, and should also say it doesn't support join operations yet.) I can add a PR if interested.

Closed via #33.

Thanks again for all the effort in resurrecting this.