duckdblabs / db-benchmark

reproducible benchmark of database-like ops
https://duckdblabs.github.io/db-benchmark/
Mozilla Public License 2.0

Suggestion: Include reshape benchmarks #3

Open grantmcdermott opened 1 year ago

grantmcdermott commented 1 year ago

Stoked to see this back up and running!

(As an aside, the relentless performance gains of DuckDB are truly impressive.)

Two suggestions:

  1. Please consider the collapse R package. In my own set of benchmarks, collapse is typically at or near the top of various groupby operations for datasets on the order of 0.5-5 GB. (I haven't tested larger than that and should also say it doesn't support join operations yet.) I can add a PR if interested. Closed via #33.
  2. There was talk over at the old repo of adding a set of reshape benchmarks. Personally, I think this would be great to have. See: https://github.com/h2oai/db-benchmark/issues/175

Thanks again for all the effort in resurrecting this.

Tmonster commented 1 year ago

Hi Grant, Thank you for the suggestion!

I currently don't have a lot of bandwidth to add a whole new solution to the benchmark, but if you want to open a PR that adds the necessary setup-collapse.sh, ver-collapse.sh, upg-collapse.sh, groupby-collapse.R, and join-collapse.R then I'd be happy to review. A good place to start would be copying the files in the dplyr folder in the benchmark and just changing the imported libraries. That will probably get you more than halfway.
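The copy-and-rename step suggested above could be sketched roughly as follows. This assumes the dplyr folder uses the same `<task>-<solution>.<ext>` naming as the collapse files listed (not verified against the repo), and `scaffold_solution` is a hypothetical helper, not something the benchmark ships. Swapping the imported libraries inside the copied scripts is still a manual step.

```shell
# Hypothetical helper: copy <src>/<task>-<src>.<ext> to <dst>/<task>-<dst>.<ext>.
# The naming pattern is an assumption based on the file names in this thread.
# The body runs in a subshell ( ... ), so its variables stay local and
# `exit` only leaves the function, not the calling shell.
scaffold_solution() (
  src_dir="$1"
  dst_dir="$2"
  src_name="$(basename "$src_dir")"
  dst_name="$(basename "$dst_dir")"

  if [ ! -d "$src_dir" ]; then
    echo "source folder '$src_dir' not found" >&2
    exit 1
  fi

  mkdir -p "$dst_dir"
  for f in "$src_dir"/*-"$src_name".*; do
    # Skip the unexpanded glob when nothing matches.
    [ -e "$f" ] || continue
    base="$(basename "$f")"
    task="${base%%-"$src_name".*}"   # e.g. "groupby"
    ext="${base##*-"$src_name".}"    # e.g. "R"
    cp "$f" "$dst_dir/$task-$dst_name.$ext"
  done
)
```

Run from the benchmark root as `scaffold_solution dplyr collapse`, then edit the copied scripts by hand. Files that don't follow the assumed naming pattern are left out on purpose, so nothing unexpected gets copied.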

See repro.sh for steps to run the benchmark either locally or on an AWS instance. If no errors are thrown for the 0.5GB & 5GB datasets I'd be happy to merge your PR and re-run the benchmark to include results for collapse.

Tmonster commented 1 year ago

As for the reshaping benchmarks, I think it's a great idea!

It would take a while to actually include those queries in the benchmark, however, as I would need to:

  1. Create new queries and datasets. (Although I believe the group by datasets could work well for this)
  2. Create new reshape-solution.* scripts for each of the solutions that support reshaping functionality
  3. Modify the report generation code to include reshape results

I would like to do a re-work of the report generation code, as it was hard to track down bugs while re-running the benchmark. As mentioned in https://github.com/h2oai/db-benchmark/issues/175, however, I would be happy to review or collaborate on any PRs that help maintain and improve the benchmark!

grantmcdermott commented 1 year ago

With join (and reshape) capabilities coming to the impending collapse 2.0 release, it looks like a good time to get a PR ready for this issue. @SebKrantz would you like me to take a stab at this and ping you when it's ready? Or do you just want to handle it yourself? Two thoughts / questions:

  1. One easy solution is to take the existing dplyr implementation and adapt the "f"(ast)-prefix versions (group_by -> fgroup_by etc). Or would you prefer idiomatic collapse code?
  2. Is there a limit to the number of threads we should impose?

SebKrantz commented 1 year ago

collapse author here. Thanks @grantmcdermott and @vincentarelbundock for the initiative! I'm happy with adding collapse to the benchmarks, and also happy for any suggested code, but would like to wait for the pending v2.0 release (which includes implementations of table joins and reshaping). I will also ensure the benchmarking code is equivalent to other DBMS (collapse has some unfavorable defaults e.g. sort = TRUE, na.rm = TRUE, nthreads = 1). I expect v2.0 to be released within 1 month, and will then get back to this and submit a comprehensive PR, integrating what was suggested here.

vincentarelbundock commented 1 year ago

Sounds good @SebKrantz.

You may want to use my PR as a starting point since most of the setup and group-by stuff is close to done.

FYI, the dplyr and data.table benchmarks use na.rm=TRUE, but you are right that the sort and nthreads arguments may need to be adjusted.