Open alamb opened 1 year ago
I would like to work on this
Thank you @palash25
Sorry for the inactivity on this. My RSI came back so I was taking a break from typing. I will try to submit the PR in a day or two.
No problem -- I hope you feel better soon
Is this something that's still wanted? I took a look at doing this but it looks like the data isn't hosted on the benchmark repo, just data gen scripts in R.
I think it would be useful. Thank you
I think figuring out how to generate the data locally would be super valuable -- perhaps we can use a Docker-like approach as we do for tpch:
So it would run like
./bench.sh data h2o
Which would leave data in datafusion/benchmarks/data/h2o
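For anyone who wants something to experiment with before the real generator is wired up, here is a minimal sketch of a local data generator. It is not a port of the R scripts: the column names (`id1`..`id6`, `v1`..`v3`), the value ranges, and the function name `generate_groupby_csv` are assumptions based on my recollection of the groupby dataset, and should be checked against the data gen scripts in the benchmark repo.

```python
import csv
import random
from pathlib import Path


def generate_groupby_csv(path, n_rows=1000, k=10, seed=0):
    """Write a small groupby-style CSV (assumed layout) and return its path."""
    rng = random.Random(seed)  # fixed seed so repeated runs produce the same file
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)  # create the data directory if missing
    n_high = max(n_rows // k, 1)  # cardinality of the high-cardinality keys (assumption)
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id1", "id2", "id3", "id4", "id5", "id6", "v1", "v2", "v3"])
        for _ in range(n_rows):
            writer.writerow([
                f"id{rng.randint(1, k):03d}",        # low-cardinality string key
                f"id{rng.randint(1, k):03d}",        # low-cardinality string key
                f"id{rng.randint(1, n_high):010d}",  # high-cardinality string key
                rng.randint(1, k),                   # low-cardinality integer key
                rng.randint(1, k),                   # low-cardinality integer key
                rng.randint(1, n_high),              # high-cardinality integer key
                rng.randint(1, 5),                   # small integer value
                rng.randint(1, 15),                  # small integer value
                round(rng.uniform(0, 100), 6),       # float value
            ])
    return path
```

Calling `generate_groupby_csv("datafusion/benchmarks/data/h2o/groupby_small.csv")` would leave a file under the directory mentioned above; the file name is hypothetical.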
🤔
Is your feature request related to a problem or challenge?
Follow on to #7052. There is an interesting database benchmark called "H2O.ai database-like benchmark" that DuckDB seems to have revived (perhaps because the original went dormant with very old / slow duckdb results). More background here: https://duckdb.org/2023/04/14/h2oai.html#results
@Dandandan added a new solution for datafusion here: https://github.com/duckdblabs/db-benchmark/pull/18
However, there is no easy way to run the h2o benchmark within the datafusion repo. There is an old version of some of these benchmarks in the code: https://github.com/apache/arrow-datafusion/blob/main/benchmarks/src/bin/h2o.rs
Describe the solution you'd like
I would like someone to make it easy to run the h2o.ai benchmark in the datafusion repo.
Ideally this would look like
I would expect to be able to run the individual queries like this
Some steps might be
bench.sh, following the model of existing benchmarks
Describe alternatives you've considered
We could also simply remove the h2o.ai benchmark script, as it is not clear how important it will be long term.
Additional context
I think this is a good first issue as the task is clear, and there are existing patterns in bench.sh, dfbench and in