Closed doupache closed 1 month ago
@doupache Thanks. It seems promising to integrate the Join Order Benchmark. I look forward to taking on the follow-up tasks.
I think adding the join order benchmark would be reasonable.
Can someone help me understand the usual process for adding a third-party license in an Apache project?
I would personally recommend following the model of the other benchmarks: rather than incorporating the files directly, download them on demand. If you do this, I don't think we need any licensing updates.
The benchmarking scripts are here: https://github.com/apache/datafusion/tree/main/benchmarks
I would recommend working on orchestrating the process using https://github.com/apache/datafusion/blob/main/benchmarks/bench.sh
So a benchmark session might look something like
bench.sh data job
bench.sh run job
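To make the "download on demand" suggestion concrete, here is a minimal sketch of how such a step could be structured. It is not the actual `bench.sh` implementation: the function names `fetch_job_data` and `job_bench`, the `imdb.tgz` file name, the `DATA_DIR` layout, and the `JOB_DATA_URL` value are all hypothetical placeholders chosen for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: NOT the actual bench.sh implementation.
set -eu

# Both values are placeholders chosen for illustration (assumptions).
JOB_DATA_URL="${JOB_DATA_URL:-https://example.com/imdb.tgz}"
DATA_DIR="${DATA_DIR:-./data}"

# Download the JOB archive unless a previous run already fetched it.
fetch_job_data() {
    local target_dir="${DATA_DIR}/imdb"
    local archive="${target_dir}/imdb.tgz"

    if [ -f "${archive}" ]; then
        echo "JOB data already present at ${archive}, skipping download"
        return 0
    fi

    mkdir -p "${target_dir}"
    echo "Downloading JOB data to ${archive}"
    curl --fail --location --output "${archive}" "${JOB_DATA_URL}"
}

# Dispatch "<command> <benchmark>" pairs, mirroring the session above.
job_bench() {
    local command="${1:-}"
    local benchmark="${2:-}"

    case "${command} ${benchmark}" in
        "data job")
            fetch_job_data
            ;;
        "run job")
            fetch_job_data   # make sure the data exists before running
            echo "Running JOB queries against ${DATA_DIR}/imdb (placeholder step)"
            ;;
        *)
            echo "usage: job_bench {data|run} job" >&2
            return 1
            ;;
    esac
}
```

Because the archive is fetched at benchmark time and skipped when already present, nothing from the dataset has to be checked into the repository, which is the point of the licensing remark above.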
Convert the dataset from csv.gz format to Parquet.
TPCH does something similar (it converts the tsv output of the tpch data generator to Parquet).
Thanks @doupache -- this sounds very cool
[x] Convert the dataset from csv.gz format to Parquet. (https://github.com/apache/datafusion/pull/12497)
[x] Add benchmark queries (115 queries)
[ ] Integrate the benchmark suite into dfbench.
Is your feature request related to a problem or challenge?
JOB (Join Order Benchmark) was proposed by a research team from TUM in the paper "How Good Are Query Optimizers, Really?".
It is also used in HyPer, DuckDB, and CedarDB. It is a good benchmark for testing join ordering and join operators. It is also part of DuckDB's regression test suite.
I think if we add this test suite, it will also help with improvements like those discussed in https://github.com/apache/datafusion/issues/7955.
Describe the solution you'd like
JOB uses the IMDB dataset, which is provided in csv.gz format and represents real-world data, making it well suited for testing DataFusion.
Tasks
Convert the dataset from csv.gz format to Parquet.
Integrate the benchmark suite into dfbench.
Once everything is set up, we will be able to easily run benchmarks using the following command:
I would like to work on this! Can someone help me understand the usual process for adding a third-party license in an Apache project?
cc @jayzhan211 @alamb
Describe alternatives you've considered
No response
Additional context
No response