Add IMDB queries (a.k.a. JOB - Join Order Benchmark) to DataFusion benchmark suite

doupache commented 2 months ago

Is your feature request related to a problem or challenge?

JOB (Join Order Benchmark) was proposed by a research team from TUM in the paper "How Good Are Query Optimizers, Really?".

It is used by HyPer, DuckDB, and CedarDB, and it is part of DuckDB's regression test suite. It is a good benchmark for testing join ordering and join operators.

I think if we add this test suite, it will also help with improvements like those discussed in https://github.com/apache/datafusion/issues/7955.

Describe the solution you'd like

JOB uses the IMDB dataset. The data is provided in csv.gz format and represents real-world data, which makes it well suited for testing DataFusion.

Tasks

[ ] Convert the dataset from csv.gz format to Parquet.
[ ] Add the IMDB license to the LICENSE.
[ ] Add benchmark queries.
[ ] Integrate the benchmark suite into dfbench.

Once everything is set up, we will be able to easily run benchmarks using the following command:

cargo run --bin dfbench -- --imdb --query=5

I would like to work on this! Can someone help me understand the usual process for adding a third-party license in an Apache project?

cc @jayzhan211 @alamb

Describe alternatives you've considered

No response

Additional context

No response

austin362667 commented 1 month ago

@doupache Thanks. Integrating the Join Order Benchmark looks promising. I look forward to taking on the follow-up tasks.

alamb commented 1 month ago

I think adding the join order benchmark would be reasonable.

Can someone help me understand the usual process for adding a third-party license in an Apache project?

I would personally recommend following the model of the other benchmarks and not trying to incorporate the files directly. Instead, download them on demand. If you do this, I don't think we need any licensing updates.

The benchmarking scripts are here: https://github.com/apache/datafusion/tree/main/benchmarks

I would recommend working on orchestrating the process using https://github.com/apache/datafusion/blob/main/benchmarks/bench.sh

So a benchmark session might look something like

bench.sh data job
bench.sh run job
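
To make the download-on-demand idea concrete, here is a rough sketch of how such a driver could be laid out. It is not taken from the real bench.sh: `data_job`, `run_job`, `DATA_DIR`, `IMDB_URL`, and the `imdb.tgz` file name are all hypothetical names chosen for illustration, and the run step is only a placeholder because the suite is not yet integrated into dfbench.

```shell
#!/usr/bin/env bash
# Sketch of a bench.sh-style driver for a hypothetical "job" benchmark.
# All names below (DATA_DIR, IMDB_URL, imdb.tgz, data_job, run_job) are
# assumptions for illustration, not part of the real bench.sh.
set -euo pipefail

DATA_DIR="${DATA_DIR:-./data/imdb}"
# Placeholder: set this to wherever the csv.gz archive is hosted.
IMDB_URL="${IMDB_URL:-}"

# Download the dataset only if it is not already on disk, so the
# files never need to be checked into the repository.
data_job() {
    local archive="${DATA_DIR}/imdb.tgz"
    mkdir -p "${DATA_DIR}"
    if [ -f "${archive}" ]; then
        echo "IMDB data already present at ${archive}, skipping download"
        return 0
    fi
    if [ -z "${IMDB_URL}" ]; then
        echo "IMDB_URL is not set" >&2
        return 1
    fi
    echo "Downloading ${IMDB_URL} to ${archive}"
    curl -fsSL -o "${archive}" "${IMDB_URL}"
}

# Placeholder for the run step; the real version would invoke dfbench
# once the benchmark suite is integrated.
run_job() {
    echo "Would run JOB queries against ${DATA_DIR}"
}

# Dispatch on "<command> <benchmark>", mirroring `bench.sh data job`.
main() {
    case "${1:-} ${2:-}" in
        "data job") data_job ;;
        "run job")  run_job ;;
        *) echo "usage: $0 {data|run} job" >&2; return 1 ;;
    esac
}

# Only dispatch when arguments are given, so the file can also be sourced.
[ "$#" -eq 0 ] || main "$@"
```

Because `data_job` returns early when the archive exists, running the data step repeatedly is cheap, and the repository itself never has to carry the IMDB files.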

Convert the dataset from csv.gz format to Parquet.

TPCH does something similar (it converts the tsv output of the tpch data generator to parquet).

alamb commented 1 month ago

Thanks @doupache -- this sounds very cool

doupache commented 1 month ago

[x] Convert the dataset from csv.gz format to Parquet. (https://github.com/apache/datafusion/pull/12497)
[x] Add benchmark queries (115 queries)
- [ ] part1 1a.sql ~ 12c.sql @austin362667
- [ ] part2 13a.sql ~ 24b.sql @pingsutw
- [ ] part3 25a.sql ~ 33c.sql @doupache
[ ] Integrate the benchmark suite into dfbench.
