duckdblabs / db-benchmark

reproducible benchmark of database-like ops
https://duckdblabs.github.io/db-benchmark/
Mozilla Public License 2.0

Include datafusion in the benchmark #5

Closed (kszlim closed this issue 9 months ago)

kszlim commented 1 year ago

Datafusion is another stateless query engine/dataframe library I'd be interested in seeing results for.

https://github.com/apache/arrow-datafusion

Tmonster commented 1 year ago

Hi Kevin, thanks for the suggestion!

I currently don't have much bandwidth to add a whole new solution to the benchmark, but if you want to open a PR that adds the necessary setup-datafusion.sh, ver-datafusion.sh, upg-datafusion.sh, groupby-datafusion.rs, and join-datafusion.rs, I'd be happy to review it. Take a look at the files in the other solution folders; they should give you a good idea of what is necessary. It may require more steps, though, since datafusion doesn't have any R or Python APIs, so you may also need to add or modify some files in _launcher and _helpers.

See repro.sh for steps to run the benchmark either locally or on an AWS instance. If no errors are thrown for the 0.5GB and 5GB datasets, I'd be happy to merge your PR and re-run the benchmark to include results for datafusion.
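As a rough illustration of the file list in the comment above, a contributor could check a local checkout before opening the PR. This is only a sketch: the five file names come from the comment, but the assumption that they live in a folder named after the solution (here `datafusion/`), and the helper name `missing_solution_files`, are hypothetical and not confirmed by the thread.

```python
from pathlib import Path

# File name patterns taken from the comment above.
# ASSUMPTION: they sit in a folder named after the solution, e.g. datafusion/.
REQUIRED_TEMPLATES = (
    "setup-{name}.sh",
    "ver-{name}.sh",
    "upg-{name}.sh",
    "groupby-{name}.rs",
    "join-{name}.rs",
)


def missing_solution_files(repo_root, name="datafusion"):
    """Return the listed file names that are absent from <repo_root>/<name>/.

    Hypothetical helper for illustration; it is not part of db-benchmark.
    """
    solution_dir = Path(repo_root) / name
    expected = [template.format(name=name) for template in REQUIRED_TEMPLATES]
    # Keep the original order so the report matches the list in the comment.
    return [f for f in expected if not (solution_dir / f).is_file()]


if __name__ == "__main__":
    missing = missing_solution_files(".")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All listed solution files are present.")
```

Run from the repository root, it only reports which of the listed files are absent. It does not cover any changes that might be needed in _launcher or _helpers, and it does not replace running repro.sh.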

kszlim commented 1 year ago

There is actually a Python API, though it's not well documented: https://github.com/apache/arrow-datafusion-python

If I have time, I'll try to port the benchmarks to it.

MrPowers commented 1 year ago

Looks like almost all of this work is done already: https://github.com/apache/arrow-datafusion/tree/main/benchmarks/db-benchmark

Would you like to open the PR, @kszlim, or would you like me to take a stab at it?

kszlim commented 1 year ago

Go ahead, I don't have the time!

Tmonster commented 1 year ago

@MrPowers I was just looking at this again. It looks like the db-benchmark code for datafusion is here now: https://github.com/apache/arrow-datafusion-python/tree/main/benchmarks/db-benchmark Would you still like to open a PR? Some of the files contain benchmark initialization setup that would need to be trimmed, but I don't think it would be much work.

hkpeaks commented 1 year ago

@kszlim I'm interested in including datafusion in the upcoming benchmark: https://github.com/duckdblabs/db-benchmark/issues/13#issuecomment-1578079219. Does it support streaming (the larger-than-memory scenario)?

kszlim commented 1 year ago

The Rust library does; I'm not sure whether the Python bindings expose it.

Dandandan commented 1 year ago

Closed by #18

kszlim commented 9 months ago

Thanks!