duckdblabs / db-benchmark

reproducible benchmark of database-like ops
https://duckdblabs.github.io/db-benchmark/
Mozilla Public License 2.0

Add DataFusion solution #18

Closed Dandandan closed 1 year ago

Dandandan commented 1 year ago

@Tmonster this is ready now :)

Tmonster commented 1 year ago

@Dandandan I was working the past week on adding a GitHub Action to run a mini benchmark (https://github.com/duckdblabs/db-benchmark/pull/20) anytime someone makes a pull request. I should be able to merge the GitHub Action script later today, then I'll take another look here.

Dandandan commented 1 year ago

Cool, thank you @Tmonster - it should be mostly done by now; the only remaining work is removing the for-loops and writing the queries out one by one.

Tmonster commented 1 year ago

Hi @Dandandan the GitHub actions script has been merged. If you merge with master, push the last changes and get a green check mark, this PR is good to merge!

Dandandan commented 1 year ago

> Hi @Dandandan the GitHub actions script has been merged. If you merge with master, push the last changes and get a green check mark, this PR is good to merge!

Awesome!

Dandandan commented 1 year ago

@Tmonster can we run it again?

Dandandan commented 1 year ago

@Tmonster CI passed 🥳

Tmonster commented 1 year ago

@Dandandan Great! I looked at some of the .err files myself. It seems like some of the queries were killed (groupby Q10, join Q4). I'm going to merge anyway, since most solutions don't pass all queries. Just wanted to let you know 👍

Dandandan commented 1 year ago

@Tmonster awesome 👍 let me know when results are published

jangorecki commented 1 year ago

@Dandandan do you think you could provide a DataFusion script for the rolling statistics task that is being developed in #9? I looked at the DataFusion docs and it seems to support those.

Dandandan commented 1 year ago

Yes, this would probably be possible using window functions
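
For illustration, a rolling mean per group would look roughly like the sketch below using DataFusion's SQL window functions. This is a minimal sketch rather than the benchmark script: the `datafusion` Python bindings, the file name `x.csv`, and the `id1`/`v1` columns are assumptions made for the example, and the actual rolling-statistics task in #9 may define its windows differently.

```python
# Minimal sketch of a rolling statistic via DataFusion window functions.
# Assumptions: the `datafusion` Python package is installed and "x.csv"
# has columns id1 (group key) and v1 (numeric value); both are illustrative.
from datafusion import SessionContext

ctx = SessionContext()
ctx.register_csv("x", "x.csv")

# 100-row rolling mean of v1 within each id1 group.
df = ctx.sql("""
    SELECT
        id1,
        v1,
        AVG(v1) OVER (
            PARTITION BY id1
            ORDER BY v1
            ROWS BETWEEN 99 PRECEDING AND CURRENT ROW
        ) AS rolling_mean_v1
    FROM x
""")
df.show()
```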

alamb commented 1 year ago

I noticed that https://duckdblabs.github.io/db-benchmark/ does not yet have the results of the DataFusion run

Perhaps due to jangorecki's comments

@jangorecki is there something datafusion specific that we can help with? I looked at the instructions on https://github.com/duckdblabs/db-benchmark#reproduce but I am not familiar with R

jangorecki commented 1 year ago

Not because of my comment - the duck's team said they are going to run the benchmark in September.

Ensuring it runs smoothly after the DataFusion merge is likely to reduce the waiting time. Otherwise the duck's team needs to debug and fix missing/incorrect pieces, and if that takes too much time, potentially exclude the problematic solution (this is what I used to do when I needed benchmark timings but didn't have time to fix breaking changes in some of the tools, although I ran the benchmark more frequently, so skipping a solution once was less of a problem).

The benchmark can be easily run on a laptop using only the 1e7 data sizes (config in _control/data.csv, column active). Steps to reproduce are included in the repo; knowledge of R is not necessary to run it.

Tmonster commented 1 year ago

Exactly what Jan said. DuckDB is planning a release for September 11. At that point we will run the benchmarks again for all solutions.

https://duckdb.org/dev/release-dates (DuckDB is working on making this more visible)

If you are wondering about how DataFusion will compare, you can take a look at reproducing the environment using the [regression.yml](https://github.com/duckdblabs/db-benchmark/blob/master/.github/workflows/regression.yml) file.

This can be run on most Amazon Ubuntu 22.04 boxes. Then you just need to generate the data and create the report. One thing I have noticed about DataFusion is that it is not finishing the benchmarks for join or groupby in the GitHub Actions: see https://github.com/duckdblabs/db-benchmark/actions/runs/5769687734/job/15641984997 under "validate benchmarks", where the DataFusion output doesn't print "[joining|grouping] finished, took X s" like the other solutions. This usually indicates that the process was killed by the OOM reaper.

@Dandandan maybe you would like to take a look at this as well. Since all of the other solutions complete, I imagine datafusion should be able to complete as well.

jangorecki commented 1 year ago

A few issues I spotted and mentioned in this PR have been fixed by me in #9, so if you want to run the whole benchmark, it may be easier to use #9 (or just cherry-pick those changes). #9 should be good to merge and runs fine on a laptop; next week I will run it on AWS and then confirm it is ready to merge.

Dandandan commented 1 year ago

Thanks @jangorecki @Tmonster for the comments

I checked that the DataFusion solution was runnable (and it passed CI) when implementing it, but I missed the R syntax. Thanks @jangorecki for fixing this in #9!

I probably won't have time to run & compare solutions yet, so I think we either wait for @Tmonster to run the new solution or someone else has to step up.

@Tmonster DataFusion 28 (Python bindings) was released yesterday, which improves memory usage in the case of high-cardinality grouping; maybe that resolves the OOM situation for DataFusion.

jangorecki commented 1 year ago

BTW, if one wants to stress memory usage, then the G1_1e7_2e0_0_0 (k=2) dataset can be used. AFAIK the duck's team runs only the default G1_1e7_1e2_0_0 (k=100). Multiple solutions failed (at 1e8 and 1e9 rows) with k=2 while passing with the default k=100.
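
As a rough local check of that memory pressure, a Q10-style "sum v3, count by id1:id6" grouping can be run against such a dataset with the DataFusion Python bindings. This is only a sketch under stated assumptions: it presumes `datafusion` >= 28 is installed and that `G1_1e7_2e0_0_0.csv` has been generated locally; it is not one of the benchmark's own scripts.

```python
# Sketch of a high-cardinality "sum v3, count by id1:id6" aggregation,
# the kind of grouping that stresses memory. Assumptions: the datafusion
# Python package (>= 28) is installed and G1_1e7_2e0_0_0.csv exists locally.
from datafusion import SessionContext

ctx = SessionContext()
ctx.register_csv("x", "G1_1e7_2e0_0_0.csv")

ans = ctx.sql("""
    SELECT id1, id2, id3, id4, id5, id6,
           SUM(v3) AS v3, COUNT(*) AS cnt
    FROM x
    GROUP BY id1, id2, id3, id4, id5, id6
""")

# collect() materializes the result; its peak memory use is what the
# OOM concerns in this thread are about.
batches = ans.collect()
print(sum(b.num_rows for b in batches))
```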

MrPowers commented 11 months ago

@jangorecki - so happy to see you contributing to this repo. Your work on this initiative has been so helpful to me over the years.

I am reading through this comment thread and it seems like we're good to go now and all issues have been resolved.

> DuckDB is planning a release for September 11

Do you know if this was released? Will DataFusion be included the next time the benchmarks are run?

jangorecki commented 11 months ago

@MrPowers no idea, I am unfortunately not associated with DuckDB. AFAIK the coming DuckDB release is a big milestone, so delays would be quite natural.