apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0

[Epic] Improved DataFusion Benchmarking #5505

Open alamb opened 1 year ago

alamb commented 1 year ago

Call to action:

Let's invest more effort in DataFusion benchmarking, both as a mechanism for technical evangelism and as a guide for actual performance improvements.

Background

We have several examples of performance “comparisons” showing DataFusion doing poorly against DuckDB or pola.rs that were really tests of how fast CSV or JSON parsing can go (this blog is one such example). Recent work should make these comparisons much more favorable in the future.

It is in the interest of all projects based on DataFusion to focus on their own users and use cases rather than having to explain why they are using supposedly "inferior" technology due to misleading benchmark results (for example recently on ClickBench – see https://github.com/apache/arrow-datafusion/issues/5276).

Improved benchmarking will not only help evangelize DataFusion; it will also directly guide the community’s optimization efforts.

Related Tickets

comphead commented 1 year ago

@alamb 5276 included twice

jackwener commented 1 year ago

> @alamb 5276 included twice

Thanks for the reminder, I have removed it.

alamb commented 1 year ago

Here is a proposed PR to orchestrate running the benchmarks: https://github.com/apache/arrow-datafusion/pull/6131