Call to action:
Let's invest more effort in DataFusion benchmarking, both as a mechanism for technical evangelism and as a guide for actual performance improvements.
Background
We have several examples of performance “comparisons” that show DataFusion doing poorly against DuckDB or pola.rs but that were really tests of how fast CSV or JSON parsing can go (this blog is one such example). Recent work should make these comparisons much more favorable in the future.
It is in the interest of all projects based on DataFusion to focus on their own users and use cases rather than having to explain why they are using supposedly "inferior" technology because of misleading benchmark results (a recent example is ClickBench; see https://github.com/apache/arrow-datafusion/issues/5276).
Of course, improved benchmarking will not only help evangelize DataFusion; it will also directly guide the community’s optimization efforts.
Related Tickets