apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0

(Slightly) Slower reported performance on ClickBench benchmarks in DataFusion 34.0.0 than DataFusion 33.0.0 #8836

Open alamb opened 5 months ago

alamb commented 5 months ago

Describe the bug

As part of https://github.com/apache/arrow-datafusion/issues/8789, @kmitchener ran the ClickBench benchmarks using DataFusion 34.0.0. Compared to DataFusion 33.0.0, the queries appear to run slightly slower.

I would like to know why the benchmark shows 34.0.0 running slightly slower.

To Reproduce

@kmitchener ran the v33 benchmarks on the same instance and modified the benchmark page so that it displays both 33 and 34 at the same time, which lets you compare the runs side by side: (image)

You can grab that from -> https://github.com/kmitchener/ClickBench/blob/new-run-of-datafusion-33/index.html

Expected behavior

Each release should be as good as or better than the last.

Additional context

No response

Dandandan commented 5 months ago

I wonder if this is really slower or it is just noise.

Note that the benchmark runs on c6a.4xlarge and EBS (gp2), which contribute to variations in performance (i.e. load from other users).

alamb commented 5 months ago

> I wonder if this is really slower or it is just noise.
>
> Note that the benchmark runs on c6a.4xlarge and EBS (gp2), which contribute to variations in performance (i.e. load from other users).

I wondered the same thing, but @kmitchener seems to have been able to reproduce the difference reliably: https://github.com/apache/arrow-datafusion/issues/8789#issuecomment-1883645578 🤔
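
One way to tell noise from a real slowdown is to repeat each query several times per version and check whether the change in the median exceeds the run-to-run spread. The snippet below is only a minimal sketch of that idea: the `compare_runs` helper, the 5% threshold, and the timings are all hypothetical and are not measurements from this issue or part of the ClickBench scripts.

```python
import statistics


def compare_runs(baseline, candidate, noise_threshold=0.05):
    """Compare two lists of wall-clock timings (seconds) for one query.

    Hypothetical helper: returns (relative change of the median,
    relative spread of the baseline, whether it looks like a regression).
    """
    base_med = statistics.median(baseline)
    cand_med = statistics.median(candidate)

    # Positive value means the candidate is slower than the baseline.
    rel_change = (cand_med - base_med) / base_med

    # Run-to-run spread of the baseline, relative to its median.
    base_spread = (max(baseline) - min(baseline)) / base_med

    # Flag a regression only if the slowdown is larger than both the
    # assumed fixed threshold and the observed baseline spread.
    is_regression = rel_change > max(noise_threshold, base_spread)
    return rel_change, base_spread, is_regression


# Made-up timings for a single query, in seconds.
v33 = [1.00, 1.02, 0.98, 1.01, 0.99]
v34 = [1.10, 1.12, 1.09, 1.11, 1.10]

rel_change, spread, regressed = compare_runs(v33, v34)
print(f"median change: {rel_change:+.1%}, baseline spread: {spread:.1%}, "
      f"regression: {regressed}")
```

On shared hardware such as EBS-backed instances, the spread term matters: a slowdown smaller than the baseline's own variation would not be flagged by this check.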

alamb commented 3 months ago

Update here is that we see the same small slowdown in version 36.

I was thinking perhaps it could be due to the overhead of reading/parsing per-file metadata. More details here: https://github.com/apache/arrow-datafusion/issues/9404#issuecomment-1986804684