apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.29k stars 973 forks source link

feat: Determine ordering of file groups #9593

Closed suremarc closed 1 month ago

suremarc commented 2 months ago

Which issue does this PR close?

Closes #7490 .

Rationale for this change

See details in #7490 - this feature helps DataFusion eliminate sorts when files can be shown to be non-overlapping in terms of min/max statistics.

What changes are included in this PR?

Are these changes tested?

Yes - there is a unit test and a sqllogictest.

Are there any user-facing changes?

Yes - there is a new optional statistics field in PartitionedFile, which is part of the proposal in #7490.

There is also the new FileScanConfig::sort_file_groups API

Dandandan commented 2 months ago

/benchmark

Dandandan commented 2 months ago

/benchmark

github-actions[bot] commented 2 months ago

Benchmark results

Benchmarks comparing 3d130915b418931696dbf682106fd3c751677c91 (main) and cca5f0fbc41b5fd58d7b9f2ce1cafccafa8599c9 (PR) ``` Comparing 3d13091 and cca5f0f Note: Skipping /home/runner/work/arrow-datafusion/arrow-datafusion/benchmarks/results/3d13091/tpch.json as /home/runner/work/arrow-datafusion/arrow-datafusion/benchmarks/results/cca5f0f/tpch.json does not exist ```
Dandandan commented 2 months ago

/benchmark

Dandandan commented 2 months ago

@suremarc sorry for the noise, just trying to run the benchmark command!

github-actions[bot] commented 2 months ago

Benchmark results

Benchmarks comparing 6e90f01c413a7b9178fbb0757c69e3d6754474be (main) and cca5f0fbc41b5fd58d7b9f2ce1cafccafa8599c9 (PR) ``` Comparing 6e90f01 and cca5f0f Note: Skipping /home/runner/work/arrow-datafusion/arrow-datafusion/benchmarks/results/6e90f01/tpch.json as /home/runner/work/arrow-datafusion/arrow-datafusion/benchmarks/results/cca5f0f/tpch.json does not exist ```
Dandandan commented 2 months ago

/benchmark

github-actions[bot] commented 2 months ago

Benchmark results

Benchmarks comparing 015cfe864e3b813d05f8bf513a127696ba28ec7f (main) and cca5f0fbc41b5fd58d7b9f2ce1cafccafa8599c9 (PR) ``` Comparing 015cfe8 and cca5f0f -------------------- Benchmark tpch.json -------------------- ┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━┓ ┃ Query ┃ 015cfe8 ┃ cca5f0f ┃ Change ┃ ┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━┩ │ QQuery 1 │ 433.61ms │ 437.25ms │ no change │ │ QQuery 2 │ 58.77ms │ 58.63ms │ no change │ │ QQuery 3 │ 143.19ms │ 141.75ms │ no change │ │ QQuery 4 │ 87.73ms │ 90.93ms │ no change │ │ QQuery 5 │ 196.32ms │ 198.59ms │ no change │ │ QQuery 6 │ 102.49ms │ 138.43ms │ 1.35x slower │ │ QQuery 7 │ 280.24ms │ 295.44ms │ 1.05x slower │ │ QQuery 8 │ 199.24ms │ 191.96ms │ no change │ │ QQuery 9 │ 292.84ms │ 299.26ms │ no change │ │ QQuery 10 │ 231.49ms │ 236.37ms │ no change │ │ QQuery 11 │ 62.28ms │ 62.52ms │ no change │ │ QQuery 12 │ 127.02ms │ 127.75ms │ no change │ │ QQuery 13 │ 188.92ms │ 184.02ms │ no change │ │ QQuery 14 │ 128.87ms │ 127.64ms │ no change │ │ QQuery 15 │ 188.04ms │ 198.36ms │ 1.05x slower │ │ QQuery 16 │ 52.05ms │ 51.22ms │ no change │ │ QQuery 17 │ 301.30ms │ 297.40ms │ no change │ │ QQuery 18 │ 437.03ms │ 431.55ms │ no change │ │ QQuery 19 │ 227.47ms │ 228.29ms │ no change │ │ QQuery 20 │ 189.35ms │ 191.53ms │ no change │ │ QQuery 21 │ 323.42ms │ 327.57ms │ no change │ │ QQuery 22 │ 54.84ms │ 55.18ms │ no change │ └──────────────┴──────────┴──────────┴──────────────┘ ┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓ ┃ Benchmark Summary ┃ ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩ │ Total Time (015cfe8) │ 4306.50ms │ │ Total Time (cca5f0f) │ 4371.63ms │ │ Average Time (015cfe8) │ 195.75ms │ │ Average Time (cca5f0f) │ 198.71ms │ │ Queries Faster │ 0 │ │ Queries Slower │ 3 │ │ Queries with No Change │ 19 │ └────────────────────────┴───────────┘ ```
Dandandan commented 2 months ago

Running /benchmark once more to see if previous run was noise

gruuya commented 2 months ago

Fwiw, I've noticed that it seem to have a minor systematic bias against the PR results in general (even if e.g. there are literally no code changes in it). This manifests like above, by reporting a couple of queries being marginally slower.

I think this will be alleviated once a dedicated runner is used, and/or when more benchmark types are included.

github-actions[bot] commented 2 months ago

Benchmark results

Benchmarks comparing 015cfe864e3b813d05f8bf513a127696ba28ec7f (main) and cca5f0fbc41b5fd58d7b9f2ce1cafccafa8599c9 (PR) ``` Comparing 015cfe8 and cca5f0f -------------------- Benchmark tpch.json -------------------- ┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━┓ ┃ Query ┃ 015cfe8 ┃ cca5f0f ┃ Change ┃ ┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━┩ │ QQuery 1 │ 433.40ms │ 444.84ms │ no change │ │ QQuery 2 │ 58.23ms │ 59.08ms │ no change │ │ QQuery 3 │ 142.62ms │ 143.21ms │ no change │ │ QQuery 4 │ 88.02ms │ 87.36ms │ no change │ │ QQuery 5 │ 195.23ms │ 196.83ms │ no change │ │ QQuery 6 │ 103.54ms │ 103.05ms │ no change │ │ QQuery 7 │ 279.11ms │ 277.24ms │ no change │ │ QQuery 8 │ 200.55ms │ 198.16ms │ no change │ │ QQuery 9 │ 296.86ms │ 291.54ms │ no change │ │ QQuery 10 │ 234.67ms │ 323.08ms │ 1.38x slower │ │ QQuery 11 │ 63.09ms │ 89.66ms │ 1.42x slower │ │ QQuery 12 │ 125.98ms │ 127.29ms │ no change │ │ QQuery 13 │ 178.71ms │ 180.05ms │ no change │ │ QQuery 14 │ 127.44ms │ 129.93ms │ no change │ │ QQuery 15 │ 186.88ms │ 191.55ms │ no change │ │ QQuery 16 │ 52.93ms │ 49.94ms │ +1.06x faster │ │ QQuery 17 │ 310.79ms │ 291.79ms │ +1.07x faster │ │ QQuery 18 │ 442.30ms │ 447.07ms │ no change │ │ QQuery 19 │ 224.82ms │ 228.82ms │ no change │ │ QQuery 20 │ 189.15ms │ 187.59ms │ no change │ │ QQuery 21 │ 325.20ms │ 329.43ms │ no change │ │ QQuery 22 │ 53.58ms │ 55.04ms │ no change │ └──────────────┴──────────┴──────────┴───────────────┘ ┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓ ┃ Benchmark Summary ┃ ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩ │ Total Time (015cfe8) │ 4313.09ms │ │ Total Time (cca5f0f) │ 4432.56ms │ │ Average Time (015cfe8) │ 196.05ms │ │ Average Time (cca5f0f) │ 201.48ms │ │ Queries Faster │ 2 │ │ Queries Slower │ 2 │ │ Queries with No Change │ 18 │ └────────────────────────┴───────────┘ ```
Dandandan commented 2 months ago

Yeah seems noise, different queries running slower in second run.

NGA-TRAN commented 2 months ago

I will review this either today or tomorrow

suremarc commented 2 months ago

@alamb @NGA-TRAN

Thanks for the swift action - I guess I'll go ahead and mark this as ready for review since it's being reviewed, haha.

I'm still planning to add an end-to-end test with ListingTable but just haven't figured out how to do it yet

NGA-TRAN commented 2 months ago

Sorry I could not get to this today. I will try to review it tomorrow/later this week

NGA-TRAN commented 2 months ago

I am reviewing this now

suremarc commented 2 months ago

@NGA-TRAN I added a sqllogictest and fixed some bugs in the implementation. CI seems to be passing now

NGA-TRAN commented 2 months ago

Thanks @suremarc . I am still in the middle of the review. Sorry it takes more time. So far the code looks really good

suremarc commented 2 months ago

@alamb I won't be able to get to this until tomorrow, just letting you know

alamb commented 2 months ago

@alamb I won't be able to get to this until tomorrow, just letting you know

No worries - thanks for doing it @suremarc -- I am traveling this week so I won't have time to review until later this week.

Thank you so much

alamb commented 2 months ago

I think this PR is quite nice and it would be great if we could get the tests written and the code polished up.

Marking as draft as I think this PR is no longer waiting on feedback. Please mark it as ready for review when it is ready for another look

alamb commented 1 month ago

@NGA-TRAN do you have time to review this PR as well?

alamb commented 1 month ago

No worries @suremarc -- I am very excited about this PR. I plan to review it sometime this week (hopefully later today)

NGA-TRAN commented 1 month ago

I will review this either today or tomorrow

suremarc commented 1 month ago

I'm not sure why changing a comment caused the tests to start failing.... oof.

alamb commented 1 month ago

Added API change label as I it adds a new field to PartitionedFile

suremarc commented 1 month ago

@alamb I added a config value, and I moved MinMaxStatistics to its own module as requested. I wasn't sure if I should delay addressing your feedback on tests to the next PR, since it seems like the suggested plan is to merge this PR first.

alamb commented 1 month ago

@alamb I added a config value, and I moved MinMaxStatistics to its own module as requested. I wasn't sure if I should delay addressing your feedback on tests to the next PR, since it seems like the suggested plan is to merge this PR first.

Sorry -- sounds good. I am going to give this PR another look and file some follow on tickets.

alamb commented 1 month ago

Filed https://github.com/apache/datafusion/issues/10336 to track enable this flag by default