Closed suremarc closed 1 month ago
/benchmark
/benchmark
/benchmark
@suremarc sorry for the noise, just trying to run the benchmark command!
/benchmark
Running /benchmark once more to see if previous run was noise
Fwiw, I've noticed that it seems to have a minor systematic bias against the PR results in general (even if e.g. there are literally no code changes in it). This manifests as above, with a couple of queries reported as marginally slower.
I think this will be alleviated once a dedicated runner is used, and/or when more benchmark types are included.
Yeah seems noise, different queries running slower in second run.
I will review this either today or tomorrow
@alamb @NGA-TRAN
Thanks for the swift action - I guess I'll go ahead and mark this as ready for review since it's being reviewed, haha.
I'm still planning to add an end-to-end test with ListingTable but just haven't figured out how to do it yet
Sorry I could not get to this today. I will try to review it tomorrow/later this week
I am reviewing this now
@NGA-TRAN I added a sqllogictest and fixed some bugs in the implementation. CI seems to be passing now
Thanks @suremarc . I am still in the middle of the review. Sorry it takes more time. So far the code looks really good
@alamb I won't be able to get to this until tomorrow, just letting you know
No worries - thanks for doing it @suremarc -- I am traveling this week so I won't have time to review until later this week.
Thank you so much
I think this PR is quite nice and it would be great if we could get the tests written and the code polished up.
Marking as draft as I think this PR is no longer waiting on feedback. Please mark it as ready for review when it is ready for another look
@NGA-TRAN do you have time to review this PR as well?
No worries @suremarc -- I am very excited about this PR. I plan to review it sometime this week (hopefully later today)
I will review this either today or tomorrow
I'm not sure why changing a comment caused the tests to start failing.... oof.
Added API change label as it adds a new field to PartitionedFile
@alamb I added a config value, and I moved MinMaxStatistics to its own module as requested. I wasn't sure if I should delay addressing your feedback on tests to the next PR, since it seems like the suggested plan is to merge this PR first.
Sorry -- sounds good. I am going to give this PR another look and file some follow on tickets.
Filed https://github.com/apache/datafusion/issues/10336 to track enabling this flag by default
Which issue does this PR close?
Closes #7490.
Rationale for this change
See details in #7490 - this feature helps DataFusion eliminate sorts when files can be shown to be non-overlapping in terms of min/max statistics.
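To make the non-overlap idea concrete, here is a minimal sketch of grouping files by min/max range so that each group is sorted and contains no overlapping files. It is not the code from this PR: `FileRange` and `group_non_overlapping` are hypothetical names, a single `i64` sort column stands in for real file statistics, first-fit is just one possible packing strategy, and treating ranges that touch at a boundary as overlapping is an assumption.

```rust
/// Hypothetical stand-in for one file's min/max statistics on the sort column.
#[derive(Debug, Clone, PartialEq)]
struct FileRange {
    name: &'static str,
    min: i64,
    max: i64,
}

/// Greedy first-fit grouping (an illustrative choice, not necessarily the PR's):
/// every group ends up ordered by `min`, and no two files in a group overlap.
fn group_non_overlapping(mut files: Vec<FileRange>) -> Vec<Vec<FileRange>> {
    // Visit files in ascending order of their minimum value.
    files.sort_by_key(|f| f.min);

    let mut groups: Vec<Vec<FileRange>> = Vec::new();
    for file in files {
        // Find the first group whose last file ends strictly before this one starts.
        // Assumption: equal boundary values count as overlapping.
        let slot = groups
            .iter()
            .position(|g| g.last().map_or(true, |last| last.max < file.min));
        match slot {
            Some(i) => groups[i].push(file),
            // No existing group can take the file, so open a new one.
            None => groups.push(vec![file]),
        }
    }
    groups
}
```

Under these assumptions, files covering 0..10, 5..15, 11..20 and 16..30 would end up in two groups, each of which can be read in order without an extra sort.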
What changes are included in this PR?
- FileScanConfig::sort_file_groups method that distributes files via a bin packing algorithm, ensuring that no two files have overlapping statistics
- FileScanConfig::project checks if file groups are sorted when determining projected output orderings
- MinMaxStatistics struct that uses the Arrow Row API to efficiently sort & compare file statistics.
Are these changes tested?
Yes - there is a unit test and a sqllogictest.
Are there any user-facing changes?
Yes - there is a new optional statistics field in PartitionedFile, which is part of the proposal in #7490.
There is also the new FileScanConfig::sort_file_groups API.