Closed delamarch3 closed 1 week ago
I am running the benchmarks on this PR to verify. Thank you @delamarch3
I apologize for being denise @delamarch3 -- but I spent a while trying to find the benchmarks you are running and I couldn't figure out which they were. Is it the filter_kernels
?
@alamb Sorry, I wrote a separate benchmark for this but I didn't commit it, it's not consistent with the results it returns on each run (the odd +/-5%) so I thought it needed work but there was enough of a difference to compare. I can add it into this PR though?
I'll try to add one into filter_kernels
I'll try to add one into
filter_kernels
Thank you -- if you could do so as a separate PR that would be most helpful (so it is easy to compare with these changes) 🙏
I've run the filter_kernel
benchmark I added for the run array in https://github.com/apache/arrow-rs/pull/6706 with the different approaches, here are the results I get:
for pred in filter_values
.iter()
.skip(start as usize)
.take((end - start) as usize)
{
count += R::Native::from(pred);
keep |= pred
}
Benchmarking filter run array (kept 1/2): Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 52.1s, or reduce sample count to 10.
filter run array (kept 1/2)
time: [542.98 ms 549.50 ms 556.59 ms]
Found 10 outliers among 100 measurements (10.00%)
7 (7.00%) high mild
3 (3.00%) high severe
Benchmarking filter run array high selectivity (kept 1023/1024): Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 54.3s, or reduce sample count to 10.
Benchmarking filter run array high selectivity (kept 1023/1024): Collecting 100 samples in estimated 54.256 s (100 iterations
filter run array high selectivity (kept 1023/1024)
time: [550.25 ms 555.80 ms 561.74 ms]
Found 4 outliers among 100 measurements (4.00%)
4 (4.00%) high mild
Benchmarking filter run array low selectivity (kept 1/1024): Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 53.5s, or reduce sample count to 10.
filter run array low selectivity (kept 1/1024)
time: [536.14 ms 540.44 ms 545.14 ms]
Found 11 outliers among 100 measurements (11.00%)
6 (6.00%) high mild
5 (5.00%) high severe
for _ in start..end {
if let Some(pred) = preds.next() {
count += R::Native::from(pred);
keep |= pred
}
}
filter run array (kept 1/2)
time: [598.70 µs 601.93 µs 605.25 µs]
change: [-99.892% -99.890% -99.889%] (p = 0.00 < 0.05)
Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
3 (3.00%) high severe
Benchmarking filter run array high selectivity (kept 1023/1024): Collecting 100 samples in estimated 6.0573 s (15k iterations
filter run array high selectivity (kept 1023/1024)
time: [386.55 µs 388.17 µs 389.91 µs]
change: [-99.931% -99.930% -99.929%] (p = 0.00 < 0.05)
Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
2 (2.00%) high mild
2 (2.00%) high severe
filter run array low selectivity (kept 1/1024)
time: [239.93 µs 240.46 µs 241.04 µs]
change: [-99.956% -99.955% -99.955%] (p = 0.00 < 0.05)
Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
6 (6.00%) high mild
6 (6.00%) high severe
These two are similar but after running a few times the low selectivity benchmark seems slightly faster in this one
end -= end.saturating_sub(filter_values.len() as u64);
for pred in (start..end).map(|i| unsafe { filter_values.value_unchecked(i as usize) }) {
count += R::Native::from(pred);
keep |= pred
}
filter run array (kept 1/2)
time: [581.12 µs 584.01 µs 586.90 µs]
change: [-2.5195% -1.1178% +0.1036%] (p = 0.11 > 0.05)
No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
3 (3.00%) high mild
2 (2.00%) high severe
Benchmarking filter run array high selectivity (kept 1023/1024): Collecting 100 samples in estimated 5.5900 s (15k iterations
filter run array high selectivity (kept 1023/1024)
time: [359.79 µs 361.40 µs 363.47 µs]
change: [-7.7904% -5.5816% -3.1503%] (p = 0.00 < 0.05)
Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
3 (3.00%) high mild
11 (11.00%) high severe
filter run array low selectivity (kept 1/1024)
time: [209.87 µs 210.45 µs 211.09 µs]
change: [-13.950% -13.255% -12.616%] (p = 0.00 < 0.05)
Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
3 (3.00%) high mild
3 (3.00%) high severe
for _ in start..end {
if let Some(pred) = preds.next() {
count += R::Native::from(pred);
keep |= pred
}
}
Maybe you could try something like (two minor changes)?
for pred in preds.take(end-start) {
count += R::Native::from(pred);
}
let keep = count > 0;
I think you can leave out the saturating_sub
as well as long as you're not using unsafe? Otherwise you can try the unckeched variants as well for iter
I'm not sure calling take
on each iteration will work because it takes ownership of the Iterator
so it needs to be reassigned on each run_ends
iteration, but then it will always take on the first n elements, unless you had something else in mind? I did try BitIterator::new(filter_values.values(), start, end - start)
to try achieve something similar but It turned out to be slower.
I had to create a new variable with the count for assigning keep
after the loop, ie let keep = count > count_before
, for the algorithm to still be correct, but this was also slower.
For me looks good - I feel somehow there probably should be more performance on the table.
I believe the reason is BitIterator
is not faster is that currently it's doing the same as filter_values.value_unchecked
so for using the iterator it is only the extra cost of the bounds check.
I think it should be possible to improve BitIterator
(use bitshift rather than using the current index) and use unwrap_unchecked()
to be slightly more performant than the current solution.
I'm not sure calling
take
on each iteration will work because it takes ownership of theIterator
so it needs to be reassigned on eachrun_ends
iteration, but then it will always take on the first n elements, unless you had something else in mind? I did tryBitIterator::new(filter_values.values(), start, end - start)
to try achieve something similar but It turned out to be slower.
Of course, you're right.
Thanks @delamarch3
Which issue does this PR close?
Related to https://github.com/apache/arrow-rs/pull/6675
Rationale for this change
After running some benchmarks it turned out that the changes introduced in my previous PR were quite slow. This should bring it back closer to how it was before.
What changes are included in this PR?
I subtract the difference from
end
to keep it in bounds.Are there any user-facing changes?
No