apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0

Undo run end filter performance regression #6691

Closed delamarch3 closed 1 week ago

delamarch3 commented 2 weeks ago

Which issue does this PR close?

Related to https://github.com/apache/arrow-rs/pull/6675

Rationale for this change

After running some benchmarks, it turned out that the changes introduced in my previous PR were quite slow. This should bring performance back closer to where it was before.

run_ends_len = 64       time:   [3.9878 µs 4.0035 µs 4.0204 µs]
                        change: [-1.3684% -0.4485% +0.5210%] (p = 0.34 > 0.05)
                        No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) high mild
  4 (4.00%) high severe

run_ends_len = 512      time:   [56.654 µs 57.076 µs 57.458 µs]
                        change: [-1.7583% -1.0124% -0.2300%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

run_ends_len = 1024     time:   [151.37 µs 151.83 µs 152.34 µs]
                        change: [-4.5739% -3.1197% -1.6545%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  6 (6.00%) high severe

run_ends_len = 64       time:   [2.7373 µs 2.8041 µs 2.8841 µs]
                        change: [-32.545% -31.650% -30.574%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  6 (6.00%) high mild
  8 (8.00%) high severe

run_ends_len = 512      time:   [4.5291 µs 4.5413 µs 4.5548 µs]
                        change: [-91.973% -91.902% -91.821%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe

run_ends_len = 1024     time:   [6.3687 µs 6.4206 µs 6.4869 µs]
                        change: [-95.802% -95.750% -95.693%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe

What changes are included in this PR?

I subtract the overshoot (the amount by which end runs past the length of the filter values) from end to keep it in bounds, as sketched below.
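
Concretely, the fix is the one-line clamp that also appears in the third variant benchmarked further down:

// Subtract any overshoot past filter_values.len() from end, so the
// unchecked reads that follow stay within the filter buffer.
end -= end.saturating_sub(filter_values.len() as u64);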

Are there any user-facing changes?

No

alamb commented 1 week ago

I am running the benchmarks on this PR to verify. Thank you @delamarch3

alamb commented 1 week ago

I apologize for being dense @delamarch3 -- but I spent a while trying to find the benchmarks you are running and I couldn't figure out which they were. Is it the filter_kernels benchmark?

delamarch3 commented 1 week ago

@alamb Sorry, I wrote a separate benchmark for this but didn't commit it. It's not consistent in the results it returns on each run (the odd +/-5%), so I thought it needed more work, but there was enough of a difference to compare. I could add it into this PR though?

delamarch3 commented 1 week ago

I'll try to add one into filter_kernels

alamb commented 1 week ago

> I'll try to add one into filter_kernels

Thank you -- if you could do so as a separate PR that would be most helpful (so it is easy to compare with these changes) 🙏

delamarch3 commented 1 week ago

I've run the filter_kernels benchmark I added for the run array in https://github.com/apache/arrow-rs/pull/6706 with the different approaches; here are the results I get:

// Approach 1: restart a skip/take pass over the filter values for every run
for pred in filter_values
    .iter()
    .skip(start as usize)
    .take((end - start) as usize)
{
    count += R::Native::from(pred);
    keep |= pred
}
Benchmarking filter run array (kept 1/2): Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 52.1s, or reduce sample count to 10.
filter run array (kept 1/2)
                        time:   [542.98 ms 549.50 ms 556.59 ms]
Found 10 outliers among 100 measurements (10.00%)
  7 (7.00%) high mild
  3 (3.00%) high severe

Benchmarking filter run array high selectivity (kept 1023/1024): Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 54.3s, or reduce sample count to 10.
Benchmarking filter run array high selectivity (kept 1023/1024): Collecting 100 samples in estimated 54.256 s (100 iterations)
filter run array high selectivity (kept 1023/1024)
                        time:   [550.25 ms 555.80 ms 561.74 ms]
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild

Benchmarking filter run array low selectivity (kept 1/1024): Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 53.5s, or reduce sample count to 10.
filter run array low selectivity (kept 1/1024)
                        time:   [536.14 ms 540.44 ms 545.14 ms]
Found 11 outliers among 100 measurements (11.00%)
  6 (6.00%) high mild
  5 (5.00%) high severe

// Approach 2: advance a single shared iterator (preds) one element per row
for _ in start..end {
    if let Some(pred) = preds.next() {
        count += R::Native::from(pred);
        keep |= pred
    }
}
filter run array (kept 1/2)
                        time:   [598.70 µs 601.93 µs 605.25 µs]
                        change: [-99.892% -99.890% -99.889%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high severe

Benchmarking filter run array high selectivity (kept 1023/1024): Collecting 100 samples in estimated 6.0573 s (15k iterations)
filter run array high selectivity (kept 1023/1024)
                        time:   [386.55 µs 388.17 µs 389.91 µs]
                        change: [-99.931% -99.930% -99.929%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

filter run array low selectivity (kept 1/1024)
                        time:   [239.93 µs 240.46 µs 241.04 µs]
                        change: [-99.956% -99.955% -99.955%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  6 (6.00%) high mild
  6 (6.00%) high severe

These two are similar, but after running it a few times the low selectivity benchmark seems slightly faster with this one:

// Approach 3: clamp end into bounds, then read the filter values
// directly with unchecked indexing
end -= end.saturating_sub(filter_values.len() as u64);
for pred in (start..end).map(|i| unsafe { filter_values.value_unchecked(i as usize) }) {
    count += R::Native::from(pred);
    keep |= pred
}
filter run array (kept 1/2)
                        time:   [581.12 µs 584.01 µs 586.90 µs]
                        change: [-2.5195% -1.1178% +0.1036%] (p = 0.11 > 0.05)
                        No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

Benchmarking filter run array high selectivity (kept 1023/1024): Collecting 100 samples in estimated 5.5900 s (15k iterations)
filter run array high selectivity (kept 1023/1024)
                        time:   [359.79 µs 361.40 µs 363.47 µs]
                        change: [-7.7904% -5.5816% -3.1503%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  3 (3.00%) high mild
  11 (11.00%) high severe

filter run array low selectivity (kept 1/1024)
                        time:   [209.87 µs 210.45 µs 211.09 µs]
                        change: [-13.950% -13.255% -12.616%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

Dandandan commented 1 week ago

for _ in start..end {
    if let Some(pred) = preds.next() {
        count += R::Native::from(pred);
        keep |= pred
    }
}

Maybe you could try something like this (two minor changes)?

    for pred in preds.take(end-start) {
        count += R::Native::from(pred);
    }
    let keep = count > 0;

I think you can leave out the saturating_sub as well, as long as you're not using unsafe? Otherwise you could try the unchecked variants for the iterator as well.

delamarch3 commented 1 week ago

I'm not sure calling take on each iteration will work because it takes ownership of the iterator, so it would need to be reassigned on each run_ends iteration, but then it would always take the first n elements, unless you had something else in mind? I did try BitIterator::new(filter_values.values(), start, end - start) to achieve something similar, but it turned out to be slower.

I also had to create a new variable holding the count from before the loop so that keep could be assigned after it, i.e. let keep = count > count_before, for the algorithm to still be correct, but this was also slower.
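
For reference, a minimal sketch of that variant, assuming the count, keep, and preds variables from the snippets above (count_before is the extra variable described here):

// Remember the count before this run, consume the run's predicates from
// the shared iterator, then derive keep from whether count grew.
let count_before = count;
for _ in start..end {
    if let Some(pred) = preds.next() {
        count += R::Native::from(pred);
    }
}
let keep = count > count_before;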

Dandandan commented 1 week ago

Looks good to me - I feel there is probably still more performance on the table somehow.

I believe the reason BitIterator is not faster is that it currently does the same thing as filter_values.value_unchecked, so using the iterator only adds the extra cost of the bounds check.

I think it should be possible to improve BitIterator (using a bitshift rather than the current index) and to use unwrap_unchecked() to make it slightly more performant than the current solution.
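
A rough sketch of the bitshift idea, assuming an LSB-ordered bitmap like Arrow's (ShiftBitIter is a hypothetical illustration, not the actual arrow-buffer BitIterator, and it ignores a non-zero start offset): keep the current byte in a register and shift one bit out per call, instead of recomputing the byte and mask from a running index.

struct ShiftBitIter<'a> {
    buf: &'a [u8],
    byte: u8,         // current byte; consumed bits are shifted out
    bits_left: usize, // unconsumed bits remaining in byte
    pos: usize,       // index of the next byte to load from buf
    remaining: usize, // total bits left to yield
}

impl<'a> ShiftBitIter<'a> {
    fn new(buf: &'a [u8], len: usize) -> Self {
        Self { buf, byte: 0, bits_left: 0, pos: 0, remaining: len }
    }
}

impl<'a> Iterator for ShiftBitIter<'a> {
    type Item = bool;

    fn next(&mut self) -> Option<bool> {
        if self.remaining == 0 {
            return None;
        }
        if self.bits_left == 0 {
            self.byte = self.buf[self.pos]; // load the next byte once per 8 bits
            self.pos += 1;
            self.bits_left = 8;
        }
        // LSB-ordered bitmap: the next bit is the low bit of the register
        let bit = self.byte & 1 != 0;
        self.byte >>= 1;
        self.bits_left -= 1;
        self.remaining -= 1;
        Some(bit)
    }
}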

Dandandan commented 1 week ago

> I'm not sure calling take on each iteration will work because it takes ownership of the iterator, so it would need to be reassigned on each run_ends iteration, but then it would always take the first n elements, unless you had something else in mind? I did try BitIterator::new(filter_values.values(), start, end - start) to achieve something similar, but it turned out to be slower.

Of course, you're right.

Dandandan commented 1 week ago

Thanks @delamarch3