apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0
2.62k stars 802 forks source link

Check predicate and values are the same length for run end array filter safety #6675

Closed delamarch3 closed 2 weeks ago

delamarch3 commented 2 weeks ago

Which issue does this PR close?

Closes #6569

Rationale for this change

It's possible for a filter predicate with less values than the target array to be passed into filter_run_end_array which will result in an out of bounds index on the filter values, as described in the issue.

What changes are included in this PR?

I thought that it would be inconsistent with the other filters to have this one return an error if the lengths do not match. Eg for an int or string array filter if there are less filter values than array values then those extra values are filtered out. I replaced the unchecked filter value with a skip/take iterator.

Are there any user-facing changes?

An error is returned if the predicate and values do not have the same length No

tustvold commented 2 weeks ago

There is a potential that this regresses performance slightly, however, I don't believe much effort has been expended optimising this kernel for RunArray, especially given there are no benchmarks, so this is probably fine.

delamarch3 commented 2 weeks ago

Thanks @tustvold. I ran some benchmarks on this and it turns out it is quite a bit slower. I've changed it slightly so it just subtracts the difference which seems to have brought it back closer to where it was, I can open a new PR? Here's the benchmark results:

This PR:

run_ends_len = 64       time:   [3.9878 µs 4.0035 µs 4.0204 µs]
                        change: [-1.3684% -0.4485% +0.5210%] (p = 0.34 > 0.05)
                        No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) high mild
  4 (4.00%) high severe

run_ends_len = 512      time:   [56.654 µs 57.076 µs 57.458 µs]
                        change: [-1.7583% -1.0124% -0.2300%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

run_ends_len = 1024     time:   [151.37 µs 151.83 µs 152.34 µs]
                        change: [-4.5739% -3.1197% -1.6545%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  6 (6.00%) high severe

My updates:

run_ends_len = 64       time:   [2.7373 µs 2.8041 µs 2.8841 µs]
                        change: [-32.545% -31.650% -30.574%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  6 (6.00%) high mild
  8 (8.00%) high severe

run_ends_len = 512      time:   [4.5291 µs 4.5413 µs 4.5548 µs]
                        change: [-91.973% -91.902% -91.821%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe

run_ends_len = 1024     time:   [6.3687 µs 6.4206 µs 6.4869 µs]
                        change: [-95.802% -95.750% -95.693%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe