apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0
2.63k stars 803 forks source link

Speed up `filter_bytes` #6699

Closed Dandandan closed 2 weeks ago

Dandandan commented 2 weeks ago

Which issue does this PR close?

MutableBuffer doesn't always generate optimal code compared to using Vec. I think the main performance difference might be in MutableBuffer::push (but not sure if that explains all the difference).

filter context string (kept 1/2)
                        time:   [385.65 µs 394.58 µs 405.71 µs]
                        change: [-52.835% -51.497% -50.056%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 16 outliers among 100 measurements (16.00%)
  13 (13.00%) high mild
  3 (3.00%) high severe

Benchmarking filter context string high selectivity (kept 1023/1024): Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.9s, enable flat sampling, or reduce sample count to 60.
filter context string high selectivity (kept 1023/1024)
                        time:   [1.1102 ms 1.1526 ms 1.1964 ms]
                        change: [-11.666% -8.8390% -5.7644%] (p = 0.00 < 0.05)
                        Performance has improved.

filter context string low selectivity (kept 1/1024)
                        time:   [979.52 ns 987.11 ns 994.64 ns]
                        change: [-19.093% -18.289% -17.438%] (p = 0.00 < 0.05)
                        Performance has improved.

Closes #.

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

alamb commented 2 weeks ago

I also ran the benchmarks and see the same improvements 👍

++ critcmp master speedup_filter
group                                                                         master                                 speedup_filter
-----                                                                         ------                                 --------------
filter context mixed string view (kept 1/2)                                   1.00    674.8±6.62µs        ? ?/sec    1.01    679.6±6.81µs        ? ?/sec
filter context mixed string view high selectivity (kept 1023/1024)            1.00   1196.4±2.54µs        ? ?/sec    1.02   1217.7±3.17µs        ? ?/sec
filter context mixed string view low selectivity (kept 1/1024)                1.01   1172.9±0.95ns        ? ?/sec    1.00   1164.3±2.62ns        ? ?/sec
filter context short string view (kept 1/2)                                   1.00    442.1±2.81µs        ? ?/sec    1.01    447.8±4.74µs        ? ?/sec
filter context short string view high selectivity (kept 1023/1024)            1.00    783.1±1.14µs        ? ?/sec    1.01    790.9±3.73µs        ? ?/sec
filter context short string view low selectivity (kept 1/1024)                1.00    936.5±1.46ns        ? ?/sec    1.01    947.4±5.04ns        ? ?/sec
filter context string (kept 1/2)                                              4.86      3.8±0.07ms        ? ?/sec    1.00   774.1±21.31µs        ? ?/sec
filter context string dictionary (kept 1/2)                                   1.00    244.2±1.43µs        ? ?/sec    1.00    243.6±1.24µs        ? ?/sec
filter context string dictionary high selectivity (kept 1023/1024)            1.00    248.9±1.50µs        ? ?/sec    1.00    248.1±0.92µs        ? ?/sec
filter context string dictionary low selectivity (kept 1/1024)                1.00    202.0±0.50µs        ? ?/sec    1.00    201.5±0.74µs        ? ?/sec
filter context string dictionary w NULLs (kept 1/2)                           1.00    341.8±0.75µs        ? ?/sec    1.00    342.5±0.93µs        ? ?/sec
filter context string dictionary w NULLs high selectivity (kept 1023/1024)    1.00    466.3±0.75µs        ? ?/sec    1.01    469.3±1.23µs        ? ?/sec
filter context string dictionary w NULLs low selectivity (kept 1/1024)        1.00    101.1±0.22µs        ? ?/sec    1.00    100.8±0.23µs        ? ?/sec
filter context string high selectivity (kept 1023/1024)                       1.63      2.5±0.06ms        ? ?/sec    1.00  1523.2±31.32µs        ? ?/sec
filter context string low selectivity (kept 1/1024)                           1.23   1993.4±3.93ns        ? ?/sec    1.00   1614.9±4.07ns        ? ?/sec
alamb commented 2 weeks ago

Thanks @Dandandan and @jayzhan211

Dandandan commented 2 weeks ago

I also ran the benchmarks and see the same improvements 👍

++ critcmp master speedup_filter
group                                                                         master                                 speedup_filter
-----                                                                         ------                                 --------------
filter context mixed string view (kept 1/2)                                   1.00    674.8±6.62µs        ? ?/sec    1.01    679.6±6.81µs        ? ?/sec
filter context mixed string view high selectivity (kept 1023/1024)            1.00   1196.4±2.54µs        ? ?/sec    1.02   1217.7±3.17µs        ? ?/sec
filter context mixed string view low selectivity (kept 1/1024)                1.01   1172.9±0.95ns        ? ?/sec    1.00   1164.3±2.62ns        ? ?/sec
filter context short string view (kept 1/2)                                   1.00    442.1±2.81µs        ? ?/sec    1.01    447.8±4.74µs        ? ?/sec
filter context short string view high selectivity (kept 1023/1024)            1.00    783.1±1.14µs        ? ?/sec    1.01    790.9±3.73µs        ? ?/sec
filter context short string view low selectivity (kept 1/1024)                1.00    936.5±1.46ns        ? ?/sec    1.01    947.4±5.04ns        ? ?/sec
filter context string (kept 1/2)                                              4.86      3.8±0.07ms        ? ?/sec    1.00   774.1±21.31µs        ? ?/sec
filter context string dictionary (kept 1/2)                                   1.00    244.2±1.43µs        ? ?/sec    1.00    243.6±1.24µs        ? ?/sec
filter context string dictionary high selectivity (kept 1023/1024)            1.00    248.9±1.50µs        ? ?/sec    1.00    248.1±0.92µs        ? ?/sec
filter context string dictionary low selectivity (kept 1/1024)                1.00    202.0±0.50µs        ? ?/sec    1.00    201.5±0.74µs        ? ?/sec
filter context string dictionary w NULLs (kept 1/2)                           1.00    341.8±0.75µs        ? ?/sec    1.00    342.5±0.93µs        ? ?/sec
filter context string dictionary w NULLs high selectivity (kept 1023/1024)    1.00    466.3±0.75µs        ? ?/sec    1.01    469.3±1.23µs        ? ?/sec
filter context string dictionary w NULLs low selectivity (kept 1/1024)        1.00    101.1±0.22µs        ? ?/sec    1.00    100.8±0.23µs        ? ?/sec
filter context string high selectivity (kept 1023/1024)                       1.63      2.5±0.06ms        ? ?/sec    1.00  1523.2±31.32µs        ? ?/sec
filter context string low selectivity (kept 1/1024)                           1.23   1993.4±3.93ns        ? ?/sec    1.00   1614.9±4.07ns        ? ?/sec

Wow that even seems a much bigger improvement 🤔

alamb commented 2 weeks ago

Wow that even seems a much bigger improvement 🤔

Will rerun and see if I can reprodice

alamb commented 2 weeks ago

Results of rerunning

++ critcmp master speedup_filter
group                                                                         master                                 speedup_filter
-----                                                                         ------                                 --------------
filter context mixed string view (kept 1/2)                                   1.00    676.9±1.46µs        ? ?/sec    1.00    680.2±3.73µs        ? ?/sec
filter context mixed string view high selectivity (kept 1023/1024)            1.00   1195.0±3.18µs        ? ?/sec    1.02   1217.4±3.71µs        ? ?/sec
filter context mixed string view low selectivity (kept 1/1024)                1.01   1167.2±2.60ns        ? ?/sec    1.00   1156.4±1.15ns        ? ?/sec
filter context short string view (kept 1/2)                                   1.00    441.9±3.60µs        ? ?/sec    1.02    451.0±2.40µs        ? ?/sec
filter context short string view high selectivity (kept 1023/1024)            1.00    780.6±2.07µs        ? ?/sec    1.01    790.6±1.75µs        ? ?/sec
filter context short string view low selectivity (kept 1/1024)                1.01    930.2±2.89ns        ? ?/sec    1.00    918.1±4.38ns        ? ?/sec
filter context string (kept 1/2)                                              4.89      3.7±0.09ms        ? ?/sec    1.00   757.8±12.84µs        ? ?/sec
filter context string dictionary (kept 1/2)                                   1.02    247.1±1.64µs        ? ?/sec    1.00    242.5±0.56µs        ? ?/sec
filter context string dictionary high selectivity (kept 1023/1024)            1.02    252.3±1.74µs        ? ?/sec    1.00    248.4±1.13µs        ? ?/sec
filter context string dictionary low selectivity (kept 1/1024)                1.00    201.8±0.46µs        ? ?/sec    1.00    200.9±0.48µs        ? ?/sec
filter context string dictionary w NULLs (kept 1/2)                           1.00    341.7±0.77µs        ? ?/sec    1.00    341.9±0.52µs        ? ?/sec
filter context string dictionary w NULLs high selectivity (kept 1023/1024)    1.00    466.6±1.22µs        ? ?/sec    1.01    469.1±3.93µs        ? ?/sec
filter context string dictionary w NULLs low selectivity (kept 1/1024)        1.00    101.0±0.09µs        ? ?/sec    1.00    101.0±0.38µs        ? ?/sec
filter context string high selectivity (kept 1023/1024)                       1.67      2.5±0.07ms        ? ?/sec    1.00  1491.5±32.49µs        ? ?/sec
filter context string low selectivity (kept 1/1024)                           1.23   1989.6±5.27ns        ? ?/sec    1.00   1622.9±4.10ns        ? ?/sec
Dandandan commented 2 weeks ago

Interesting, so it seems to have even bigger effect on other machines (maybe when branch misses have more impact or it can generate even better code) - my result is from M1 pro

alamb commented 2 weeks ago

Interesting, so it seems to have even bigger effect on other machines (maybe when branch misses have more impact or it can generate even better code) - my result is from M1 pro

This is what I am running on:

processor       : 15
vendor_id       : GenuineIntel
cpu family      : 6
model           : 85
model name      : Intel(R) Xeon(R) CPU @ 3.10GHz
stepping        : 7
microcode       : 0xffffffff
cpu MHz         : 3100.310
cache size      : 25344 KB
physical id     : 0
siblings        : 16
core id         : 7
cpu cores       : 8
apicid          : 15
initial apicid  : 15
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid ts\
c_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 hle \
avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat avx512_vnni md_clear arch_capabilities
bugs            : spectre_v1 spectre_v2 spec_store_bypass swapgs taa mmio_stale_data retbleed eibrs_pbrsb bhi
bogomips        : 6200.62
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:
Dandandan commented 2 weeks ago

Ok, thanks for sharing!