Closed: Dandandan closed this pull request 2 weeks ago
I also ran the benchmarks and see the same improvements 👍
++ critcmp master speedup_filter
group master speedup_filter
----- ------ --------------
filter context mixed string view (kept 1/2) 1.00 674.8±6.62µs ? ?/sec 1.01 679.6±6.81µs ? ?/sec
filter context mixed string view high selectivity (kept 1023/1024) 1.00 1196.4±2.54µs ? ?/sec 1.02 1217.7±3.17µs ? ?/sec
filter context mixed string view low selectivity (kept 1/1024) 1.01 1172.9±0.95ns ? ?/sec 1.00 1164.3±2.62ns ? ?/sec
filter context short string view (kept 1/2) 1.00 442.1±2.81µs ? ?/sec 1.01 447.8±4.74µs ? ?/sec
filter context short string view high selectivity (kept 1023/1024) 1.00 783.1±1.14µs ? ?/sec 1.01 790.9±3.73µs ? ?/sec
filter context short string view low selectivity (kept 1/1024) 1.00 936.5±1.46ns ? ?/sec 1.01 947.4±5.04ns ? ?/sec
filter context string (kept 1/2) 4.86 3.8±0.07ms ? ?/sec 1.00 774.1±21.31µs ? ?/sec
filter context string dictionary (kept 1/2) 1.00 244.2±1.43µs ? ?/sec 1.00 243.6±1.24µs ? ?/sec
filter context string dictionary high selectivity (kept 1023/1024) 1.00 248.9±1.50µs ? ?/sec 1.00 248.1±0.92µs ? ?/sec
filter context string dictionary low selectivity (kept 1/1024) 1.00 202.0±0.50µs ? ?/sec 1.00 201.5±0.74µs ? ?/sec
filter context string dictionary w NULLs (kept 1/2) 1.00 341.8±0.75µs ? ?/sec 1.00 342.5±0.93µs ? ?/sec
filter context string dictionary w NULLs high selectivity (kept 1023/1024) 1.00 466.3±0.75µs ? ?/sec 1.01 469.3±1.23µs ? ?/sec
filter context string dictionary w NULLs low selectivity (kept 1/1024) 1.00 101.1±0.22µs ? ?/sec 1.00 100.8±0.23µs ? ?/sec
filter context string high selectivity (kept 1023/1024) 1.63 2.5±0.06ms ? ?/sec 1.00 1523.2±31.32µs ? ?/sec
filter context string low selectivity (kept 1/1024) 1.23 1993.4±3.93ns ? ?/sec 1.00 1614.9±4.07ns ? ?/sec
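The headline row above is "filter context string (kept 1/2)": 3.8 ms on master vs 774.1 µs on the branch. As a quick sanity check of critcmp's ratio column (the small difference comes from the table's rounded 3.8 ms median), the speedup can be recomputed directly:

```rust
fn main() {
    // Medians taken from the critcmp table above (master vs speedup_filter).
    let master_us = 3800.0_f64; // 3.8 ms, as rounded in the table
    let branch_us = 774.1_f64; // 774.1 µs
    let speedup = master_us / branch_us;
    // critcmp reports 4.86x from the unrounded medians; the rounded
    // figures give roughly 4.9x.
    assert!((speedup - 4.9).abs() < 0.1);
}
```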
Thanks @Dandandan and @jayzhan211
Wow, that seems like an even bigger improvement 🤔
Will rerun and see if I can reproduce.
Results of rerunning
++ critcmp master speedup_filter
group master speedup_filter
----- ------ --------------
filter context mixed string view (kept 1/2) 1.00 676.9±1.46µs ? ?/sec 1.00 680.2±3.73µs ? ?/sec
filter context mixed string view high selectivity (kept 1023/1024) 1.00 1195.0±3.18µs ? ?/sec 1.02 1217.4±3.71µs ? ?/sec
filter context mixed string view low selectivity (kept 1/1024) 1.01 1167.2±2.60ns ? ?/sec 1.00 1156.4±1.15ns ? ?/sec
filter context short string view (kept 1/2) 1.00 441.9±3.60µs ? ?/sec 1.02 451.0±2.40µs ? ?/sec
filter context short string view high selectivity (kept 1023/1024) 1.00 780.6±2.07µs ? ?/sec 1.01 790.6±1.75µs ? ?/sec
filter context short string view low selectivity (kept 1/1024) 1.01 930.2±2.89ns ? ?/sec 1.00 918.1±4.38ns ? ?/sec
filter context string (kept 1/2) 4.89 3.7±0.09ms ? ?/sec 1.00 757.8±12.84µs ? ?/sec
filter context string dictionary (kept 1/2) 1.02 247.1±1.64µs ? ?/sec 1.00 242.5±0.56µs ? ?/sec
filter context string dictionary high selectivity (kept 1023/1024) 1.02 252.3±1.74µs ? ?/sec 1.00 248.4±1.13µs ? ?/sec
filter context string dictionary low selectivity (kept 1/1024) 1.00 201.8±0.46µs ? ?/sec 1.00 200.9±0.48µs ? ?/sec
filter context string dictionary w NULLs (kept 1/2) 1.00 341.7±0.77µs ? ?/sec 1.00 341.9±0.52µs ? ?/sec
filter context string dictionary w NULLs high selectivity (kept 1023/1024) 1.00 466.6±1.22µs ? ?/sec 1.01 469.1±3.93µs ? ?/sec
filter context string dictionary w NULLs low selectivity (kept 1/1024) 1.00 101.0±0.09µs ? ?/sec 1.00 101.0±0.38µs ? ?/sec
filter context string high selectivity (kept 1023/1024) 1.67 2.5±0.07ms ? ?/sec 1.00 1491.5±32.49µs ? ?/sec
filter context string low selectivity (kept 1/1024) 1.23 1989.6±5.27ns ? ?/sec 1.00 1622.9±4.10ns ? ?/sec
Interesting, so it seems to have an even bigger effect on other machines (maybe branch misses have more impact there, or the compiler can generate even better code) - my results are from an M1 Pro.
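One way branch misses show up in filter kernels: at ~50% selectivity (the "kept 1/2" cases) a per-element `if keep` branch is essentially unpredictable. A standalone sketch, not arrow-rs code, of the branch-free formulation that avoids this by writing unconditionally and advancing the output cursor by 0 or 1:

```rust
// Branch-free filter sketch: every element is written, but the output
// cursor only advances when the mask bit is set, so there is no
// data-dependent branch for the predictor to miss.
fn filter_branchless(values: &[u8], mask: &[bool]) -> Vec<u8> {
    let mut out = vec![0u8; values.len()];
    let mut len = 0usize;
    for (&v, &keep) in values.iter().zip(mask) {
        out[len] = v;         // unconditional write (may be overwritten later)
        len += keep as usize; // advance only for kept elements
    }
    out.truncate(len);
    out
}

fn main() {
    let kept = filter_branchless(&[1, 2, 3, 4], &[true, false, true, false]);
    assert_eq!(kept, vec![1, 3]);
}
```

Whether this beats the straightforward branchy loop depends on the CPU's predictor and the compiler's codegen, which would be consistent with the results differing between an M1 Pro and the Xeon above.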
This is what I am running on:
processor : 15
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Xeon(R) CPU @ 3.10GHz
stepping : 7
microcode : 0xffffffff
cpu MHz : 3100.310
cache size : 25344 KB
physical id : 0
siblings : 16
core id : 7
cpu cores : 8
apicid : 15
initial apicid : 15
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat avx512_vnni md_clear arch_capabilities
bugs : spectre_v1 spectre_v2 spec_store_bypass swapgs taa mmio_stale_data retbleed eibrs_pbrsb bhi
bogomips : 6200.62
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
Ok, thanks for sharing!
Which issue does this PR close?
MutableBuffer doesn't always generate optimal code compared to using Vec. I think the main performance difference might be in MutableBuffer::push (but not sure if that explains all the difference).

Closes #.
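As a minimal sketch of the kind of change described (hypothetical code, not the actual diff), the filter hot loop can accumulate its output in a plain Vec, so the per-element work goes through Vec::push, whose in-capacity fast path the standard library keeps very small and the optimizer handles well:

```rust
// Hypothetical filter hot loop built on Vec rather than a hand-rolled
// mutable buffer; u32 stands in for the values being gathered.
fn filter_values(values: &[u32], mask: &[bool]) -> Vec<u32> {
    // Reserving up front means push almost always hits the fast path:
    // a write plus a length bump, with the grow path kept out of line.
    let mut out = Vec::with_capacity(values.len());
    for (&v, &keep) in values.iter().zip(mask) {
        if keep {
            out.push(v);
        }
    }
    out
    // In arrow-rs the final Vec can then be turned into an immutable
    // Buffer, but that conversion is outside this sketch.
}

fn main() {
    assert_eq!(filter_values(&[10, 20, 30], &[true, false, true]), vec![10, 30]);
}
```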
Rationale for this change
What changes are included in this PR?
Are there any user-facing changes?