Improve Spark comparison functions performance by auto-vectorization

Description

Add __restrict annotations on the inputs of Spark comparison fucntions to aid in auto-vectorization. This is also applicable to types like timestamp, decimal, etc.

velox_sparksql_benchmarks_simd_compare Benchmark code: https://github.com/facebookincubator/velox/pull/10273/files#diff-8d736f6738d1f0ce8c70df4e90c3501b64f14375faffe9024bb142ed87ef83c8. Benchmark result before this change:

============================================================================
[...]l/benchmarks/SIMDCompareBenchmark.cpp     relative  time/iter   iters/s
============================================================================
tinyint_greaterthan                                         4.32ms    231.43
tinyint_greaterthanorequal                                  3.88ms    257.48
tinyint_equalto                                             1.80ms    556.66
int_greaterthan                                             3.96ms    252.53
int_greaterthanorequal                                      4.27ms    234.26
int_equalto                                                 1.66ms    601.41
bigint_greaterthan                                          3.95ms    253.24
bigint_greaterthanorequal                                   3.97ms    251.74
bigint_equalto                                              1.70ms    588.51
timestamp_greaterthan                                       4.36ms    229.25
timestamp_greaterthanorequal                                4.55ms    219.70
timestamp_equalto                                           1.70ms    588.62
double_greaterthan                                          4.33ms    230.96
double_greaterthanorequal                                   4.43ms    225.89
double_equalto                                              1.69ms    590.61
varchar_greaterthan                                         6.78ms    147.42
varchar_greaterthanorequal                                  7.24ms    138.04
varchar_equalto                                             1.66ms    600.82
tinyint_greaterthanorequal_partial_selected                 2.89ms    346.16
int_greaterthanorequal_partial_selected                     2.88ms    346.96
bigint_greaterthanorequal_partial_selected                  2.69ms    371.33
timestamp_greaterthanorequal_partial_selected               2.91ms    343.95
double_greaterthanorequal_partial_selected                  3.11ms    321.15
varchar_greaterthanorequal_partial_selected                 5.56ms    180.01

Benchmake result after this change:

============================================================================
[...]l/benchmarks/SIMDCompareBenchmark.cpp     relative  time/iter   iters/s
============================================================================
tinyint_greaterthan                                        56.94us    17.56K
tinyint_greaterthanorequal                                 56.58us    17.67K
tinyint_equalto                                            57.47us    17.40K
int_greaterthan                                           100.00us    10.00K
int_greaterthanorequal                                    100.02us    10.00K
int_equalto                                               100.26us     9.97K
bigint_greaterthan                                        168.61us     5.93K
bigint_greaterthanorequal                                 168.90us     5.92K
bigint_equalto                                            152.70us     6.55K
timestamp_greaterthan                                     865.50us     1.16K
timestamp_greaterthanorequal                                1.02ms    978.55
timestamp_equalto                                         408.13us     2.45K
double_greaterthan                                        375.12us     2.67K
double_greaterthanorequal                                 471.73us     2.12K
double_equalto                                            797.74us     1.25K
varchar_greaterthan                                         1.63ms    612.34
varchar_greaterthanorequal                                  5.46ms    183.28
varchar_equalto                                           302.97us     3.30K
tinyint_greaterthanorequal_partial_selected               562.24us     1.78K
int_greaterthanorequal_partial_selected                   533.95us     1.87K
bigint_greaterthanorequal_partial_selected                536.33us     1.86K
timestamp_greaterthanorequal_partial_selected               1.31ms    766.09
double_greaterthanorequal_partial_selected                992.68us     1.01K
varchar_greaterthanorequal_partial_selected                 4.34ms    230.49

velox_sparksql_benchmarks_compare Benchmark result before this change:

============================================================================
[...]hmarks/ExpressionBenchmarkBuilder.cpp     relative  time/iter   iters/s
============================================================================
compare##gt                                               437.93us     2.28K
compare##gt_with_cast                                     868.98us     1.15K
compare##gte                                              436.48us     2.29K
compare##gte_with_cast                                    866.63us     1.15K
compare##lt                                               447.94us     2.23K
compare##lt_with_cast                                     873.71us     1.14K
compare##lte                                              446.68us     2.24K
compare##lte_with_cast                                    871.41us     1.15K
compare##eq                                               430.99us     2.32K
compare##eq_with_cast                                     866.67us     1.15K
compare##neq                                              432.56us     2.31K
compare##neq_with_cast                                    754.15us     1.33K
----------------------------------------------------------------------------

Benchmake result after this change:

============================================================================
[...]hmarks/ExpressionBenchmarkBuilder.cpp     relative  time/iter   iters/s
============================================================================
compare##gt                                               209.08us     4.78K
compare##gt_with_cast                                     609.70us     1.64K
compare##gte                                              205.40us     4.87K
compare##gte_with_cast                                    609.73us     1.64K
compare##lt                                               202.81us     4.93K
compare##lt_with_cast                                     615.84us     1.62K
compare##lte                                              209.12us     4.78K
compare##lte_with_cast                                    611.13us     1.64K
compare##eq                                               205.20us     4.87K
compare##eq_with_cast                                     609.77us     1.64K
compare##neq                                              203.66us     4.91K
compare##neq_with_cast                                    638.76us     1.57K
----------------------------------------------------------------------------

We can observe a measurable speed-up with this change: the average performance gain is about (256 / size of type) times when all rows are selected. For example, for the timestamp type, there is about a 2X performance gain. When only some rows are selected, there are extra checks for row selection, but the result is still much better than before this optimization. For example, int_greaterthanorequal_partial_selected shows about a 6X gain.

The above tests show that auto-vectorization can be applied to both varchar and timestamp types without needing to write specific SIMD instructions (this could be a very complex task). It also shows that this can improve performance for cases where all rows are selected and for cases where rows are partially selected (we set the selection ratio to 75% for test cases with names ending in _partial_selected, which is also the threshold we use in the code).

We note the slowness for the following expression: A > C AND B <= C. All A, B, and C are timestamp type. This is the reason we looked at this part. We considered leveraging some SIMD instructions to improve the timestamp type comparison function, but it's challenging for me. BTW, I did an experiment (velox_functions_prestosql_benchmarks_comparisons) to compare the performance difference between auto-vectorization and explicit-vectorization. The results showed that the performance of explicit vectorization is better, but the difference is not very large. velox_functions_prestosql_benchmarks_comparisons Explicit vectorization(Original, main branch)

============================================================================
[...]l/benchmarks/ComparisonsBenchmark.cpp     relative  time/iter   iters/s
============================================================================
non_simd_bigint_eq                                        937.04us     1.07K
simd_bigint_eq                                  650.07%   144.14us     6.94K
non_simd_integer_eq                                       974.05us     1.03K
simd_integer_eq                                 944.07%   103.18us     9.69K
non_simd_smallint_eq                                      926.33us     1.08K
simd_smallint_eq                                1279.6%    72.39us    13.81K
non_simd_tinyint_eq                                       948.58us     1.05K
simd_tinyint_eq                                 1777.1%    53.38us    18.73K
non_simd_double_eq                                        936.13us     1.07K
simd_double_eq                                  742.26%   126.12us     7.93K
non_simd_real_eq                                          925.45us     1.08K
simd_real_eq                                    890.76%   103.89us     9.63K
non_simd_date_eq                                          927.94us     1.08K
simd_date_eq                                    890.55%   104.20us     9.60K
non_simd_interval_day_time_eq                             933.93us     1.07K
simd_interval_day_time_eq                       644.91%   144.81us     6.91K
non_simd_interval_year_month_eq                           976.90us     1.02K
simd_interval_year_month_eq                     946.18%   103.25us     9.69K

Auto-vectorization: https://github.com/zhli1142015/velox/commit/59de096ec8c6f81334fa71471dc98353a8c415dc

============================================================================
[...]l/benchmarks/ComparisonsBenchmark.cpp     relative  time/iter   iters/s
============================================================================
non_simd_bigint_eq                                        937.32us     1.07K
simd_bigint_eq                                  581.36%   161.23us     6.20K
non_simd_integer_eq                                       928.92us     1.08K
simd_integer_eq                                 859.04%   108.13us     9.25K
non_simd_smallint_eq                                      925.98us     1.08K
simd_smallint_eq                                1114.0%    83.12us    12.03K
non_simd_tinyint_eq                                       947.64us     1.06K
simd_tinyint_eq                                 1450.2%    65.34us    15.30K
non_simd_double_eq                                        933.40us     1.07K
simd_double_eq                                  616.65%   151.37us     6.61K
non_simd_real_eq                                          927.52us     1.08K
simd_real_eq                                    853.92%   108.62us     9.21K
non_simd_date_eq                                          930.21us     1.08K
simd_date_eq                                    814.95%   114.14us     8.76K
non_simd_interval_day_time_eq                             985.53us     1.01K
simd_interval_day_time_eq                       575.84%   171.15us     5.84K
non_simd_interval_year_month_eq                           942.41us     1.06K
simd_interval_year_month_eq                     823.95%   114.38us     8.74K

facebookincubator / velox

Improve Spark comparison functions performance by auto-vectorization #10277

Description