Open zhli1142015 opened 2 weeks ago
cc @mbasmanova , @Yuhta , @FelixYBW , @rui-mo , could you please help to check this? https://github.com/facebookincubator/velox/pull/10273 Thanks.
It's not surprising to see a big perf gain with SIMD. A better way is to use intrinsics directly.
We noticed slowness in the following expression: A > C AND B <= C, where A, B, and C are all of timestamp type. That is why we looked at this part. We considered using SIMD instructions to improve the timestamp comparison function, but that is challenging for me.
BTW, I ran an experiment (velox_functions_prestosql_benchmarks_comparisons) to compare auto-vectorization against explicit vectorization. Explicit vectorization performs better, but the difference is not very large.
velox_functions_prestosql_benchmarks_comparisons
Explicit vectorization (original, main branch)
============================================================================
[...]l/benchmarks/ComparisonsBenchmark.cpp    relative  time/iter   iters/s
============================================================================
non_simd_bigint_eq                                       937.04us     1.07K
simd_bigint_eq                                 650.07%   144.14us     6.94K
non_simd_integer_eq                                      974.05us     1.03K
simd_integer_eq                                944.07%   103.18us     9.69K
non_simd_smallint_eq                                     926.33us     1.08K
simd_smallint_eq                               1279.6%    72.39us    13.81K
non_simd_tinyint_eq                                      948.58us     1.05K
simd_tinyint_eq                                1777.1%    53.38us    18.73K
non_simd_double_eq                                       936.13us     1.07K
simd_double_eq                                 742.26%   126.12us     7.93K
non_simd_real_eq                                         925.45us     1.08K
simd_real_eq                                   890.76%   103.89us     9.63K
non_simd_date_eq                                         927.94us     1.08K
simd_date_eq                                   890.55%   104.20us     9.60K
non_simd_interval_day_time_eq                            933.93us     1.07K
simd_interval_day_time_eq                      644.91%   144.81us     6.91K
non_simd_interval_year_month_eq                          976.90us     1.02K
simd_interval_year_month_eq                    946.18%   103.25us     9.69K
Auto-vectorization: https://github.com/zhli1142015/velox/commit/59de096ec8c6f81334fa71471dc98353a8c415dc
============================================================================
[...]l/benchmarks/ComparisonsBenchmark.cpp    relative  time/iter   iters/s
============================================================================
non_simd_bigint_eq                                       937.32us     1.07K
simd_bigint_eq                                 581.36%   161.23us     6.20K
non_simd_integer_eq                                      928.92us     1.08K
simd_integer_eq                                859.04%   108.13us     9.25K
non_simd_smallint_eq                                     925.98us     1.08K
simd_smallint_eq                               1114.0%    83.12us    12.03K
non_simd_tinyint_eq                                      947.64us     1.06K
simd_tinyint_eq                                1450.2%    65.34us    15.30K
non_simd_double_eq                                       933.40us     1.07K
simd_double_eq                                 616.65%   151.37us     6.61K
non_simd_real_eq                                         927.52us     1.08K
simd_real_eq                                   853.92%   108.62us     9.21K
non_simd_date_eq                                         930.21us     1.08K
simd_date_eq                                   814.95%   114.14us     8.76K
non_simd_interval_day_time_eq                            985.53us     1.01K
simd_interval_day_time_eq                      575.84%   171.15us     5.84K
non_simd_interval_year_month_eq                          942.41us     1.06K
simd_interval_year_month_eq                    823.95%   114.38us     8.74K
Description
Add __restrict annotations to the inputs of the Spark comparison functions to aid auto-vectorization. This also applies to types such as timestamp and decimal.
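For readers unfamiliar with the annotation, here is a minimal sketch of the idea, not the code in this PR. The function names `greaterThanOrEqual` and `compareAll` are hypothetical, and the loop works on plain arrays rather than Velox vectors. `__restrict` is a compiler extension (GCC, Clang, MSVC) that promises the buffers do not overlap, which typically makes it easier for the compiler to vectorize the loop; whether it actually does depends on compiler and flags.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not the Velox implementation: element-wise a[i] >= b[i].
// The __restrict qualifiers tell the compiler that a, b and out never alias,
// so it does not have to assume a store to out[i] could change a[] or b[].
inline void greaterThanOrEqual(
    const int32_t* __restrict a,
    const int32_t* __restrict b,
    uint8_t* __restrict out,
    std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = static_cast<uint8_t>(a[i] >= b[i]);
  }
}

// Convenience wrapper (also hypothetical) that owns the output buffer.
inline std::vector<uint8_t> compareAll(
    const std::vector<int32_t>& a,
    const std::vector<int32_t>& b) {
  assert(a.size() == b.size());
  std::vector<uint8_t> out(a.size());
  greaterThanOrEqual(a.data(), b.data(), out.data(), a.size());
  return out;
}
```

Calling `compareAll({1, 5, 3, -2}, {1, 4, 7, -3})` yields one 0/1 byte per row. Inspecting the generated assembly (or a compiler vectorization report) is the way to confirm that a given build actually vectorized the loop.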
velox_sparksql_benchmarks_simd_compare
Benchmark code: https://github.com/facebookincubator/velox/pull/10273/files#diff-8d736f6738d1f0ce8c70df4e90c3501b64f14375faffe9024bb142ed87ef83c8.
Benchmark result before this change:
Benchmark result after this change:
velox_sparksql_benchmarks_compare
Benchmark result before this change:
Benchmark result after this change:
This change yields a measurable speed-up: when all rows are selected, the average gain is about (256 / size of the type in bits) times. For the timestamp type, for example, the gain is about 2X. When only some rows are selected, extra checks for row selection are needed, but the result is still much better than before this optimization; for example, int_greaterthanorequal_partial_selected shows about a 6X gain.
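The rule of thumb above can be checked with a little arithmetic. The sketch below assumes that 256 refers to the width of a vector register in bits (for example AVX2) and that the size of the type is also measured in bits; both readings are assumptions, and `TimestampLike` is a hypothetical 16-byte stand-in (seconds plus nanoseconds), not the Velox class.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: 256-bit vector registers, e.g. AVX2.
constexpr std::size_t kRegisterBits = 256;

// Hypothetical stand-in for a 16-byte timestamp value.
struct TimestampLike {
  int64_t seconds;
  uint64_t nanos;
};

// Number of values of type T that fit in one register. Under the assumptions
// above, this is the ideal speed-up factor quoted in the description.
template <typename T>
constexpr std::size_t lanesPerRegister() {
  static_assert(sizeof(T) * 8 <= kRegisterBits, "type wider than register");
  return kRegisterBits / (sizeof(T) * 8);
}

static_assert(lanesPerRegister<TimestampLike>() == 2, "256 / 128 = 2");
static_assert(lanesPerRegister<int32_t>() == 8, "256 / 32 = 8");
```

Under these assumptions a 128-bit timestamp gives a factor of 2, matching the 2X figure, while a 32-bit int gives an upper bound of 8, which is consistent with the roughly 6X reported for the partially selected int case once the row-selection checks are paid for.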
The tests above show that auto-vectorization can be applied to both varchar and timestamp types without hand-writing SIMD instructions, which could be a very complex task. They also show that it improves performance both when all rows are selected and when rows are partially selected (test cases with names ending in _partial_selected use a selection ratio of 75%, which is also the threshold we use in the code).
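How a selection-ratio threshold could drive the choice of code path is sketched below. This is one plausible reading, not the logic in this PR: the name `compareSelected`, the byte-per-row selection mask, and the exact way the 75% threshold is applied are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// 75% comes from the description; how it is applied here is an assumption.
constexpr double kDenseThreshold = 0.75;

// Hypothetical sketch: computes a[i] >= b[i] for selected rows only and
// writes 0 for unselected rows. Returns true if the dense path was taken.
inline bool compareSelected(
    const std::vector<int32_t>& a,
    const std::vector<int32_t>& b,
    const std::vector<uint8_t>& selected,
    std::vector<uint8_t>& out) {
  assert(a.size() == b.size() && a.size() == selected.size());
  const std::size_t n = a.size();
  out.assign(n, 0);

  std::size_t numSelected = 0;
  for (uint8_t s : selected) {
    numSelected += (s != 0);
  }

  const bool dense = n > 0 && numSelected >= kDenseThreshold * n;
  if (dense) {
    // Mostly selected: evaluate every row in a branch-free loop that the
    // compiler can vectorize, masking out the unselected rows.
    for (std::size_t i = 0; i < n; ++i) {
      out[i] = static_cast<uint8_t>((a[i] >= b[i]) & (selected[i] != 0));
    }
  } else {
    // Sparsely selected: only touch the rows that are needed.
    for (std::size_t i = 0; i < n; ++i) {
      if (selected[i]) {
        out[i] = static_cast<uint8_t>(a[i] >= b[i]);
      }
    }
  }
  return dense;
}
```

With four rows and three of them selected, the ratio is exactly 75% and the dense path is taken; with one row selected, the sparse path runs. The trade-off is that the dense path does some wasted work on unselected rows in exchange for a loop without data-dependent branches.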