facebookincubator / velox

A C++ vectorized database acceleration library aimed to optimizing query engines and data processing systems.
https://velox-lib.io/
Apache License 2.0
3.27k stars 1.08k forks source link

The performance of right semi could be not as good as left semi #9980

Open rui-mo opened 1 month ago

rui-mo commented 1 month ago

Description

In TPC-DS q14a/q14b, there is a left semi join with a right build side, where the data size of right table is 10G and the left is 3M. Given that a LEFT b == b RIGHT a, we switched to using a right semi join in Velox to build on the smaller table in order to lessen the hash build's memory pressure. However, with this change, we observe a regression in performance from 30s to 300s. This leads us to think about the performance of right semi join.

In https://github.com/rui-mo/velox/commit/e05691a6648ebd5a540ef3b3baae8868a48b38fd, we added several tests using various build types but the same left and right tables. Here is the performance data we obtained. These data show that the left and right outer joins perform similarly, but the right semi join is much slower than the left semi join.

ParquetTableScanTest.leftSemiJoin (74 ms) ParquetTableScanTest.rightSemiJoin (655 ms) ParquetTableScanTest.leftOuterJoin (2081 ms) ParquetTableScanTest.rightOuterJoin (2096 ms)

We gathered the right semi join performance record, which indicates that setProbedFlag is a hotspot.

   40.01%  CPUThreadPool0   velox_dwio_parquet_table_scan_test  [.] facebook::velox::exec::RowContainer::setProbedFlag
   5.77%  CPUThreadPool0   libc-2.31.so                        [.] __memcpy_avx_unaligned_erms
   5.15%  CPUThreadPool0   [vdso]                              [.] __vdso_clock_gettime
   4.62%  CPUThreadPool0   velox_dwio_parquet_table_scan_test  [.] facebook::velox::exec::OpCallStatus::start
   2.63%  CPUThreadPool0   velox_dwio_parquet_table_scan_test  [.] facebook::velox::exec::Driver::runInternal
   2.53%  CPUThreadPool0   velox_dwio_parquet_table_scan_test  [.] facebook::velox::exec::HashTable<true>::listJoinResults
   1.84%  CPUThreadPool0   velox_dwio_parquet_table_scan_test  [.] facebook::velox::exec::OpCallStatus::stop
   1.83%  CPUThreadPool0   velox_dwio_parquet_table_scan_test  [.] facebook::velox::ExceptionContextSetter::ExceptionContextSetter
rui-mo commented 1 month ago

@mbasmanova Would you like to give us some inputs here? Thanks.

rui-mo commented 1 month ago

cc: @FelixYBW @zhouyuan @zml1206

FelixYBW commented 4 weeks ago

@pedroerp

Yuhta commented 3 weeks ago

setProbedFlag is a slow operation due to its random memory access pattern. To accelerate this use case, we could move them to a separate bit mask array in hash table. @oerling @mbasmanova @xiaoxmeng What do you think?

mbasmanova commented 3 weeks ago

I wonder if it would be faster to set 'probed' flag during probing. I assume we load the 'row' anyway to compare join keys, so we already have the data in memory. This can help at least for right / full joins without the extra filter.

Yuhta commented 3 weeks ago

@mbasmanova This will work for non array hash mode. For array hash mode (which is probably the case here) we still need to find another way to accelerate it.