facebookincubator / velox

A C++ vectorized database acceleration library aimed at optimizing query engines and data processing systems.
https://velox-lib.io/
Apache License 2.0

Improve Fast C++ Bit Unpacking #4103

Open yyang52 opened 1 year ago

yyang52 commented 1 year ago

Description

RLE/bit-packing encoding is highly efficient and heavily used in Parquet. When reading Parquet files, a fast-path bit unpacking routine is important for overall efficiency. Velox currently has a fast C++ bit unpacking library built on AVX2 and BMI2 instructions. With AVX-512's VBMI and other instructions, performance could be improved further (our initial benchmark shows up to a 1.6x boost over the current AVX2/BMI2 implementation when compiled with clang). As more and more platforms support AVX-512 instructions, it may be worth doing further optimizations with them.

pedroerp commented 1 year ago

Cc: @yingsu00 @oerling @Yuhta who did many experiments with bit unpacking. Did we end up testing avx 512?

Yuhta commented 1 year ago

@pedroerp Not as far as I can see. To experiment with this, we need to

  1. Fix RleBpDecoder for multibit levels first
  2. Enhance RleBpDecoder to see if an extra AVX512 implementation outweighs the complexity added
pedroerp commented 1 year ago

Thanks @Yuhta.

@yyang52 if you're willing to work on this, I think the first step here would be to put a small micro-benchmark together to see whether bit unpacking would actually go faster with avx 512. We tried avx512 in the past in a few other parts of the code, and the results were not as fast as expected. I remember hearing that avx 512 might end up reducing the clock in some architectures. Still, would be interesting to benchmark.

yyang52 commented 1 year ago

Thanks @pedroerp @Yuhta for your feedback.

We did run micro-benchmarks comparing the current AVX2/BMI2 solution with our AVX-512 optimized solution, and got the initial results below (the benchmark setup is basically the same as BitpackDecoderBenchmark.cpp, compiled with clang-12, on ICX with an Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz):

=======================================================================================
[...]/tests/BitPackDecoderBenchmarknew.cpp     time/iter   iters/s   Speedup ratio (AVX-512 vs. AVX2)
=======================================================================================
velox_unpack_fullrows_switch_1_8                412.28us     2.43K
avx512_new_unpack_fullrows_switch_1_8           366.30us     2.73K  1.13x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_2_8                431.92us     2.32K
avx512_new_unpack_fullrows_switch_2_8           373.58us     2.68K  1.16x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_3_8                474.15us     2.11K
avx512_new_unpack_fullrows_switch_3_8           412.88us     2.42K  1.15x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_4_8                492.42us     2.03K
avx512_new_unpack_fullrows_switch_4_8           441.36us     2.27K  1.12x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_5_8                514.35us     1.94K
avx512_new_unpack_fullrows_switch_5_8           477.22us     2.10K  1.08x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_6_8                542.78us     1.84K
avx512_new_unpack_fullrows_switch_6_8           511.16us     1.96K  1.06x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_7_8                577.04us     1.73K
avx512_new_unpack_fullrows_switch_7_8           546.63us     1.83K  1.06x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_8_8                583.99us     1.71K
avx512_new_unpack_fullrows_switch_8_8           487.49us     2.05K  1.20x
----------------------------------------------------------------------------
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_1_16               791.99us     1.26K
avx512_new_unpack_fullrows_switch_1_16          782.97us     1.28K  1.01x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_2_16               806.32us     1.24K
avx512_new_unpack_fullrows_switch_2_16          687.39us     1.45K  1.17x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_3_16               858.88us     1.16K
avx512_new_unpack_fullrows_switch_3_16          731.05us     1.37K  1.17x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_4_16               915.86us     1.09K
avx512_new_unpack_fullrows_switch_4_16          771.22us     1.30K  1.19x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_5_16               936.07us     1.07K
avx512_new_unpack_fullrows_switch_5_16          811.02us     1.23K  1.15x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_6_16               989.18us     1.01K
avx512_new_unpack_fullrows_switch_6_16          848.66us     1.18K  1.17x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_7_16                 1.13ms    882.83
avx512_new_unpack_fullrows_switch_7_16          904.03us     1.11K  1.25x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_8_16                 1.14ms    876.66
avx512_new_unpack_fullrows_switch_8_16            1.13ms    883.98  1.01x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_9_16                 1.41ms    711.71
avx512_new_unpack_fullrows_switch_9_16            1.03ms    972.76  1.37x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_10_16                1.38ms    723.75
avx512_new_unpack_fullrows_switch_10_16           1.07ms    932.85  1.29x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_11_16                1.75ms    572.91
avx512_new_unpack_fullrows_switch_11_16           1.13ms    885.27  1.55x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_12_16                1.52ms    659.99
avx512_new_unpack_fullrows_switch_12_16           1.19ms    842.70  1.28x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_13_16                2.06ms    484.73
avx512_new_unpack_fullrows_switch_13_16           1.25ms    802.24  1.66x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_14_16                1.89ms    529.41
avx512_new_unpack_fullrows_switch_14_16           1.31ms    764.70  1.44x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_15_16                1.83ms    547.45
avx512_new_unpack_fullrows_switch_15_16           1.37ms    728.07  1.33x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_16_16              980.40us     1.02K
avx512_new_unpack_fullrows_switch_16_16         979.07us     1.02K  1.00x
----------------------------------------------------------------------------
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_1_32               2.36ms    423.05
avx512_new_unpack_fullrows_switch_1_32          2.12ms    472.22    1.12x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_2_32               2.45ms    408.24
avx512_new_unpack_fullrows_switch_2_32          1.67ms    598.45    1.47x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_3_32               2.50ms    400.42
avx512_new_unpack_fullrows_switch_3_32          1.75ms    571.69    1.43x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_4_32               2.66ms    376.17
avx512_new_unpack_fullrows_switch_4_32          1.81ms    551.54    1.47x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_5_32               2.75ms    363.89
avx512_new_unpack_fullrows_switch_5_32          2.06ms    485.34    1.33x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_6_32               2.82ms    354.63
avx512_new_unpack_fullrows_switch_6_32          2.15ms    465.99    1.31x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_7_32               2.78ms    359.78
avx512_new_unpack_fullrows_switch_7_32          2.30ms    435.36    1.21x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_8_32               2.90ms    344.44
avx512_new_unpack_fullrows_switch_8_32          2.15ms    465.08    1.35x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_9_32               2.83ms    353.60
avx512_new_unpack_fullrows_switch_9_32          2.31ms    432.96    1.22x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_10_32              2.90ms    345.37
avx512_new_unpack_fullrows_switch_10_32         2.40ms    415.88    1.20x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_11_32              2.97ms    336.26
avx512_new_unpack_fullrows_switch_11_32         2.51ms    398.29    1.18x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_13_32              3.09ms    323.69
avx512_new_unpack_fullrows_switch_13_32         2.74ms    365.63    1.13x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_15_32              3.24ms    308.36
avx512_new_unpack_fullrows_switch_15_32         2.91ms    343.71    1.11x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_17_32              3.60ms    277.60
avx512_new_unpack_fullrows_switch_17_32         3.05ms    327.52    1.18x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_19_32              3.83ms    260.80
avx512_new_unpack_fullrows_switch_19_32         3.24ms    308.34    1.18x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_21_32              3.97ms    251.90
avx512_new_unpack_fullrows_switch_21_32         3.44ms    290.41    1.15x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_24_32              4.15ms    240.89
avx512_new_unpack_fullrows_switch_24_32         3.74ms    267.19    1.11x
----------------------------------------------------------------------------    
velox_unpack_fullrows_switch_28_32              4.56ms    219.46
avx512_new_unpack_fullrows_switch_28_32         4.13ms    241.99    1.10x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_30_32              4.74ms    211.18
avx512_new_unpack_fullrows_switch_30_32         4.32ms    231.39    1.10x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_32_32              2.57ms    389.45
avx512_new_unpack_fullrows_switch_32_32         2.52ms    396.50    1.02x
----------------------------------------------------------------------------

Regarding the concern that AVX-512 might reduce the clock on some architectures: it should not be a big drawback here, because the optimization uses only AVX-512 "light" operations, which run at the same core power level as AVX2. Per our initial observation, the frequency stays around 2.5GHz when running the workload with either AVX2 or AVX-512.

In any case, your feedback is highly valuable, and we will conduct more tests and analysis.

Yuhta commented 1 year ago

@yyang52 Do you have comparison for selective unpack (i.e. oddrows)?

oerling commented 1 year ago

We could always have a look at exactly how this works. Do we have the code for the experiments? This is a very localized change, but it probably does not show up in end-to-end times, since bit unpacking is below 3% of the profile in pretty much anything one does.

We can't depend on AVX-512, but we could use it in some cases. The probably higher-impact item would be to use AVX-512 for filters. For example, you could translate the check in BigintValuesUsingHashTable to AVX-512. There is also a compressed store instruction that we could use to replace some permutes, if it were faster. Could we get a fast path of table scan that is 1.3x faster than the current one with AVX-512? Maybe within the possible. If I had to choose, optimizing the timing of IO would likely be the higher return, but that work is ongoing.

yyang52 commented 1 year ago

Thanks @oerling @Yuhta for your feedback.

I have attached the files with the related code: bitpackAVX512.tar.gz. It would be great if you could have a look and help review. These are more of an initial draft, and I am still refining the code with my colleague @mrambacher. For now, I put the AVX-512 related files in as utilities that do not interfere with the existing code path (though they could be integrated more closely behind compile options), and added a benchmark comparing performance, with the updated benchmark results. Correctness is also tested, locally for now.

For the AVX-512-accelerated filters you mentioned, we have done an initial investigation as well. You're right that we can use the AVX-512 compressed store to replace the permutes. I tried to implement an AVX-512 version of one of the functions ("processFixedFilter"), but the existing XSIMD usage may have some compile issues with AVX-512. Also, some of my colleagues are working on filter optimizations for string/decimal types, which might be integrated into Velox for further speedup.

Yuhta commented 1 year ago

For the bit unpack code it should be OK as long as it is compiled as a separate compilation unit; we can enable AVX512 just on that unit. Just do not forget to support selective unpacking.

For the filters it is the same, if it is just filter code and you can isolate the AVX512 part into its own compilation unit, we are probably fine.

The hard part is the selective reading code (e.g. ColumnVisitors.h) that is templated and cannot be isolated to its own compilation unit easily. For this you would need to fix the XSIMD usage to support all AVX512 cases. I doubt it is worth the trouble to do that.
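The separate-compilation-unit approach suggested above can be expressed in CMake roughly like this (file and target names hypothetical; the exact `-m` flags depend on which AVX-512 subsets the code uses):

```cmake
# Hypothetical sketch: compile only the AVX-512 translation unit with the
# extra ISA flags, so the rest of the build keeps its baseline (e.g. AVX2).
add_library(velox_bitpack_avx512 OBJECT BitUnpackAvx512.cpp)
target_compile_options(velox_bitpack_avx512 PRIVATE
  -mavx512f -mavx512bw -mavx512vbmi)
```

Callers then reach the unit only through a plain function declared in a header, so no AVX-512 code leaks into other translation units via inlining.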

yyang52 commented 1 year ago

Thanks for your great feedback.

Regarding the selective unpacking you mentioned, I tried implementing that functionality and ran some initial benchmarks on the oddRows unpacking cases. For bit widths 25-31 there are around 1.6x improvements, but I think that's because the current selective unpack only supports bit widths 1-24 (the decode1to24 function). The other cases do not gain much. I also noticed that the newly added fast C++ veloxBitUnpack does not seem to support selective unpack yet either. Any plan for this part? I have attached the sample code and benchmark results, so you can also have a look. Thanks! unpackSelective.tar.gz

Yuhta commented 1 year ago

The new faster bulk unpack is used in very few places and is still not complete. If you can produce a complete selective AVX512 implementation that exceeds its performance, you can replace it with your implementation. Also, as I mentioned earlier, RleBpDecoder needs to be fixed, otherwise we are not really using it in a lot of places.

yingsu00 commented 1 year ago

The reason I didn't try AVX512 was that Intel didn't make it ubiquitously available on its processors, while we can safely assume AVX2 is on almost all x86 processors.

yingsu00 commented 1 year ago

Btw I also have the selective unpacking that is 5x faster. Will send PRs later. cc @Yuhta @yyang52

yyang52 commented 1 year ago

That makes sense, but I think we could still provide the AVX512 solution as an alternative/fast path for supported platforms with careful checks to make sure it won't crash on other platforms :)

Sure, if you can add enough checks to make sure the running system has the AVX512 support.

A selective unpacking that's 5x faster sounds great! Is it an AVX512 or AVX2 solution? What's the baseline?

It's with AVX2, compared to the current implementation in Velox. I will send a PR after the Parquet refactoring is done.