yyang52 opened this issue 1 year ago
Cc: @yingsu00 @oerling @Yuhta who did many experiments with bit unpacking. Did we end up testing avx 512?
@pedroerp Not as far as I can see. To experiment with this, we would need to fix `RleBpDecoder` for multibit levels first, to see whether an extra AVX512 implementation outweighs the added complexity.
Thanks @Yuhta.
@yyang52 if you're willing to work on this, I think the first step here would be to put a small micro-benchmark together to see whether bit unpacking would actually go faster with avx 512. We tried avx512 in the past in a few other parts of the code, and the results were not as fast as expected. I remember hearing that avx 512 might end up reducing the clock in some architectures. Still, would be interesting to benchmark.
Thanks @pedroerp @Yuhta for your feedback.
We did run some micro-benchmarks comparing the current AVX2/BMI2 solution with our AVX-512 optimized solution, and got the initial results below (the benchmark setup is basically the same as BitpackDecoderBenchmark.cpp, compiled with clang-12, on ICX with an Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz):
=======================================================================================
[...]/tests/BitPackDecoderBenchmarknew.cpp time/iter iters/s Speedup ratio (AVX-512 vs. AVX2)
=======================================================================================
velox_unpack_fullrows_switch_1_8 412.28us 2.43K
avx512_new_unpack_fullrows_switch_1_8 366.30us 2.73K 1.13x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_2_8 431.92us 2.32K
avx512_new_unpack_fullrows_switch_2_8 373.58us 2.68K 1.16x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_3_8 474.15us 2.11K
avx512_new_unpack_fullrows_switch_3_8 412.88us 2.42K 1.15x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_4_8 492.42us 2.03K
avx512_new_unpack_fullrows_switch_4_8 441.36us 2.27K 1.12x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_5_8 514.35us 1.94K
avx512_new_unpack_fullrows_switch_5_8 477.22us 2.10K 1.08x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_6_8 542.78us 1.84K
avx512_new_unpack_fullrows_switch_6_8 511.16us 1.96K 1.06x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_7_8 577.04us 1.73K
avx512_new_unpack_fullrows_switch_7_8 546.63us 1.83K 1.06x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_8_8 583.99us 1.71K
avx512_new_unpack_fullrows_switch_8_8 487.49us 2.05K 1.20x
----------------------------------------------------------------------------
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_1_16 791.99us 1.26K
avx512_new_unpack_fullrows_switch_1_16 782.97us 1.28K 1.01x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_2_16 806.32us 1.24K
avx512_new_unpack_fullrows_switch_2_16 687.39us 1.45K 1.17x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_3_16 858.88us 1.16K
avx512_new_unpack_fullrows_switch_3_16 731.05us 1.37K 1.17x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_4_16 915.86us 1.09K
avx512_new_unpack_fullrows_switch_4_16 771.22us 1.30K 1.19x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_5_16 936.07us 1.07K
avx512_new_unpack_fullrows_switch_5_16 811.02us 1.23K 1.15x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_6_16 989.18us 1.01K
avx512_new_unpack_fullrows_switch_6_16 848.66us 1.18K 1.17x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_7_16 1.13ms 882.83
avx512_new_unpack_fullrows_switch_7_16 904.03us 1.11K 1.25x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_8_16 1.14ms 876.66
avx512_new_unpack_fullrows_switch_8_16 1.13ms 883.98 1.01x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_9_16 1.41ms 711.71
avx512_new_unpack_fullrows_switch_9_16 1.03ms 972.76 1.37x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_10_16 1.38ms 723.75
avx512_new_unpack_fullrows_switch_10_16 1.07ms 932.85 1.29x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_11_16 1.75ms 572.91
avx512_new_unpack_fullrows_switch_11_16 1.13ms 885.27 1.55x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_12_16 1.52ms 659.99
avx512_new_unpack_fullrows_switch_12_16 1.19ms 842.70 1.28x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_13_16 2.06ms 484.73
avx512_new_unpack_fullrows_switch_13_16 1.25ms 802.24 1.66x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_14_16 1.89ms 529.41
avx512_new_unpack_fullrows_switch_14_16 1.31ms 764.70 1.44x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_15_16 1.83ms 547.45
avx512_new_unpack_fullrows_switch_15_16 1.37ms 728.07 1.33x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_16_16 980.40us 1.02K
avx512_new_unpack_fullrows_switch_16_16 979.07us 1.02K 1.00x
----------------------------------------------------------------------------
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_1_32 2.36ms 423.05
avx512_new_unpack_fullrows_switch_1_32 2.12ms 472.22 1.12x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_2_32 2.45ms 408.24
avx512_new_unpack_fullrows_switch_2_32 1.67ms 598.45 1.47x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_3_32 2.50ms 400.42
avx512_new_unpack_fullrows_switch_3_32 1.75ms 571.69 1.43x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_4_32 2.66ms 376.17
avx512_new_unpack_fullrows_switch_4_32 1.81ms 551.54 1.47x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_5_32 2.75ms 363.89
avx512_new_unpack_fullrows_switch_5_32 2.06ms 485.34 1.33x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_6_32 2.82ms 354.63
avx512_new_unpack_fullrows_switch_6_32 2.15ms 465.99 1.31x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_7_32 2.78ms 359.78
avx512_new_unpack_fullrows_switch_7_32 2.30ms 435.36 1.21x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_8_32 2.90ms 344.44
avx512_new_unpack_fullrows_switch_8_32 2.15ms 465.08 1.35x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_9_32 2.83ms 353.60
avx512_new_unpack_fullrows_switch_9_32 2.31ms 432.96 1.22x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_10_32 2.90ms 345.37
avx512_new_unpack_fullrows_switch_10_32 2.40ms 415.88 1.20x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_11_32 2.97ms 336.26
avx512_new_unpack_fullrows_switch_11_32 2.51ms 398.29 1.18x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_13_32 3.09ms 323.69
avx512_new_unpack_fullrows_switch_13_32 2.74ms 365.63 1.13x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_15_32 3.24ms 308.36
avx512_new_unpack_fullrows_switch_15_32 2.91ms 343.71 1.11x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_17_32 3.60ms 277.60
avx512_new_unpack_fullrows_switch_17_32 3.05ms 327.52 1.18x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_19_32 3.83ms 260.80
avx512_new_unpack_fullrows_switch_19_32 3.24ms 308.34 1.18x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_21_32 3.97ms 251.90
avx512_new_unpack_fullrows_switch_21_32 3.44ms 290.41 1.15x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_24_32 4.15ms 240.89
avx512_new_unpack_fullrows_switch_24_32 3.74ms 267.19 1.11x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_28_32 4.56ms 219.46
avx512_new_unpack_fullrows_switch_28_32 4.13ms 241.99 1.10x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_30_32 4.74ms 211.18
avx512_new_unpack_fullrows_switch_30_32 4.32ms 231.39 1.10x
----------------------------------------------------------------------------
velox_unpack_fullrows_switch_32_32 2.57ms 389.45
avx512_new_unpack_fullrows_switch_32_32 2.52ms 396.50 1.02x
----------------------------------------------------------------------------
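For readers skimming the numbers, the operation being benchmarked can be sketched in scalar form; this is an illustrative reference only, not the Velox implementation (which vectorizes it with AVX2/BMI2 or AVX-512):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar reference for bit unpacking: decode `count` values of `bitWidth`
// bits each (bitWidth in [1, 32], little-endian bit order) into 32-bit
// outputs. `packed` must have at least 8 readable bytes past the last
// value's starting byte; the SIMD kernels assume similar padding.
std::vector<uint32_t> unpackScalar(
    const uint8_t* packed, size_t count, int bitWidth) {
  std::vector<uint32_t> out(count);
  const uint64_t mask = (1ull << bitWidth) - 1;
  uint64_t bitPos = 0;
  for (size_t i = 0; i < count; ++i) {
    uint64_t word = 0;
    // Assemble 8 bytes starting at the byte containing bit `bitPos`.
    for (int b = 0; b < 8; ++b) {
      word |= static_cast<uint64_t>(packed[(bitPos >> 3) + b]) << (8 * b);
    }
    out[i] = static_cast<uint32_t>((word >> (bitPos & 7)) & mask);
    bitPos += bitWidth;
  }
  return out;
}
```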
Regarding the concern that AVX-512 might reduce the clock frequency on some architectures: it should not be a big drawback here, because the optimization mostly uses AVX-512 "light" operations, which run at the same core power level as AVX2. Per our initial observation, the frequency stays around 2.5GHz when running the workload with either AVX2 or AVX-512.
Anyway, your feedback is highly valuable, and we will conduct more tests and analysis.
@yyang52 Do you have a comparison for selective unpack (i.e. `oddrows`)?
We could always have a look at exactly how this works. Do we have the code for the experiments? This is a very localized change, but it probably does not show up in end-to-end times, since bit unpacking is below 3% of the profile in pretty much anything one does.
We can't depend on avx 512, but could use this in some cases. The probably higher-impact item would be to use avx 512 for filters. For example, you could translate the check of BigintValuesUsingHashTable to avx 512. There is also a compressed store instruction that we could use to replace some permutes, if this were faster. Could we get a fast path of table scan that is 1.3x faster than the current one with avx 512? Maybe within the possible. If I had to choose, optimizing the timing of IO could be a higher return. But this is ongoing.
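The compressed-store idea can be sketched as below. The function name `compressStore` is illustrative, not Velox code; the AVX-512 path uses `_mm512_mask_compressstoreu_epi32`, which writes the lanes selected by a filter mask contiguously in one instruction, replacing the permute-table lookup plus store that AVX2 needs. The scalar fallback runs on non-AVX-512 builds:

```cpp
#include <cstddef>
#include <cstdint>
#ifdef __AVX512F__
#include <immintrin.h>
#endif

// Append the values[i] whose bit i is set in `mask` to `out`, contiguously.
// Returns the number of values written. `values` must have 16 readable
// elements. Sketch only; the name and shape are illustrative.
size_t compressStore(const int32_t* values, uint16_t mask, int32_t* out) {
#ifdef __AVX512F__
  // One instruction compacts the selected lanes and stores them.
  __m512i v = _mm512_loadu_si512(values);
  _mm512_mask_compressstoreu_epi32(out, mask, v);
  return static_cast<size_t>(__builtin_popcount(mask));
#else
  // Scalar fallback with the same semantics.
  size_t n = 0;
  for (int i = 0; i < 16; ++i) {
    if (mask & (1u << i)) {
      out[n++] = values[i];
    }
  }
  return n;
#endif
}
```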
Thanks @oerling @Yuhta for your feedback.
I have attached the files with the related code: bitpackAVX512.tar.gz. It would be great if you could have a look and help review. These are more of an initial draft, and I am still working on refining the code with my colleague @mrambacher. For now I put the AVX-512 related files in as utilities that do not interfere with the existing code path (but we could add some compile options and integrate more closely), and added a benchmark to compare the performance, with the updated benchmark results. The correctness is also tested, locally for now.
For the AVX-512 accelerated filters you mentioned, we have done an initial investigation as well. You're right that we can use compressed_store in AVX-512 to replace the permutes. I tried to implement an AVX-512 version of one of the functions ("processFixedFilter"), but the existing XSIMD might have some compile issues with AVX-512. Also, some of my colleagues are working on filter optimization for string/decimal types, which may be possible to integrate into Velox for more speedup.
For the bit unpack code it should be OK as long as it is compiled as a separate compilation unit; we can enable AVX512 just on that unit. Just do not forget to support selective unpacking.
For the filters it is the same, if it is just filter code and you can isolate the AVX512 part into its own compilation unit, we are probably fine.
The hard part is the selective reading code (e.g. `ColumnVisitors.h`) that is templated and cannot easily be isolated into its own compilation unit. For this you would need to fix the XSIMD usage to support all AVX512 cases. I doubt it is worth the trouble to do that.
Thanks for your great feedback.
Regarding the selective unpacking you mentioned, I have tried to implement such functionality and ran some initial benchmarks of the oddRows unpacking cases. For bit widths 25-31 there are around 1.6x improvements, but I think that's because the current selective unpack only supports bit widths 1-24 (the `decode1to24` function). The other cases do not get much perf gain. But I also noticed that the newly added fast C++ veloxBitUnpack does not seem to support selective unpack yet either. Any plan for this part? I have attached the sample code and benchmark results; please have a look. Thanks!
unpackSelective.tar.gz
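For context, "selective unpack" means decoding only the requested row positions (e.g. oddRows) instead of the whole run. A scalar sketch of the idea, with an illustrative name (this is not the attached code):

```cpp
#include <cstdint>
#include <vector>

// Decode only the values at positions `rows` from a stream of `bitWidth`-bit
// packed values (bitWidth in [1, 32], little-endian bit order). `packed` must
// have at least 8 readable bytes past the last accessed value's starting byte.
std::vector<uint32_t> unpackSelective(
    const uint8_t* packed, int bitWidth, const std::vector<int32_t>& rows) {
  std::vector<uint32_t> out;
  out.reserve(rows.size());
  const uint64_t mask = (1ull << bitWidth) - 1;
  for (int32_t row : rows) {
    // Jump straight to the bit offset of the requested row.
    uint64_t bitPos = static_cast<uint64_t>(row) * bitWidth;
    uint64_t word = 0;
    for (int b = 0; b < 8; ++b) {
      word |= static_cast<uint64_t>(packed[(bitPos >> 3) + b]) << (8 * b);
    }
    out.push_back(static_cast<uint32_t>((word >> (bitPos & 7)) & mask));
  }
  return out;
}
```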
The new faster bulk unpack is used in very few places and is still not complete. If you can produce a complete selective AVX512 implementation that exceeds its performance, you can replace it with your implementation. Also, as I mentioned earlier, `RleBpDecoder` needs to be fixed; otherwise we are not really using it in a lot of places.
The reason I didn't try AVX512 was that Intel didn't make it ubiquitously available on its processors, while we can safely assume AVX2 is on almost all x86 processors.
Btw I also have a selective unpacking that is 5x faster. Will send PRs later. cc @Yuhta @yyang52
That makes sense, but I think we could still provide the AVX512 solution as an alternative/fast path for supported platforms with careful checks to make sure it won't crash on other platforms :)
Sure, if you can add enough checks to make sure the running system has the AVX512 support.
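On GCC/Clang, one way to do such a check is `__builtin_cpu_supports`, which queries CPUID at runtime. A dispatch sketch; the particular feature set (F + BW + VBMI) is an assumption based on the byte-permute tricks discussed above:

```cpp
// Runtime dispatch sketch: take the AVX-512 path only when the CPU actually
// reports support, otherwise fall back to the existing AVX2/scalar path.
bool useAvx512Unpack() {
#if defined(__GNUC__) || defined(__clang__)
  // Assumed feature set for a VBMI-based unpack kernel.
  return __builtin_cpu_supports("avx512f") &&
         __builtin_cpu_supports("avx512bw") &&
         __builtin_cpu_supports("avx512vbmi");
#else
  // No portable check available here; stay on the default path.
  return false;
#endif
}
```

The check runs once at startup (or is cached), so the branch cost is negligible next to the unpack work itself.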
Selective unpacking that is 5x faster sounds great! Is it an AVX512 or AVX2 solution? What's the baseline?
It's with AVX2, compared to the current implementation in Velox. I will send the PR after the Parquet refactoring is done.
Description
RLE/bitpacking encoding provides high efficiency for Parquet and is heavily used. When reading Parquet files, it's important to have a fast path for bit unpacking to improve efficiency. Velox currently has a fast C++ bit unpacking library using AVX2 and BMI2 instructions, but with AVX-512's VBMI and other instructions the performance could be improved further (our initial benchmark shows up to a 1.6x boost over the current AVX2/BMI2 implementation with the clang compiler). As more and more platforms support AVX512 instructions, we could do some further optimizations with it.
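For reference, Parquet's RLE/bit-packed hybrid encoding prefixes each run with a ULEB128 varint whose low bit selects the run type; the bit unpacking discussed above is the decode step for the bit-packed runs. A minimal header-parsing sketch (illustrative, not the Velox decoder):

```cpp
#include <cstddef>
#include <cstdint>

// Parquet RLE/bit-packed hybrid run header (sketch).
struct RunHeader {
  bool bitPacked;     // true: bit-packed run, false: RLE run
  uint32_t count;     // number of values in this run
  size_t headerBytes; // bytes consumed by the varint
};

RunHeader parseRunHeader(const uint8_t* data) {
  // Decode the ULEB128 varint: 7 payload bits per byte, MSB = continuation.
  uint32_t header = 0;
  size_t i = 0;
  for (int shift = 0;; shift += 7, ++i) {
    header |= static_cast<uint32_t>(data[i] & 0x7f) << shift;
    if ((data[i] & 0x80) == 0) {
      break;
    }
  }
  bool bitPacked = (header & 1) != 0;
  // Bit-packed runs store groups of 8 values; RLE runs store a repeat count.
  uint32_t count = bitPacked ? (header >> 1) * 8 : (header >> 1);
  return {bitPacked, count, i + 1};
}
```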