
[Design] Fast C++ Bit Unpacking #2348

Closed yingsu00 closed 2 years ago

yingsu00 commented 2 years ago

ying@ahana.io

Parquet uses RLE and bit-packing encodings extensively. The usages in Parquet V1 include:

It’s crucial to have a fast bit unpacking library for the Parquet reader. In this doc we present the design and lessons learnt from implementing this little lib.

Bulk Read Contiguous Values

SIMD Algorithms

AVX2 and BMI2 are available on almost all x86/amd64 CPUs nowadays. We are going to use the _pdep_u64 intrinsic from BMI2 and the _mm256_cvtepi[x]_epi[y] intrinsics from AVX2. In Lemire's https://github.com/lemire/LittleIntPacker, bmipacking32.c uses these intrinsics for some bit widths but falls back to non-SIMD implementations for others; presumably those were faster than the SIMD intrinsics when the library was written. That was many years ago, however, and what was faster then may not be faster now: on Intel CPUs these SIMD intrinsics have sped up considerably over the last 10 years, while other instructions have not improved as much. In my benchmarks on Intel Coffee Lake, the algorithms using these SIMD intrinsics outperformed all non-SIMD implementations with manually unrolled loops and plain bit operations like "and", "or", and shifts.

However, note that these intrinsics are slower on AMD CPUs, and M1 and other ARM processors may have different costs as well, so the design for those CPUs would be different. This design targets only Intel CPUs that support AVX2.

The actual design also depends on the input bit width and the output type. The output types we support now are:

The input bit width for a given output type is in [1, number of bits in the output type]. In Parquet, every bit-packed run contains a multiple of 8 values, so our implementations process a multiple of 8 values at a time. One advantage of this is that we always consume full bytes in every iteration and never have to shift bits across iterations, which is quite expensive.

Now we talk about the implementation for each output type.

uint8_t

When the output is uint8_t, we process 8 values at a time, and the input buffer pointer advances by bitWidth bytes per iteration. We load the memory into an integer, then use _pdep_u64 to deposit/scatter the bits in this integer into a uint64_t using a predefined mask. For example, if the input bitWidth is 4, the mask is 0x0f0f0f0f0f0f0f0f, and the bit-packed value 0x12345678 would be unpacked to 0x0102030405060708.
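A minimal sketch of such a mask table, assuming the masks are indexed by bit width (the kPdepMask8 name matches the snippets below, but the actual table in the library may differ):

#include <cstdint>

// Hypothetical pdep mask table: kPdepMask8[bitWidth] has the low bitWidth bits
// of every output byte set, so _pdep_u64(packed, kPdepMask8[bitWidth]) spreads
// 8 packed values into 8 one-byte lanes.
static constexpr uint64_t kPdepMask8[9] = {
    0,                     // bitWidth 0 (unused)
    0x0101010101010101ULL, // bitWidth 1
    0x0303030303030303ULL, // bitWidth 2
    0x0707070707070707ULL, // bitWidth 3
    0x0f0f0f0f0f0f0f0fULL, // bitWidth 4
    0x1f1f1f1f1f1f1f1fULL, // bitWidth 5
    0x3f3f3f3f3f3f3f3fULL, // bitWidth 6
    0x7f7f7f7f7f7f7f7fULL, // bitWidth 7
    0xffffffffffffffffULL, // bitWidth 8
};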

Note that the bitWidth <= 8. If we cast the input buffer to uint64_t like this `uint64_t val = *reinterpret_cast<const uint64_t*>(inputBuffer)` and don't specially handle the tail, the code might fail with a segmentation fault because the last 8 values occupy fewer than 64 bits. In the real implementation, we could either directly dereference the input buffer after casting it to `uint64_t` or `uint32_t` with special handling of the last 8 values, or we could use `std::memcpy` to copy exactly the number of bytes needed. However, I found that this sometimes affects the performance a lot: the clang compiler SOMETIMES fails to optimize the loop using `std::memcpy`. Let's look at the disassembly of these two simple functions:

// 266us to decode 8M values
static inline void unpack1(
        const uint8_t *&inputBuffer,
        uint64_t numValues,
        uint8_t *&outputBuffer) {
    uint64_t numBytes = (numValues + 7) / 8;
    auto readEndOffset = inputBuffer + numBytes;

    while (inputBuffer < readEndOffset) {
        uint64_t val = *reinterpret_cast<const uint8_t *>(inputBuffer);
        *(reinterpret_cast<uint64_t *>(outputBuffer)) = _pdep_u64(val, kPdepMask8[1]);
        inputBuffer++;
        outputBuffer += 8;
    }
}

The corresponding assembly code:

0000000100003480 <__ZL15unpack1to8_casthRPKhyyRPh>:
100003480: 55                    pushq  %rbp
100003481: 48 89 e5              movq   %rsp, %rbp
100003484: 48 8b 17              movq   (%rdi), %rdx         #  rdi Function argument #1 (inputBuffer)  
100003487: 4c 8d 8a 00 75 6d 00  leaq   7173376(%rdx), %r9   # rdx Function argument #3 (inputBufferLen)
10000348e: 48 8b 0e              movq   (%rsi), %rcx        #  rsi Function argument #2 (inputBuffer)   rcx Function argument #4 (numValue)
100003491: 49 b8 7f 7f 7f 7f 7f 7f 7f 7f movabsq    $9187201950435737471, %r8
10000349b: 0f 1f 44 00 00        nopl   (%rax,%rax)
1000034a0: 48 8b 02              movq   (%rdx), %rax
1000034a3: c4 c2 fb f5 c0        pdepq  %r8, %rax, %rax
1000034a8: 48 89 01              movq   %rax, (%rcx)
1000034ab: 48 83 c2 07           addq   $7, %rdx
1000034af: 48 89 17              movq   %rdx, (%rdi)
1000034b2: 48 8b 0e              movq   (%rsi), %rcx
1000034b5: 48 83 c1 08           addq   $8, %rcx
1000034b9: 48 89 0e              movq   %rcx, (%rsi)       
1000034bc: 48 8b 17              movq   (%rdi), %rdx
1000034bf: 4c 39 ca              cmpq   %r9, %rdx
1000034c2: 72 dc                 jb 0x1000034a0 <__ZL15unpack1to8_casthRPKhyyRPh+0x20>
1000034c4: 5d                    popq   %rbp
1000034c5: c3                    retq
1000034c6: 66 2e 0f 1f 84 00 00 00 00 00 nopw   %cs:(%rax,%rax)

Code using memcpy:

// 1998us to decode 8M values
static inline void unpack1_memcpy(
        const uint8_t *&inputBuffer,
        uint64_t numValues,
        uint8_t *&outputBuffer) {
    uint64_t numBytes = (numValues + 7) / 8;
    auto readEndOffset = inputBuffer + numBytes;

    uint64_t val = 0;
    while (inputBuffer < readEndOffset) {
        std::memcpy(&val, inputBuffer, 1);
        *(reinterpret_cast<uint64_t *>(outputBuffer)) =
                _pdep_u64(val, kPdepMask8[1]);
        inputBuffer++;
        outputBuffer += 8;
    }
}
00000001000034d0 <__ZL17unpack1to8_memcpyhRPKhyyRPh>:
1000034d0: 55                    pushq  %rbp
1000034d1: 48 89 e5              movq   %rsp, %rbp
1000034d4: 41 57                 pushq  %r15
1000034d6: 41 56                 pushq  %r14
1000034d8: 41 55                 pushq  %r13
1000034da: 41 54                 pushq  %r12
1000034dc: 53                    pushq  %rbx
1000034dd: 48 83 ec 18           subq   $24, %rsp
1000034e1: 48 8b 1e              movq   (%rsi), %rbx
1000034e4: 48 c7 45 d0 00 00 00 00 movq  $0, -48(%rbp)
1000034ec: 40 84 ff              testb  %dil, %dil
1000034ef: 0f 84 61 00 00 00     je 0x100003556 <__ZL17unpack1to8_memcpyhRPKhyyRPh+0x86>
1000034f5: 49 89 d6              movq   %rdx, %r14
1000034f8: 49 89 f7              movq   %rsi, %r15
1000034fb: 41 89 fc              movl   %edi, %r12d
1000034fe: 48 8d 05 0b 0a 00 00  leaq   2571(%rip), %rax  # 100003f10 <__ZL10kPdepMask8>
100003505: 4a 8b 04 e0           movq   (%rax,%r12,8), %rax
100003509: 48 89 45 c0           movq   %rax, -64(%rbp)
10000350d: 49 69 c4 00 a3 0f 00  imulq  $1024768, %r12, %rax
100003514: 48 01 d8              addq   %rbx, %rax
100003517: 48 89 45 c8           movq   %rax, -56(%rbp)
10000351b: 4c 8b 2a              movq   (%rdx), %r13
10000351e: 66 90                 nop
100003520: 48 8d 7d d0           leaq   -48(%rbp), %rdi
100003524: 48 89 de              movq   %rbx, %rsi
100003527: 4c 89 e2              movq   %r12, %rdx
10000352a: e8 e3 07 00 00        callq  0x100003d12 <dyld_stub_binder+0x100003d12>
10000352f: 48 8b 45 d0           movq   -48(%rbp), %rax
100003533: c4 e2 fb f5 45 c0     pdepq  -64(%rbp), %rax, %rax
100003539: 49 89 45 00           movq   %rax, (%r13)
10000353d: 4c 01 e3              addq   %r12, %rbx
100003540: 49 89 1f              movq   %rbx, (%r15)
100003543: 4d 8b 2e              movq   (%r14), %r13
100003546: 49 83 c5 08           addq   $8, %r13
10000354a: 4d 89 2e              movq   %r13, (%r14)
10000354d: 49 8b 1f              movq   (%r15), %rbx
100003550: 48 3b 5d c8           cmpq   -56(%rbp), %rbx
100003554: 72 ca                 jb 0x100003520 <__ZL17unpack1to8_memcpyhRPKhyyRPh+0x50>
100003556: 48 83 c4 18           addq   $24, %rsp
10000355a: 5b                    popq   %rbx
10000355b: 41 5c                 popq   %r12
10000355d: 41 5d                 popq   %r13
10000355f: 41 5e                 popq   %r14
100003561: 41 5f                 popq   %r15
100003563: 5d                    popq   %rbp
100003564: c3                    retq
100003565: 66 2e 0f 1f 84 00 00 00 00 00 nopw   %cs:(%rax,%rax)
10000356f: 90     

We can see that the implementation using memcpy spends many more instructions calculating the address. In my benchmarks the second implementation took 1998us to decode 8M values, while the first one took only 266us. This might be a clang issue and it does not always happen, but to avoid such a possible regression, I chose to use casting in the loop and get the last 8 values using std::memcpy; a sketch of this scheme is shown below. By the way, it seems Lemire's bmipacking implementation doesn't handle this and always assumes there are >= 8 bytes left in the buffer.
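Putting the pieces together, here is a minimal, hedged sketch of that scheme, assuming the hypothetical kPdepMask8 table from above (function and parameter names are illustrative, not the actual velox API):

#include <cstdint>
#include <cstring>
#include <immintrin.h> // _pdep_u64 (requires BMI2)

// Unpack numValues bitWidth-bit values (bitWidth <= 8) into uint8_t outputs.
// Groups whose 8-byte load stays inside the input buffer use a plain casted
// load; only the tail falls back to std::memcpy so we never read past the end.
static inline void unpackTo8Sketch(
        uint8_t bitWidth,
        const uint8_t*& inputBuffer,
        uint64_t inputBytes,
        uint64_t numValues,
        uint8_t*& outputBuffer) {
    const uint64_t mask = kPdepMask8[bitWidth];
    const uint8_t* end = inputBuffer + inputBytes;

    while (numValues >= 8) {
        uint64_t val = 0;
        if (inputBuffer + 8 <= end) {
            // Fast path: plain 64-bit load that clang keeps in the loop body.
            val = *reinterpret_cast<const uint64_t*>(inputBuffer);
        } else {
            // Tail: copy only the bytes that are actually there.
            std::memcpy(&val, inputBuffer, static_cast<size_t>(end - inputBuffer));
        }
        *reinterpret_cast<uint64_t*>(outputBuffer) = _pdep_u64(val, mask);
        inputBuffer += bitWidth; // 8 values occupy bitWidth bytes.
        outputBuffer += 8;
        numValues -= 8;
    }
}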

Note that even though we don't consume all 64 bits read in one iteration, and the read might not be aligned, this doesn't seem to affect performance. This is likely a result of CPU caching: a cache line (usually 64 bytes) is much larger than 8 bytes, so subsequent reads are served from the cache.

When the input bitWidth is 8, we could just do a plain memcpy. However, clang does a good job here and the memcpy has the SAME performance as the algorithm using _pdep_u64, so to simplify the code I didn't special-case it.

Another takeaway is that manually unrolling the loop does NOT make any difference. There is no data dependency among the loop iterations, and just using a plain loop can make the code shorter.

uint16_t

bitWidth in [1, 7]: Process 2 * bitWidth bytes (16 values) at a time.

Copy 2 * bitWidth bytes into 2 integers, 8 values each. Then call _pdep_u64 on each of the integers to deposit the values from bitWidth bits to 8 bits wide, and store the 8 * 2 output bytes to a piece of memory aligned to 16 bytes. Then use _mm256_cvtepu8_epi16 to widen these 16 values to 16 bits each and store them back to memory (sketched below).
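A minimal sketch of one such iteration, assuming the hypothetical kPdepMask8 table from above (helper names are illustrative):

#include <cstdint>
#include <cstring>
#include <immintrin.h>

// Unpack 16 bitWidth-bit values (bitWidth in [1, 7]) into uint16_t outputs:
// two _pdep_u64 calls expand them to 16 bytes, then _mm256_cvtepu8_epi16
// zero-extends the bytes to 16-bit lanes.
static inline void unpack16x1to7Sketch(
        uint8_t bitWidth,
        const uint8_t*& inputBuffer,
        uint16_t*& outputBuffer) {
    alignas(16) uint8_t tmp[16];
    uint64_t first = 0;
    uint64_t second = 0;
    // The first 8 values occupy bitWidth bytes, the next 8 the following
    // bitWidth bytes; memcpy is as fast as a casted load here (see note below).
    std::memcpy(&first, inputBuffer, bitWidth);
    std::memcpy(&second, inputBuffer + bitWidth, bitWidth);
    *reinterpret_cast<uint64_t*>(tmp) = _pdep_u64(first, kPdepMask8[bitWidth]);
    *reinterpret_cast<uint64_t*>(tmp + 8) = _pdep_u64(second, kPdepMask8[bitWidth]);

    // Widen the 16 bytes to 16 uint16_t lanes and store them.
    __m256i widened = _mm256_cvtepu8_epi16(
        _mm_load_si128(reinterpret_cast<const __m128i*>(tmp)));
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(outputBuffer), widened);

    inputBuffer += 2 * bitWidth;
    outputBuffer += 16;
}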

Note that for uint16_t, using memcpy does not cause the big difference we saw for uint8_t. It is even slightly faster than dereferencing the memory as a uint64_t integer and shifting its bits before the second _pdep_u64.

Another observation is that manually unrolling the loop to process 32 or 64 values in one iteration does NOT make any difference.

bitWidth = 8: Process 2 * bitWidth bytes (16 values) at a time.

For this case we simply call _mm256_cvtepu8_epi16 without _pdep_u64. This is about 10% faster than the above implementation for bitWidths 1-7.

bitWidth in [9, 15]: Process bitWidth bytes (2 * 4 values) at a time.

We read the memory twice in each iteration of the loop. The first time we read ceil(bitWidth / 2) bytes, and the second time the remaining bytes that make up the 8 values. E.g. if bitWidth = 9, read 5 bytes into the first integer and the next 4 bytes into the second integer. Then use _pdep_u64 to deposit the 4 values in the first integer to 16 bits each directly. Next, shift the first integer right by 36 bits and the second integer left by 4 bits, and OR the two results into a third integer; this uses up the remaining unhandled 4 bits of the first integer. Call _pdep_u64 on the third integer and store the second 4 values to memory (sketched below).
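A minimal sketch of one such iteration, assuming a hypothetical kPdepMask16 table analogous to kPdepMask8 above (e.g. kPdepMask16[9] == 0x01ff01ff01ff01ff; names are illustrative):

#include <cstdint>
#include <cstring>
#include <immintrin.h>

// Unpack 8 bitWidth-bit values (bitWidth in [9, 15]) into uint16_t outputs.
static inline void unpack8x9to15Sketch(
        uint8_t bitWidth,
        const uint8_t*& inputBuffer,
        uint16_t*& outputBuffer) {
    const uint64_t mask = kPdepMask16[bitWidth];
    const uint32_t firstBytes = (bitWidth + 1) / 2; // ceil(bitWidth / 2)

    uint64_t first = 0;
    uint64_t second = 0;
    std::memcpy(&first, inputBuffer, firstBytes);
    std::memcpy(&second, inputBuffer + firstBytes, bitWidth - firstBytes);

    // The first 4 values sit in the low 4 * bitWidth bits of `first`.
    *reinterpret_cast<uint64_t*>(outputBuffer) = _pdep_u64(first, mask);

    // Stitch the leftover high bits of `first` onto `second` to form the
    // next 4 values, then deposit them the same way. For even bit widths
    // leftoverBits is 0, so the same code covers both cases.
    const uint32_t leftoverBits = 8 * firstBytes - 4 * bitWidth;
    uint64_t third = (first >> (4 * bitWidth)) | (second << leftoverBits);
    *reinterpret_cast<uint64_t*>(outputBuffer + 4) = _pdep_u64(third, mask);

    inputBuffer += bitWidth;
    outputBuffer += 8;
}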

When the bitWidth is even, we don't need to shift the integers. However, special-casing even bitWidths doesn't change the performance, so to keep it simple we use the above implementation for both odd and even bitWidths.

bitWidth = 16

Just memcpy. It's faster than the above.

uint32_t

bitWidth in [1, 7]: Process 2 * bitWidth bytes (16 values) at a time.

The same as uint16_t. Manually unrolling does NOT make a difference.

bitWidth = 8: Process 8 bytes (8 values) at a time.

Directly use _mm256_cvtepu8_epi32 to convert 8 bits to 32 bits.
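A minimal sketch of one iteration (names are illustrative):

#include <cstdint>
#include <immintrin.h>

// bitWidth == 8 with uint32_t output: zero-extend 8 packed bytes straight
// into 8 32-bit lanes, no _pdep_u64 needed.
static inline void unpack8x8to32Sketch(
        const uint8_t*& inputBuffer,
        uint32_t*& outputBuffer) {
    __m128i bytes =
        _mm_loadl_epi64(reinterpret_cast<const __m128i*>(inputBuffer));
    _mm256_storeu_si256(
        reinterpret_cast<__m256i*>(outputBuffer), _mm256_cvtepu8_epi32(bytes));
    inputBuffer += 8;
    outputBuffer += 8;
}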

bitWidth in [9, 15]: Process bitWidth bytes (2 * 4 values) at a time.

Same as uint16_t

bitWidth = 16

Directly use _mm256_cvtepu16_epi32 to convert 16 bits to 32 bits.

Other Considerations

Function overloading vs. templates

We could use templated functions with the output type as a template parameter, or use function overloading. There are 2 reasons I chose function overloading over templates:

  1. Functions for different output types have different implementations. Templated functions are convenient when the implementation logic for different types is identical, but that is not the case for this library, and templates would make the code harder to read and maintain.
  2. The output types can only be 1, 2, 4, 8 byte integers. We don't need to generate code for the other types.

Class vs. Global Functions In Utility Headers

We could wrap the input and output buffers in a class and provide public functions to unpack the data. However, benchmarks show that calling the functions on a class instance is slower than calling them as global functions. This may be related to function dispatch cost and/or inlining, so I chose to put the functions in a header file. The full implementation is about 500 lines of code and is pretty small.

Function Pointers vs. Switches

The performance of these two is about the same. Since the input bitWidth and output type are fixed for a given column, the CPU can predict the correct function or switch branch, and the dispatch cost is minimal.
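A minimal sketch of the switch flavor, reusing the hypothetical unpackTo8Sketch from above (passing the bit width as a literal in each case lets the compiler fold the mask lookup into a constant):

// Dispatch once per run; the branch is effectively free because bitWidth is
// constant for the whole column.
inline void unpackRunSketch(
        uint8_t bitWidth,
        const uint8_t*& input,
        uint64_t inputBytes,
        uint64_t numValues,
        uint8_t*& output) {
    switch (bitWidth) {
        case 1:
            unpackTo8Sketch(1, input, inputBytes, numValues, output);
            break;
        case 2:
            unpackTo8Sketch(2, input, inputBytes, numValues, output);
            break;
        // ... one case per supported bit width up to 8 ...
        default:
            unpackTo8Sketch(bitWidth, input, inputBytes, numValues, output);
            break;
    }
}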

Benchmark Results

The following benchmark results are from BitUnpackingBenchmark.cpp for unpacking 8M values. The code was compiled with Apple clang version 12.0.5 (clang-1205.0.22.11) on macOS, with an Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz (Coffee Lake). The results are in microseconds (us).


Output Bit Width | Bit Width | arrow | duckdb | fastpforlib | lemire | velox
--- | --- | --- | --- | --- | --- | ---
8 | 1 | 2,750 | 10,430 | 4,280 | | 292
8 | 2 | 2,730 | 11,660 | 3,880 | | 307
8 | 3 | 2,790 | 12,590 | 4,640 | | 317
8 | 4 | 2,680 | 13,890 | 3,450 | | 334
8 | 5 | 3,030 | 15,470 | 4,750 | | 361
8 | 6 | 3,120 | 16,780 | 4,500 | | 460
8 | 7 | 3,380 | 18,110 | 5,610 | | 487
8 | 8 | 2,900 | 19,150 | 2,680 | | 521
16 | 1 | 2,750 | 10,660 | 2,320 | | 549
16 | 2 | 2,670 | 11,770 | 2,280 | | 585
16 | 3 | 2,790 | 12,770 | 2,620 | | 709
16 | 4 | 2,610 | 14,130 | 2,320 | | 751
16 | 5 | 3,050 | 16,840 | 2,990 | | 820
16 | 6 | 3,030 | 16,850 | 6,200 | | 890
16 | 7 | 3,210 | 18,080 | 3,250 | | 1,130
16 | 8 | 2,850 | 19,540 | 2,380 | | 956
16 | 9 | 3,510 | 21,530 | 3,520 | | 1,340
16 | 10 | 3,570 | 23,160 | 3,500 | | 2,100
16 | 11 | 3,790 | 24,290 | 3,840 | | 2,190
16 | 12 | 3,680 | 25,780 | 3,600 | | 2,260
16 | 13 | 3,970 | 27,090 | 3,990 | | 2,580
16 | 14 | 3,990 | 28,940 | 4,080 | | 2,330
16 | 15 | 4,150 | 30,660 | 4,330 | | 2,440
16 | 16 | 3,230 | 32,570 | 2,970 | | 2,370
32 | 1 | 2,730 | 10,740 | 2,530 | 1,520 | 1,620
32 | 2 | 2,680 | 11,670 | 2,550 | 1,580 | 1,650
32 | 3 | 2,800 | 12,820 | 2,810 | 1,640 | 1,700
32 | 4 | 2,640 | 14,110 | 2,530 | 1,650 | 1,730
32 | 5 | 3,080 | 15,510 | 2,950 | 1,810 | 1,830
32 | 6 | 3,150 | 16,680 | 2,950 | 2,250 | 1,900
32 | 7 | 3,240 | 17,810 | 3,480 | 2,280 | 1,920
32 | 8 | 2,950 | 19,140 | 2,610 | 2,380 | 1,960
32 | 9 | 3,430 | 27,230 | 3,760 | 2,480 | 2,060
32 | 10 | 3,570 | 34,630 | 3,780 | 2,440 | 2,090
32 | 11 | 3,730 | 41,070 | 4,060 | 2,740 | 2,220
32 | 12 | 3,660 | 48,670 | 3,850 | 2,730 | 2,240
32 | 13 | 3,950 | 55,820 | 4,100 | 3,030 | 2,330
32 | 14 | 3,990 | 62,250 | 3,680 | 3,270 | 2,580
32 | 15 | 4,130 | 70,350 | 3,810 | 3,720 | 2,470
32 | 16 | 3,240 | 77,600 | 3,490 | 3,840 | 2,340

Read and Skip Values In a RowSet

Sometimes queries have filters. Velox reads the columns with filters first and skips the rows that did not pass the filters when reading the later columns. Our bit unpacking library will support reading such non-contiguous values.

(To Be Finished)

Push Down Filters To Bit Packed Data

(To Be Finished)

yingsu00 commented 2 years ago

cc @oerling