apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[C++][Parquet] Investigate optimizing level decoding #40845

Open mapleFU opened 8 months ago

mapleFU commented 8 months ago

Describe the enhancement requested

Parquet C++ uses RLE encoding for the repetition levels (rep-levels) and definition levels (def-levels). Benchmark results can be seen here: https://github.com/apache/arrow/pull/39705#issuecomment-1921637432

The benchmark shows that RLE is a bit slower than BitPack. Besides, BitPack is also slow here, see https://github.com/apache/arrow/issues/39227#issue-2041139028 . Arrow doesn't have a native unpack16, so it falls back to unpack32 and then copies the 32-bit values into 16-bit ones. So there is a lot of room to optimize Parquet level decoding.
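To make the "unpack32, then copy 32 to 16" cost concrete, here is a minimal scalar sketch of the two paths. `UnpackScalar`, `DecodeViaUnpack32` and `DecodeDirect16` are hypothetical names for illustration, not Arrow functions, and the real unpack32 is not a byte-window loop like this; the point is only that the first path needs a temporary 32-bit buffer and a second pass over the data.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar LSB-first bit unpacking into any integer output type.
// Hypothetical helper for illustration only; this is not Arrow's unpack32.
template <typename OutT>
void UnpackScalar(const std::vector<uint8_t>& in, OutT* out, size_t num_values,
                  int bit_width) {
  assert(bit_width >= 1 && bit_width <= 15);
  assert(in.size() * 8 >= num_values * static_cast<size_t>(bit_width));
  const uint32_t mask = (1u << bit_width) - 1;
  for (size_t i = 0; i < num_values; ++i) {
    const size_t bit_pos = i * bit_width;
    const size_t byte_idx = bit_pos / 8;
    // A value of up to 15 bits starting at bit offset <= 7 spans at most 3 bytes.
    uint32_t window = 0;
    for (size_t b = 0; b < 3 && byte_idx + b < in.size(); ++b) {
      window |= static_cast<uint32_t>(in[byte_idx + b]) << (8 * b);
    }
    out[i] = static_cast<OutT>((window >> (bit_pos % 8)) & mask);
  }
}

// Path 1: unpack into a temporary 32-bit buffer, then narrow to 16 bits.
std::vector<int16_t> DecodeViaUnpack32(const std::vector<uint8_t>& packed,
                                       size_t num_values, int bit_width) {
  std::vector<uint32_t> wide(num_values);
  UnpackScalar<uint32_t>(packed, wide.data(), num_values, bit_width);
  std::vector<int16_t> levels(num_values);
  for (size_t i = 0; i < num_values; ++i) {
    levels[i] = static_cast<int16_t>(wide[i]);  // the extra narrowing pass
  }
  return levels;
}

// Path 2: unpack straight into 16-bit levels, with no temporary buffer.
std::vector<int16_t> DecodeDirect16(const std::vector<uint8_t>& packed,
                                    size_t num_values, int bit_width) {
  std::vector<int16_t> levels(num_values);
  UnpackScalar<int16_t>(packed, levels.data(), num_values, bit_width);
  return levels;
}
```

Both functions return the same levels; for example the single byte 0xE4 at bit width 2 decodes to 0, 1, 2, 3 either way.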

Currently, an RleDecoder ( https://github.com/apache/arrow/blob/dc2c5c66f5234a92169da76613399135786dbffb/cpp/src/arrow/util/rle_encoding.h#L88 ) is used. (Maybe I should add a benchmark for that? See: https://github.com/apache/arrow/issues/39630 ). I tried to optimize it, but found that hard to do. The current bottleneck might be the "find-each-run" step, which is hard for the CPU pipeline to predict.
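For readers unfamiliar with the "find-each-run" step, the following is a simplified sketch of a hybrid RLE / bit-packed level decoder, written from my reading of the Parquet encoding spec and not copied from Arrow's RleDecoder. `ReadVarint` and `DecodeLevels` are hypothetical names, and the bit width is restricted to 1..8 to keep it short. The branch on the run header is data-dependent, which is one plausible reason short runs are hard on the pipeline.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reads an unsigned LEB128 varint; returns false on truncated input.
bool ReadVarint(const std::vector<uint8_t>& buf, size_t* pos, uint32_t* out) {
  uint32_t result = 0;
  for (int shift = 0; shift < 35; shift += 7) {
    if (*pos >= buf.size()) return false;
    const uint8_t byte = buf[(*pos)++];
    result |= static_cast<uint32_t>(byte & 0x7F) << shift;
    if ((byte & 0x80) == 0) {
      *out = result;
      return true;
    }
  }
  return false;
}

// Decodes up to num_values levels; stops early on truncated input.
std::vector<int16_t> DecodeLevels(const std::vector<uint8_t>& buf,
                                  int bit_width, size_t num_values) {
  assert(bit_width >= 1 && bit_width <= 8);  // simplification for this sketch
  const uint32_t mask = (1u << bit_width) - 1;
  std::vector<int16_t> out;
  out.reserve(num_values);
  size_t pos = 0;
  while (out.size() < num_values) {
    uint32_t header = 0;
    if (!ReadVarint(buf, &pos, &header)) break;
    if (header & 1) {
      // Bit-packed run: (header >> 1) groups of 8 values, LSB first.
      const size_t count = static_cast<size_t>(header >> 1) * 8;
      const size_t num_bytes = static_cast<size_t>(header >> 1) * bit_width;
      if (pos + num_bytes > buf.size()) break;
      for (size_t i = 0; i < count && out.size() < num_values; ++i) {
        const size_t bit_pos = i * bit_width;
        const size_t byte_idx = pos + bit_pos / 8;
        uint32_t window = buf[byte_idx];
        if (byte_idx + 1 < pos + num_bytes) {
          window |= static_cast<uint32_t>(buf[byte_idx + 1]) << 8;
        }
        out.push_back(static_cast<int16_t>((window >> (bit_pos % 8)) & mask));
      }
      pos += num_bytes;
    } else {
      // RLE run: (header >> 1) copies of one value stored in a single byte.
      const size_t count = header >> 1;
      if (pos >= buf.size()) break;
      const int16_t value = static_cast<int16_t>(buf[pos++] & mask);
      for (size_t i = 0; i < count && out.size() < num_values; ++i) {
        out.push_back(value);
      }
    }
  }
  return out;
}
```

With bit width 1, the bytes 0x0A 0x01 0x03 0xAA decode as an RLE run of five 1s followed by one bit-packed group 0,1,0,1,0,1,0,1.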

We can try different ways to optimize level decoding:

  1. Try a native unpack16 for levels. AVX2 and BMI2 could be put to full use. The link ( https://github.com/apache/arrow/issues/39227#issuecomment-1975256603 ) shows that BMI2 can optimize unpack8. Maybe we can try unpack16 the same way.
  2. Try auto-vectorization
  3. Other RLE optimizations
  4. Try to use uint8_t as the in-memory level type
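As a rough illustration of items 1 and 4 together, the sketch below unpacks groups of 8 levels into uint8_t with a single BMI2 `_pdep_u64` per group. It is written in the spirit of the BMI2 approach in the linked comment, not taken from it. `UnpackLevels8`, `Unpack8Pdep`, `Unpack8Scalar`, `HasBmi2` and the `LEVELS_HAVE_BMI2_PATH` macro are hypothetical names. The BMI2 path assumes GCC or Clang on x86-64 (target attribute plus `__builtin_cpu_supports`); everywhere else the scalar fallback runs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define LEVELS_HAVE_BMI2_PATH 1
#endif

// Portable fallback: extract 8 values of bit_width bits each, LSB first.
inline void Unpack8Scalar(uint64_t packed, int bit_width, uint8_t* out) {
  const uint64_t mask = (uint64_t{1} << bit_width) - 1;
  for (int i = 0; i < 8; ++i) {
    out[i] = static_cast<uint8_t>((packed >> (i * bit_width)) & mask);
  }
}

#ifdef LEVELS_HAVE_BMI2_PATH
// BMI2 path: pdep deposits consecutive source bits into the low bit_width
// bits of every byte lane, so one instruction spreads all 8 values.
__attribute__((target("bmi2")))
inline void Unpack8Pdep(uint64_t packed, int bit_width, uint8_t* out) {
  const uint64_t lane_mask =
      0x0101010101010101ULL * ((uint64_t{1} << bit_width) - 1);
  const uint64_t spread = _pdep_u64(packed, lane_mask);
  std::memcpy(out, &spread, 8);  // x86 is little-endian: byte i holds value i
}
#endif

static bool HasBmi2() {
#ifdef LEVELS_HAVE_BMI2_PATH
  return __builtin_cpu_supports("bmi2") != 0;
#else
  return false;
#endif
}

// Unpacks num_groups groups of 8 levels; each group occupies bit_width bytes.
void UnpackLevels8(const uint8_t* in, size_t in_size, int bit_width,
                   size_t num_groups, uint8_t* out) {
  assert(bit_width >= 1 && bit_width <= 8);
  assert(in_size >= num_groups * static_cast<size_t>(bit_width));
  const bool use_bmi2 = HasBmi2();
  for (size_t g = 0; g < num_groups; ++g) {
    // Assemble the group's bytes into one little-endian 64-bit word.
    uint64_t packed = 0;
    for (int b = 0; b < bit_width; ++b) {
      packed |= static_cast<uint64_t>(in[g * bit_width + b]) << (8 * b);
    }
    uint8_t* dst = out + g * 8;
    if (use_bmi2) {
#ifdef LEVELS_HAVE_BMI2_PATH
      Unpack8Pdep(packed, bit_width, dst);
#endif
    } else {
      Unpack8Scalar(packed, bit_width, dst);
    }
  }
}
```

For example, the three bytes 0x88 0xC6 0xFA at bit width 3 unpack to 0, 1, 2, 3, 4, 5, 6, 7 on both paths. A native unpack16 would additionally need to widen the bytes to 16-bit lanes, which this sketch does not attempt.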

Component(s)

C++, Parquet

mapleFU commented 8 months ago

For SIMD decoding, I've read some materials:

  1. Currently some implementations (including Arrow's unpack32) use methods similar to those in the paper "SIMD-Scan: Ultra Fast in-Memory Table Scan using on Chip Vector Processing Units". I might investigate this for int8 and int16.
  2. https://github.com/lemire/LittleIntPacker/blob/8777f574a5ab3c653881371819383c986292843c/src/bmipacking32.c#L2169 LittleIntPacker says that for small integer bit widths, BMI2 + SIMD conversion might be a good approach. Velox also uses this method, and I think some other decoding paths would also do better with it.

mapleFU commented 7 months ago

Experimental: I've tested unpack16 here: https://github.com/apache/arrow/commit/8a535943b2a875331a7f2c0b79f2b1907ba4252a . Results with unpack16:

ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1              2465 ns         2466 ns       281572 bytes_per_second=6.11536G/s items_per_second=3.28316G/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7              8089 ns         8087 ns        86160 bytes_per_second=1.86483G/s items_per_second=1001.17M/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1024           1107 ns         1107 ns       632608 bytes_per_second=13.6192G/s items_per_second=7.31173G/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1              1990 ns         1982 ns       351679 bytes_per_second=7.60825G/s items_per_second=4.08465G/s
ReadLevels_Rle/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1              2076 ns         2073 ns       337611 bytes_per_second=7.27283G/s items_per_second=3.90457G/s
ReadLevels_Rle/MaxLevel:3/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1              2040 ns         2039 ns       331507 bytes_per_second=7.39673G/s items_per_second=3.97109G/s
ReadLevels_Rle/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7              6757 ns         6753 ns       101800 bytes_per_second=2.23295G/s items_per_second=1.19881G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1          1464 ns         1465 ns       478498 bytes_per_second=10.2958G/s items_per_second=5.5275G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7          1505 ns         1496 ns       479725 bytes_per_second=10.0798G/s items_per_second=5.41153G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1024       1611 ns         1547 ns       444939 bytes_per_second=9.75007G/s items_per_second=5.23453G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1          1500 ns         1497 ns       451485 bytes_per_second=10.0736G/s items_per_second=5.40825G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1          1858 ns         1848 ns       383159 bytes_per_second=8.1605G/s items_per_second=4.38114G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1          1907 ns         1865 ns       380770 bytes_per_second=8.08728G/s items_per_second=4.34183G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7          1826 ns         1824 ns       387361 bytes_per_second=8.26649G/s items_per_second=4.43804G/s

Before (using unpack32):

ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1              2989 ns         2989 ns       234187 bytes_per_second=5.04491G/s items_per_second=2.70847G/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7              8456 ns         8455 ns        83862 bytes_per_second=1.78358G/s items_per_second=957.552M/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1024           1128 ns         1124 ns       612445 bytes_per_second=13.4204G/s items_per_second=7.20505G/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1              2374 ns         2374 ns       284599 bytes_per_second=6.3533G/s items_per_second=3.4109G/s
ReadLevels_Rle/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1              2349 ns         2351 ns       299245 bytes_per_second=6.4148G/s items_per_second=3.44392G/s
ReadLevels_Rle/MaxLevel:3/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1              2124 ns         2124 ns       326544 bytes_per_second=7.09917G/s items_per_second=3.81134G/s
ReadLevels_Rle/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7              7704 ns         7692 ns        92346 bytes_per_second=1.96045G/s items_per_second=1052.51M/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1          1686 ns         1687 ns       396268 bytes_per_second=8.93824G/s items_per_second=4.79868G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7          1759 ns         1757 ns       411404 bytes_per_second=8.58509G/s items_per_second=4.60908G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1024       1708 ns         1710 ns       404015 bytes_per_second=8.82047G/s items_per_second=4.73545G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1          1660 ns         1662 ns       413912 bytes_per_second=9.0731G/s items_per_second=4.87108G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1          1907 ns         1882 ns       386860 bytes_per_second=8.01277G/s items_per_second=4.30182G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1          1752 ns         1752 ns       397436 bytes_per_second=8.60686G/s items_per_second=4.62077G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7          1804 ns         1804 ns       385000 bytes_per_second=8.36066G/s items_per_second=4.48859G/s

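The performance improvement is not large enough (at most about 18% less time in the runs above, and two of the BitPack cases got slightly slower), so I'll record it here and pick it up again if we need it.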

cc @pitrou @wgtmac

mapleFU commented 7 months ago

After adding __restrict__ to the function, it gets a bit faster (but the gain is still not large):
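For context on why the qualifier can matter, here is a minimal sketch, not the code from the commit. `UnpackBits1` and the `LEVELS_RESTRICT` macro are hypothetical. Since `uint8_t` is typically a character type, the compiler otherwise has to assume that stores through `out` might modify `in`; the qualifier promises they do not overlap, which leaves it freer to keep loads in registers and vectorize. Whether that is what helped in the numbers here is an assumption on my part.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// __restrict__ is a GCC/Clang extension; MSVC spells it __restrict.
#if defined(_MSC_VER)
#define LEVELS_RESTRICT __restrict
#else
#define LEVELS_RESTRICT __restrict__
#endif

// Unpacks 1-bit levels (LSB first) into 16-bit values.
// The caller must guarantee that `in` and `out` do not overlap.
void UnpackBits1(const uint8_t* LEVELS_RESTRICT in,
                 int16_t* LEVELS_RESTRICT out, size_t num_values) {
  assert(in != nullptr && out != nullptr);
  for (size_t i = 0; i < num_values; ++i) {
    out[i] = static_cast<int16_t>((in[i / 8] >> (i % 8)) & 1);
  }
}
```

For example, the byte 0xA5 unpacks to 1, 0, 1, 0, 0, 1, 0, 1.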

ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1              2387 ns         2386 ns       292939 bytes_per_second=6.32071G/s items_per_second=3.39341G/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7              8246 ns         8215 ns        86704 bytes_per_second=1.8356G/s items_per_second=985.479M/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1024           1160 ns         1134 ns       607143 bytes_per_second=13.294G/s items_per_second=7.13718G/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1              1939 ns         1942 ns       360373 bytes_per_second=7.76322G/s items_per_second=4.16785G/s
ReadLevels_Rle/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1              2040 ns         2037 ns       345828 bytes_per_second=7.40175G/s items_per_second=3.97378G/s
ReadLevels_Rle/MaxLevel:3/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1              1980 ns         1983 ns       348470 bytes_per_second=7.60651G/s items_per_second=4.08371G/s
ReadLevels_Rle/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7              6990 ns         6980 ns       102252 bytes_per_second=2.16034G/s items_per_second=1.15983G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1          1426 ns         1429 ns       491245 bytes_per_second=10.5547G/s items_per_second=5.66653G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7          1419 ns         1423 ns       489220 bytes_per_second=10.5941G/s items_per_second=5.68766G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1024       1429 ns         1433 ns       489309 bytes_per_second=10.5259G/s items_per_second=5.65103G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1          1429 ns         1431 ns       493124 bytes_per_second=10.5379G/s items_per_second=5.65749G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1          1787 ns         1789 ns       367279 bytes_per_second=8.42841G/s items_per_second=4.52497G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1          1760 ns         1764 ns       394862 bytes_per_second=8.54821G/s items_per_second=4.58928G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7          1807 ns         1808 ns       390261 bytes_per_second=8.34108G/s items_per_second=4.47808G/s

mapleFU commented 7 months ago

ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1              6392 ns         6397 ns       110245 bytes_per_second=2.35721Gi/s items_per_second=1.26552G/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7              9655 ns         9660 ns        73389 bytes_per_second=1.56112Gi/s items_per_second=838.118M/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1024            815 ns          816 ns       842744 bytes_per_second=18.4773Gi/s items_per_second=9.91994G/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1              6261 ns         6265 ns       111711 bytes_per_second=2.40713Gi/s items_per_second=1.29232G/s
ReadLevels_Rle/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1              5864 ns         5868 ns       121156 bytes_per_second=2.56974Gi/s items_per_second=1.37962G/s
ReadLevels_Rle/MaxLevel:3/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1              5834 ns         5838 ns       121253 bytes_per_second=2.58328Gi/s items_per_second=1.38689G/s
ReadLevels_Rle/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7             10310 ns        10314 ns        67700 bytes_per_second=1.46203Gi/s items_per_second=784.92M/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1          5650 ns         5654 ns       122543 bytes_per_second=2.66698Gi/s items_per_second=1.43182G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7          5708 ns         5712 ns       123613 bytes_per_second=2.64013Gi/s items_per_second=1.41741G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1024       5673 ns         5677 ns       122896 bytes_per_second=2.65629Gi/s items_per_second=1.42608G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1          5572 ns         5576 ns       125720 bytes_per_second=2.70428Gi/s items_per_second=1.45185G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1          5612 ns         5615 ns       128266 bytes_per_second=2.68549Gi/s items_per_second=1.44176G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1          5511 ns         5516 ns       125777 bytes_per_second=2.73406Gi/s items_per_second=1.46784G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7          5626 ns         5630 ns       127101 bytes_per_second=2.67841Gi/s items_per_second=1.43796G/s

AVX2 doesn't have __m256i _mm256_sllv_epi16 (__m256i a, __m256i count); it was only introduced with AVX-512 (AVX512BW + AVX512VL), so this version is even slower... So a shift-based approach for int16 doesn't work without AVX-512 :-(

mapleFU commented 7 months ago

# --- demo128
# compressing 128 integers
# format: bit width, pack in cycles per int, unpack in cycles per int
rdtsc_overhead set to 42
1       1.14    1.55
2       1.12    1.50
3       1.19    1.55
4       0.88    1.92
5       1.11    1.70
6       1.20    1.67
7       1.34    1.70
8       1.03    1.91
9       1.53    1.70
10      1.58    1.73
11      1.67    1.73
12      1.62    1.97
13      1.73    1.83
14      1.75    1.89
15      1.62    1.91
16      0.41    1.95
17      1.86    1.94
18      1.83    2.02
19      1.91    1.97
20      1.75    2.19
21      1.94    2.14
22      1.69    2.14
23      1.78    2.19
24      1.34    2.20
25      1.70    2.36
26      1.73    2.36
27      1.75    2.34
28      1.58    2.62
29      1.78    2.20
30      1.86    2.31
31      1.97    2.31
32      0.36    0.34

# --- turbodemo128
# compressing 128 integers
# format: bit width, pack in cycles per int, unpack in cycles per int
1       1.00    1.92
2       1.00    1.56
3       1.00    1.55
4       1.36    1.53
5       1.00    1.59
6       1.00    1.38
7       1.08    1.70
8       1.14    1.91
9       1.28    1.92
10      1.23    1.72
11      1.33    1.94
12      1.30    1.67
13      1.33    1.95
14      1.33    1.73
15      1.38    1.95
16      1.14    1.91
17      1.42    1.91
18      1.31    1.70
19      1.42    1.91
20      1.33    1.72
21      1.50    1.92
22      1.56    1.75
23      1.69    1.95
24      1.39    1.95
25      1.83    1.95
26      1.81    1.83
27      1.78    1.95
28      1.62    1.88
29      1.92    1.97
30      1.94    1.92
31      1.97    2.02
32      0.36    1.77

# --- bmidemo128
# compressing 128 integers
# format: bit width, pack in cycles per int, unpack in cycles per int
1       1.00    0.47
2       1.00    0.45
3       1.00    0.50
4       1.36    0.45
5       1.00    0.53
6       1.00    1.36
7       1.08    1.70
8       1.14    1.91
9       1.28    1.94
10      1.23    1.72
11      1.33    1.94
12      1.30    1.67
13      1.33    1.95
14      1.33    1.73
15      1.38    1.95
16      1.14    1.91
17      1.42    1.92
18      1.31    1.69
19      1.42    1.91
20      1.33    1.70
21      1.50    1.94
22      1.56    1.75
23      1.69    1.95
24      1.39    1.94
25      1.83    1.94
26      1.81    1.83
27      1.78    1.95
28      1.62    1.88
29      1.92    1.97
30      1.94    1.92
31      1.97    2.03
32      0.36    1.75

# --- horizontaldemo128
# compressing 128 integers
# format: bit width, pack in cycles per int, unpack in cycles per int
1       1.16    0.66
2       1.12    0.66
3       1.19    0.66
4       0.88    0.69
5       1.11    0.72
6       1.20    0.70
7       1.34    0.67
8       1.03    0.34
9       1.52    0.69
10      1.56    0.67
11      1.67    0.67
12      1.62    0.69
13      1.73    0.70
14      1.75    0.69
15      1.62    0.67
16      0.41    0.34
17      1.86    0.69
18      1.83    0.69
19      1.91    0.70
20      1.75    0.69
21      1.94    0.70
22      1.67    0.70
23      1.78    0.69
24      1.34    0.34
25      1.72    0.69
26      1.73    0.69
27      1.75    0.89
28      1.58    0.69
29      1.78    1.03
30      1.86    1.09
31      1.97    1.05
32      0.36    0.33

# --- scdemo128
# compressing 128 integers
# format: bit width, pack in cycles per int, unpack in cycles per int
1       1.00    1.14
2       1.00    0.83
3       1.00    1.11
4       1.34    0.84
5       1.00    1.19
6       1.00    1.20
7       1.08    1.23
8       1.14    1.14
9       1.28    1.27
10      1.23    1.28
11      1.33    1.31
12      1.30    1.28
13      1.33    1.34
14      1.33    1.34
15      1.38    1.38
16      1.12    1.12
17      1.42    1.39
18      1.31    1.41
19      1.42    1.44
20      1.33    1.42
21      1.50    1.50
22      1.55    1.50
23      1.69    1.55
24      1.39    1.44
25      1.81    1.58
26      1.83    1.58
27      1.78    1.61
28      1.62    1.59
29      1.91    1.47
30      1.92    1.67
31      1.97    1.70
32      0.36    0.36

These are LittleIntPacker's benchmarks, run on an Intel Xeon. BMI2 seems faster for unpacking (the author only implemented bit widths 1-5 for BMI2).

mapleFU commented 7 months ago

I wrote a naive BMI2 implementation. Results on an Intel Xeon:

ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1              4729 ns         4734 ns       149748 bytes_per_second=3.18578Gi/s items_per_second=1.71035G/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7             14434 ns        14453 ns        48206 bytes_per_second=1.04339Gi/s items_per_second=560.166M/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1024           2763 ns         2769 ns       251599 bytes_per_second=5.44634Gi/s items_per_second=2.92398G/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1              4311 ns         4316 ns       162065 bytes_per_second=3.49358Gi/s items_per_second=1.8756G/s
ReadLevels_Rle/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1              4122 ns         4125 ns       169853 bytes_per_second=3.65559Gi/s items_per_second=1.96258G/s
ReadLevels_Rle/MaxLevel:3/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1              3886 ns         3888 ns       179847 bytes_per_second=3.8783Gi/s items_per_second=2.08215G/s
ReadLevels_Rle/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7             13068 ns        13080 ns        53376 bytes_per_second=1.15291Gi/s items_per_second=618.963M/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1          3742 ns         3748 ns       186340 bytes_per_second=4.02357Gi/s items_per_second=2.16014G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7          3745 ns         3750 ns       186374 bytes_per_second=4.02114Gi/s items_per_second=2.15883G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1024       3742 ns         3747 ns       186728 bytes_per_second=4.02498Gi/s items_per_second=2.1609G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1          3424 ns         3429 ns       204069 bytes_per_second=4.39745Gi/s items_per_second=2.36086G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1          3499 ns         3504 ns       199696 bytes_per_second=4.30356Gi/s items_per_second=2.31045G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1          3311 ns         3315 ns       208379 bytes_per_second=4.54882Gi/s items_per_second=2.44213G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7          3499 ns         3506 ns       199599 bytes_per_second=4.3016Gi/s items_per_second=2.3094G/s

Before:

ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1              5534 ns         5541 ns       124334 bytes_per_second=2.7213Gi/s items_per_second=1.46099G/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7             16683 ns        16703 ns        42139 bytes_per_second=924.523Mi/s items_per_second=484.716M/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1024           2594 ns         2597 ns       269554 bytes_per_second=5.80692Gi/s items_per_second=3.11757G/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1              4980 ns         4985 ns       140579 bytes_per_second=3.02534Gi/s items_per_second=1.62421G/s
ReadLevels_Rle/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1              4782 ns         4786 ns       146402 bytes_per_second=3.15082Gi/s items_per_second=1.69158G/s
ReadLevels_Rle/MaxLevel:3/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1              4397 ns         4402 ns       159011 bytes_per_second=3.42562Gi/s items_per_second=1.83912G/s
ReadLevels_Rle/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7             15721 ns        15736 ns        44453 bytes_per_second=981.292Mi/s items_per_second=514.48M/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1          3272 ns         3274 ns       213994 bytes_per_second=4.60583Gi/s items_per_second=2.47273G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7          3272 ns         3273 ns       213769 bytes_per_second=4.60763Gi/s items_per_second=2.4737G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1024       3272 ns         3272 ns       213713 bytes_per_second=4.60907Gi/s items_per_second=2.47447G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1          3270 ns         3273 ns       213934 bytes_per_second=4.60708Gi/s items_per_second=2.47341G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1          3314 ns         3321 ns       210344 bytes_per_second=4.5405Gi/s items_per_second=2.43766G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1          3273 ns         3279 ns       213189 bytes_per_second=4.5986Gi/s items_per_second=2.46885G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7          3310 ns         3318 ns       211121 bytes_per_second=4.54534Gi/s items_per_second=2.44026G/s

In the RLE ReadLevels scenario, performance improves, but for BitPack it even gets slower. I guess it could benefit performance when the number of inputs is small.

mapleFU commented 7 months ago

TBD: I'll test auto-vectorization like arrow-rs and Impala, and investigate how to make full use of BMI2, since it seems to work well when the number of inputs is small.
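As a starting point for the auto-vectorization experiment, here is a sketch of a loop shape that compilers tend to handle well: the bit width is a template parameter, so the masks, shifts and trip counts are compile-time constants. I have not checked how arrow-rs or Impala structure their loops; `UnpackFixedWidth` and `UnpackLevels` are hypothetical names, and whether a given compiler actually vectorizes this would need to be confirmed from its optimization report or the generated assembly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Unpacks num_groups groups of 8 levels. With kBitWidth known at compile
// time, both inner loops have constant trip counts and constant shifts.
template <int kBitWidth>
void UnpackFixedWidth(const uint8_t* in, int16_t* out, size_t num_groups) {
  static_assert(kBitWidth >= 1 && kBitWidth <= 8, "sketch covers 1..8 bits");
  constexpr uint64_t kMask = (uint64_t{1} << kBitWidth) - 1;
  for (size_t g = 0; g < num_groups; ++g) {
    // 8 values of kBitWidth bits occupy exactly kBitWidth bytes.
    uint64_t word = 0;
    for (int b = 0; b < kBitWidth; ++b) {
      word |= static_cast<uint64_t>(in[g * kBitWidth + b]) << (8 * b);
    }
    for (int i = 0; i < 8; ++i) {
      out[g * 8 + i] = static_cast<int16_t>((word >> (i * kBitWidth)) & kMask);
    }
  }
}

// Runtime dispatch to the compile-time specializations.
void UnpackLevels(const uint8_t* in, int16_t* out, size_t num_groups,
                  int bit_width) {
  switch (bit_width) {
    case 1: UnpackFixedWidth<1>(in, out, num_groups); break;
    case 2: UnpackFixedWidth<2>(in, out, num_groups); break;
    case 3: UnpackFixedWidth<3>(in, out, num_groups); break;
    case 4: UnpackFixedWidth<4>(in, out, num_groups); break;
    case 5: UnpackFixedWidth<5>(in, out, num_groups); break;
    case 6: UnpackFixedWidth<6>(in, out, num_groups); break;
    case 7: UnpackFixedWidth<7>(in, out, num_groups); break;
    case 8: UnpackFixedWidth<8>(in, out, num_groups); break;
    default: assert(false && "unsupported bit width in this sketch");
  }
}
```

The dispatch happens once per batch, so the per-value work stays branch-free inside each specialization.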