Open mapleFU opened 8 months ago
For SIMD decoding, I've read some materials:
Experimental: I've tested unpack16 here: https://github.com/apache/arrow/commit/8a535943b2a875331a7f2c0b79f2b1907ba4252a .
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1 2465 ns 2466 ns 281572 bytes_per_second=6.11536G/s items_per_second=3.28316G/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7 8089 ns 8087 ns 86160 bytes_per_second=1.86483G/s items_per_second=1001.17M/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1024 1107 ns 1107 ns 632608 bytes_per_second=13.6192G/s items_per_second=7.31173G/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1 1990 ns 1982 ns 351679 bytes_per_second=7.60825G/s items_per_second=4.08465G/s
ReadLevels_Rle/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1 2076 ns 2073 ns 337611 bytes_per_second=7.27283G/s items_per_second=3.90457G/s
ReadLevels_Rle/MaxLevel:3/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1 2040 ns 2039 ns 331507 bytes_per_second=7.39673G/s items_per_second=3.97109G/s
ReadLevels_Rle/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7 6757 ns 6753 ns 101800 bytes_per_second=2.23295G/s items_per_second=1.19881G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1 1464 ns 1465 ns 478498 bytes_per_second=10.2958G/s items_per_second=5.5275G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7 1505 ns 1496 ns 479725 bytes_per_second=10.0798G/s items_per_second=5.41153G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1024 1611 ns 1547 ns 444939 bytes_per_second=9.75007G/s items_per_second=5.23453G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1 1500 ns 1497 ns 451485 bytes_per_second=10.0736G/s items_per_second=5.40825G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1 1858 ns 1848 ns 383159 bytes_per_second=8.1605G/s items_per_second=4.38114G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1 1907 ns 1865 ns 380770 bytes_per_second=8.08728G/s items_per_second=4.34183G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7 1826 ns 1824 ns 387361 bytes_per_second=8.26649G/s items_per_second=4.43804G/s
Before (using unpack32):
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1 2989 ns 2989 ns 234187 bytes_per_second=5.04491G/s items_per_second=2.70847G/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7 8456 ns 8455 ns 83862 bytes_per_second=1.78358G/s items_per_second=957.552M/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1024 1128 ns 1124 ns 612445 bytes_per_second=13.4204G/s items_per_second=7.20505G/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1 2374 ns 2374 ns 284599 bytes_per_second=6.3533G/s items_per_second=3.4109G/s
ReadLevels_Rle/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1 2349 ns 2351 ns 299245 bytes_per_second=6.4148G/s items_per_second=3.44392G/s
ReadLevels_Rle/MaxLevel:3/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1 2124 ns 2124 ns 326544 bytes_per_second=7.09917G/s items_per_second=3.81134G/s
ReadLevels_Rle/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7 7704 ns 7692 ns 92346 bytes_per_second=1.96045G/s items_per_second=1052.51M/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1 1686 ns 1687 ns 396268 bytes_per_second=8.93824G/s items_per_second=4.79868G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7 1759 ns 1757 ns 411404 bytes_per_second=8.58509G/s items_per_second=4.60908G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1024 1708 ns 1710 ns 404015 bytes_per_second=8.82047G/s items_per_second=4.73545G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1 1660 ns 1662 ns 413912 bytes_per_second=9.0731G/s items_per_second=4.87108G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1 1907 ns 1882 ns 386860 bytes_per_second=8.01277G/s items_per_second=4.30182G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1 1752 ns 1752 ns 397436 bytes_per_second=8.60686G/s items_per_second=4.62077G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7 1804 ns 1804 ns 385000 bytes_per_second=8.36066G/s items_per_second=4.48859G/s
The performance improvement is not large enough, so I'll record it here and pick it up again if we need it.
cc @pitrou @wgtmac
After adding __restrict__ to the function, it gets a bit faster (but the gain is still small):
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1 2387 ns 2386 ns 292939 bytes_per_second=6.32071G/s items_per_second=3.39341G/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7 8246 ns 8215 ns 86704 bytes_per_second=1.8356G/s items_per_second=985.479M/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1024 1160 ns 1134 ns 607143 bytes_per_second=13.294G/s items_per_second=7.13718G/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1 1939 ns 1942 ns 360373 bytes_per_second=7.76322G/s items_per_second=4.16785G/s
ReadLevels_Rle/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1 2040 ns 2037 ns 345828 bytes_per_second=7.40175G/s items_per_second=3.97378G/s
ReadLevels_Rle/MaxLevel:3/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1 1980 ns 1983 ns 348470 bytes_per_second=7.60651G/s items_per_second=4.08371G/s
ReadLevels_Rle/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7 6990 ns 6980 ns 102252 bytes_per_second=2.16034G/s items_per_second=1.15983G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1 1426 ns 1429 ns 491245 bytes_per_second=10.5547G/s items_per_second=5.66653G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7 1419 ns 1423 ns 489220 bytes_per_second=10.5941G/s items_per_second=5.68766G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1024 1429 ns 1433 ns 489309 bytes_per_second=10.5259G/s items_per_second=5.65103G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1 1429 ns 1431 ns 493124 bytes_per_second=10.5379G/s items_per_second=5.65749G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1 1787 ns 1789 ns 367279 bytes_per_second=8.42841G/s items_per_second=4.52497G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1 1760 ns 1764 ns 394862 bytes_per_second=8.54821G/s items_per_second=4.58928G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7 1807 ns 1808 ns 390261 bytes_per_second=8.34108G/s items_per_second=4.47808G/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1 6392 ns 6397 ns 110245 bytes_per_second=2.35721Gi/s items_per_second=1.26552G/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7 9655 ns 9660 ns 73389 bytes_per_second=1.56112Gi/s items_per_second=838.118M/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1024 815 ns 816 ns 842744 bytes_per_second=18.4773Gi/s items_per_second=9.91994G/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1 6261 ns 6265 ns 111711 bytes_per_second=2.40713Gi/s items_per_second=1.29232G/s
ReadLevels_Rle/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1 5864 ns 5868 ns 121156 bytes_per_second=2.56974Gi/s items_per_second=1.37962G/s
ReadLevels_Rle/MaxLevel:3/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1 5834 ns 5838 ns 121253 bytes_per_second=2.58328Gi/s items_per_second=1.38689G/s
ReadLevels_Rle/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7 10310 ns 10314 ns 67700 bytes_per_second=1.46203Gi/s items_per_second=784.92M/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1 5650 ns 5654 ns 122543 bytes_per_second=2.66698Gi/s items_per_second=1.43182G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7 5708 ns 5712 ns 123613 bytes_per_second=2.64013Gi/s items_per_second=1.41741G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1024 5673 ns 5677 ns 122896 bytes_per_second=2.65629Gi/s items_per_second=1.42608G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1 5572 ns 5576 ns 125720 bytes_per_second=2.70428Gi/s items_per_second=1.45185G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1 5612 ns 5615 ns 128266 bytes_per_second=2.68549Gi/s items_per_second=1.44176G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1 5511 ns 5516 ns 125777 bytes_per_second=2.73406Gi/s items_per_second=1.46784G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7 5626 ns 5630 ns 127101 bytes_per_second=2.67841Gi/s items_per_second=1.43796G/s
AVX2 doesn't have __m256i _mm256_sllv_epi16 (__m256i a, __m256i count); it was only introduced with AVX-512, so this version is even slower... So a shift-based approach for int16 doesn't work without AVX-512 :-(
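For reference, one common workaround for the missing per-lane 16-bit shift is to multiply by a power of two instead, since AVX2 does provide _mm256_mullo_epi16 and _mm256_mulhi_epu16. This was not tried in this thread, so treat it as an idea only. The sketch below is a portable scalar model of the identity (the names Lanes, MulLo16, MulHi16, ShiftLeftPerLane and ShiftRightPerLane are made up for illustration). In a real kernel the shift counts are fixed per bit width, so the power-of-two vectors would be precomputed constants.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// A 256-bit register holds 16 lanes of uint16_t.
constexpr int kLanes = 16;
using Lanes = std::array<uint16_t, kLanes>;

// Scalar model of _mm256_mullo_epi16: low 16 bits of each 16x16 product.
Lanes MulLo16(const Lanes& a, const Lanes& b) {
  Lanes out{};
  for (int i = 0; i < kLanes; ++i) {
    out[i] = static_cast<uint16_t>(static_cast<uint32_t>(a[i]) * b[i]);
  }
  return out;
}

// Scalar model of _mm256_mulhi_epu16: high 16 bits of each unsigned product.
Lanes MulHi16(const Lanes& a, const Lanes& b) {
  Lanes out{};
  for (int i = 0; i < kLanes; ++i) {
    out[i] = static_cast<uint16_t>((static_cast<uint32_t>(a[i]) * b[i]) >> 16);
  }
  return out;
}

// x[i] << counts[i], for counts in [0, 15]: multiply by 2^count, keep low half.
Lanes ShiftLeftPerLane(const Lanes& x, const Lanes& counts) {
  Lanes pow2{};
  for (int i = 0; i < kLanes; ++i) {
    assert(counts[i] <= 15);
    pow2[i] = static_cast<uint16_t>(1u << counts[i]);
  }
  return MulLo16(x, pow2);
}

// x[i] >> counts[i], for counts in [1, 16]: multiply by 2^(16 - count), keep
// the high half. A count of 0 is not expressible (2^16 does not fit a lane).
Lanes ShiftRightPerLane(const Lanes& x, const Lanes& counts) {
  Lanes pow2{};
  for (int i = 0; i < kLanes; ++i) {
    assert(counts[i] >= 1 && counts[i] <= 16);
    pow2[i] = static_cast<uint16_t>(1u << (16 - counts[i]));
  }
  return MulHi16(x, pow2);
}
```

Whether two multiplies per step would beat the existing unpack32 path is an open question; it would need the same ReadLevels benchmark as above.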
# --- demo128
# compressing 128 integers
# format: bit width, pack in cycles per int, unpack in cycles per int
rdtsc_overhead set to 42
1 1.14 1.55
2 1.12 1.50
3 1.19 1.55
4 0.88 1.92
5 1.11 1.70
6 1.20 1.67
7 1.34 1.70
8 1.03 1.91
9 1.53 1.70
10 1.58 1.73
11 1.67 1.73
12 1.62 1.97
13 1.73 1.83
14 1.75 1.89
15 1.62 1.91
16 0.41 1.95
17 1.86 1.94
18 1.83 2.02
19 1.91 1.97
20 1.75 2.19
21 1.94 2.14
22 1.69 2.14
23 1.78 2.19
24 1.34 2.20
25 1.70 2.36
26 1.73 2.36
27 1.75 2.34
28 1.58 2.62
29 1.78 2.20
30 1.86 2.31
31 1.97 2.31
32 0.36 0.34
# --- turbodemo128
# compressing 128 integers
# format: bit width, pack in cycles per int, unpack in cycles per int
1 1.00 1.92
2 1.00 1.56
3 1.00 1.55
4 1.36 1.53
5 1.00 1.59
6 1.00 1.38
7 1.08 1.70
8 1.14 1.91
9 1.28 1.92
10 1.23 1.72
11 1.33 1.94
12 1.30 1.67
13 1.33 1.95
14 1.33 1.73
15 1.38 1.95
16 1.14 1.91
17 1.42 1.91
18 1.31 1.70
19 1.42 1.91
20 1.33 1.72
21 1.50 1.92
22 1.56 1.75
23 1.69 1.95
24 1.39 1.95
25 1.83 1.95
26 1.81 1.83
27 1.78 1.95
28 1.62 1.88
29 1.92 1.97
30 1.94 1.92
31 1.97 2.02
32 0.36 1.77
# --- bmidemo128
# compressing 128 integers
# format: bit width, pack in cycles per int, unpack in cycles per int
1 1.00 0.47
2 1.00 0.45
3 1.00 0.50
4 1.36 0.45
5 1.00 0.53
6 1.00 1.36
7 1.08 1.70
8 1.14 1.91
9 1.28 1.94
10 1.23 1.72
11 1.33 1.94
12 1.30 1.67
13 1.33 1.95
14 1.33 1.73
15 1.38 1.95
16 1.14 1.91
17 1.42 1.92
18 1.31 1.69
19 1.42 1.91
20 1.33 1.70
21 1.50 1.94
22 1.56 1.75
23 1.69 1.95
24 1.39 1.94
25 1.83 1.94
26 1.81 1.83
27 1.78 1.95
28 1.62 1.88
29 1.92 1.97
30 1.94 1.92
31 1.97 2.03
32 0.36 1.75
# --- horizontaldemo128
# compressing 128 integers
# format: bit width, pack in cycles per int, unpack in cycles per int
1 1.16 0.66
2 1.12 0.66
3 1.19 0.66
4 0.88 0.69
5 1.11 0.72
6 1.20 0.70
7 1.34 0.67
8 1.03 0.34
9 1.52 0.69
10 1.56 0.67
11 1.67 0.67
12 1.62 0.69
13 1.73 0.70
14 1.75 0.69
15 1.62 0.67
16 0.41 0.34
17 1.86 0.69
18 1.83 0.69
19 1.91 0.70
20 1.75 0.69
21 1.94 0.70
22 1.67 0.70
23 1.78 0.69
24 1.34 0.34
25 1.72 0.69
26 1.73 0.69
27 1.75 0.89
28 1.58 0.69
29 1.78 1.03
30 1.86 1.09
31 1.97 1.05
32 0.36 0.33
# --- scdemo128
# compressing 128 integers
# format: bit width, pack in cycles per int, unpack in cycles per int
1 1.00 1.14
2 1.00 0.83
3 1.00 1.11
4 1.34 0.84
5 1.00 1.19
6 1.00 1.20
7 1.08 1.23
8 1.14 1.14
9 1.28 1.27
10 1.23 1.28
11 1.33 1.31
12 1.30 1.28
13 1.33 1.34
14 1.33 1.34
15 1.38 1.38
16 1.12 1.12
17 1.42 1.39
18 1.31 1.41
19 1.42 1.44
20 1.33 1.42
21 1.50 1.50
22 1.55 1.50
23 1.69 1.55
24 1.39 1.44
25 1.81 1.58
26 1.83 1.58
27 1.78 1.61
28 1.62 1.59
29 1.91 1.47
30 1.92 1.67
31 1.97 1.70
32 0.36 0.36
The numbers above are from running the LittleIntPacker benchmarks on an Intel Xeon. bmi2 seems to be faster (the author only implemented bit widths 1-5 for bmi2).
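To make the bmi2 idea concrete: the BMI2 pdep instruction (_pdep_u64) can spread 8 packed values into 8 separate bytes in one step, which is why it looks so good for small bit widths. The sketch below is not the code from LittleIntPacker or from my experiment; SoftPdep is a hypothetical portable stand-in for _pdep_u64 so the example runs anywhere, and UnpackEightValues is an illustrative name.

```cpp
#include <cassert>
#include <cstdint>

// Portable stand-in for the BMI2 instruction pdep (_pdep_u64): deposits the
// low bits of src, in order, into the positions of the set bits of mask.
uint64_t SoftPdep(uint64_t src, uint64_t mask) {
  uint64_t result = 0;
  for (uint64_t bit = 1; mask != 0; bit <<= 1) {
    const uint64_t lowest = mask & (~mask + 1);  // lowest set bit of mask
    if (src & bit) result |= lowest;
    mask &= mask - 1;  // clear that bit
  }
  return result;
}

// Unpacks 8 values of bit_width bits each (LSB-first packing, bit_width in
// [1, 8]) from bit_width input bytes into 8 output bytes.
void UnpackEightValues(const uint8_t* in, int bit_width, uint8_t* out) {
  assert(bit_width >= 1 && bit_width <= 8);
  // Load the packed bytes as a little-endian integer.
  uint64_t packed = 0;
  for (int i = 0; i < bit_width; ++i) {
    packed |= static_cast<uint64_t>(in[i]) << (8 * i);
  }
  // One group of bit_width set bits at the bottom of every byte.
  const uint64_t mask = 0x0101010101010101ULL * ((1u << bit_width) - 1);
  // With BMI2 this would be a single _pdep_u64(packed, mask).
  const uint64_t spread = SoftPdep(packed, mask);
  for (int i = 0; i < 8; ++i) {
    out[i] = static_cast<uint8_t>(spread >> (8 * i));
  }
}
```

For example, the values 0 to 7 at bit width 3 are packed as the bytes 0x88 0xC6 0xFA, and UnpackEightValues turns them back into 0, 1, ..., 7. Levels are int16_t in Arrow, so a real kernel would still need a widening step after this.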
I wrote a naive bmi2 impl. On the Intel Xeon:
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1 4729 ns 4734 ns 149748 bytes_per_second=3.18578Gi/s items_per_second=1.71035G/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7 14434 ns 14453 ns 48206 bytes_per_second=1.04339Gi/s items_per_second=560.166M/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1024 2763 ns 2769 ns 251599 bytes_per_second=5.44634Gi/s items_per_second=2.92398G/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1 4311 ns 4316 ns 162065 bytes_per_second=3.49358Gi/s items_per_second=1.8756G/s
ReadLevels_Rle/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1 4122 ns 4125 ns 169853 bytes_per_second=3.65559Gi/s items_per_second=1.96258G/s
ReadLevels_Rle/MaxLevel:3/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1 3886 ns 3888 ns 179847 bytes_per_second=3.8783Gi/s items_per_second=2.08215G/s
ReadLevels_Rle/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7 13068 ns 13080 ns 53376 bytes_per_second=1.15291Gi/s items_per_second=618.963M/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1 3742 ns 3748 ns 186340 bytes_per_second=4.02357Gi/s items_per_second=2.16014G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7 3745 ns 3750 ns 186374 bytes_per_second=4.02114Gi/s items_per_second=2.15883G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1024 3742 ns 3747 ns 186728 bytes_per_second=4.02498Gi/s items_per_second=2.1609G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1 3424 ns 3429 ns 204069 bytes_per_second=4.39745Gi/s items_per_second=2.36086G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1 3499 ns 3504 ns 199696 bytes_per_second=4.30356Gi/s items_per_second=2.31045G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1 3311 ns 3315 ns 208379 bytes_per_second=4.54882Gi/s items_per_second=2.44213G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7 3499 ns 3506 ns 199599 bytes_per_second=4.3016Gi/s items_per_second=2.3094G/s
Before:
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1 5534 ns 5541 ns 124334 bytes_per_second=2.7213Gi/s items_per_second=1.46099G/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7 16683 ns 16703 ns 42139 bytes_per_second=924.523Mi/s items_per_second=484.716M/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1024 2594 ns 2597 ns 269554 bytes_per_second=5.80692Gi/s items_per_second=3.11757G/s
ReadLevels_Rle/MaxLevel:1/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1 4980 ns 4985 ns 140579 bytes_per_second=3.02534Gi/s items_per_second=1.62421G/s
ReadLevels_Rle/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1 4782 ns 4786 ns 146402 bytes_per_second=3.15082Gi/s items_per_second=1.69158G/s
ReadLevels_Rle/MaxLevel:3/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1 4397 ns 4402 ns 159011 bytes_per_second=3.42562Gi/s items_per_second=1.83912G/s
ReadLevels_Rle/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7 15721 ns 15736 ns 44453 bytes_per_second=981.292Mi/s items_per_second=514.48M/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1 3272 ns 3274 ns 213994 bytes_per_second=4.60583Gi/s items_per_second=2.47273G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7 3272 ns 3273 ns 213769 bytes_per_second=4.60763Gi/s items_per_second=2.4737G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1024 3272 ns 3272 ns 213713 bytes_per_second=4.60907Gi/s items_per_second=2.47447G/s
ReadLevels_BitPack/MaxLevel:1/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1 3270 ns 3273 ns 213934 bytes_per_second=4.60708Gi/s items_per_second=2.47341G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:1 3314 ns 3321 ns 210344 bytes_per_second=4.5405Gi/s items_per_second=2.43766G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:2048/LevelRepeatCount:1 3273 ns 3279 ns 213189 bytes_per_second=4.5986Gi/s items_per_second=2.46885G/s
ReadLevels_BitPack/MaxLevel:3/NumLevels:8096/BatchSize:1024/LevelRepeatCount:7 3310 ns 3318 ns 211121 bytes_per_second=4.54534Gi/s items_per_second=2.44026G/s
In the RLE ReadLevels scenario, performance mostly improves, but for BitPack it actually gets slower. I guess bmi2 could benefit performance when the number of input values is small.
TBD: I'll test auto-vectorization like arrow-rs and Impala, and investigate how to make full use of bmi2, since it seems to work well when the number of input values is small.
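A rough sketch of what an auto-vectorization-friendly unpack16 could look like: the bit width is a template parameter and the trip count is fixed, so every offset and shift is a compile-time constant once the loop is unrolled. This is not taken from arrow-rs or Impala; the name Unpack32Values and the padded local buffer are assumptions made for illustration, and whether the compiler actually vectorizes it depends on the compiler and flags.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Unpacks 32 values of kBitWidth bits each (LSB-first packing) from
// 4 * kBitWidth input bytes directly into 16-bit outputs.
template <int kBitWidth>
void Unpack32Values(const uint8_t* in, uint16_t* out) {
  static_assert(kBitWidth >= 1 && kBitWidth <= 16, "levels fit in 16 bits");
  constexpr uint32_t kMask = (1u << kBitWidth) - 1;
  constexpr int kInputBytes = 4 * kBitWidth;

  // Copy into a zero-padded buffer so the 3-byte reads below never run past
  // the end of the caller's input.
  uint8_t buf[kInputBytes + 4] = {};
  std::memcpy(buf, in, kInputBytes);

  for (int i = 0; i < 32; ++i) {
    const int bit_offset = i * kBitWidth;
    const int byte_offset = bit_offset / 8;
    const int shift = bit_offset % 8;
    // shift <= 7 and kBitWidth <= 16, so a value spans at most 3 bytes.
    const uint32_t word = static_cast<uint32_t>(buf[byte_offset]) |
                          static_cast<uint32_t>(buf[byte_offset + 1]) << 8 |
                          static_cast<uint32_t>(buf[byte_offset + 2]) << 16;
    out[i] = static_cast<uint16_t>((word >> shift) & kMask);
  }
}
```

A caller would pick the instantiation with a switch on the runtime bit width, e.g. Unpack32Values<1> for MaxLevel:1 and Unpack32Values<2> for MaxLevel:3.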
Describe the enhancement requested
Parquet C++ uses RLE encoding for the repetition and definition levels. A benchmark can be seen here: https://github.com/apache/arrow/pull/39705#issuecomment-1921637432
The benchmark shows that RLE is a bit slower than BitPack. Besides, BitPack is also slow here; see https://github.com/apache/arrow/issues/39227#issue-2041139028 . Arrow doesn't have a native unpack16, so it falls back to unpack32 and then copies the 32-bit values into 16-bit ones. So there is a lot of room to optimize Parquet level decoding. Currently, an RleDecoder ( https://github.com/apache/arrow/blob/dc2c5c66f5234a92169da76613399135786dbffb/cpp/src/arrow/util/rle_encoding.h#L88 ) is used. (Maybe I should add a benchmark for that? See: https://github.com/apache/arrow/issues/39630 ). I tried to optimize it, but found it a bit hard to do. Currently, the bottleneck might be "find-each-run", whose data-dependent branches are hard for the CPU pipeline to predict.
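To illustrate the "find-each-run" step, here is a minimal sketch of decoding the Parquet RLE/bit-packed hybrid encoding into int16_t levels. It is not Arrow's RleDecoder; DecodeLevels and its signature are hypothetical, and it assumes the level bytes have already been located (no length prefix) and that bit_width is in [1, 15]. The branch on the low bit of each run header is the data-dependent part, and it is taken once per run, so short runs (like LevelRepeatCount:7 in the benchmarks) hit it often.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Decodes num_values levels from an RLE/bit-packed hybrid buffer.
// Returns true if exactly num_values levels were appended to out.
bool DecodeLevels(const uint8_t* data, size_t size, int bit_width,
                  size_t num_values, std::vector<int16_t>* out) {
  assert(bit_width >= 1 && bit_width <= 15);
  const size_t value_bytes = (static_cast<size_t>(bit_width) + 7) / 8;
  const uint32_t mask = (1u << bit_width) - 1;
  size_t pos = 0;
  while (out->size() < num_values && pos < size) {
    // Run header: ULEB128 varint.
    uint32_t header = 0;
    for (int shift = 0;; shift += 7) {
      if (pos >= size || shift > 28) return false;  // truncated or too long
      const uint8_t byte = data[pos++];
      header |= static_cast<uint32_t>(byte & 0x7F) << shift;
      if ((byte & 0x80) == 0) break;
    }
    if (header & 1) {
      // Bit-packed run: (header >> 1) groups of 8 values, LSB-first.
      const size_t groups = header >> 1;
      const size_t run_bytes = groups * bit_width;
      if (size - pos < run_bytes) return false;
      for (size_t i = 0; i < groups * 8 && out->size() < num_values; ++i) {
        const size_t bit_offset = i * bit_width;
        uint32_t word = 0;  // a value spans at most 3 bytes
        for (size_t b = 0; b < 3; ++b) {
          const size_t idx = bit_offset / 8 + b;
          if (idx < run_bytes) {
            word |= static_cast<uint32_t>(data[pos + idx]) << (8 * b);
          }
        }
        out->push_back(static_cast<int16_t>((word >> (bit_offset % 8)) & mask));
      }
      pos += run_bytes;
    } else {
      // RLE run: one value repeated (header >> 1) times.
      if (size - pos < value_bytes) return false;
      uint32_t value = 0;
      for (size_t b = 0; b < value_bytes; ++b) {
        value |= static_cast<uint32_t>(data[pos++]) << (8 * b);
      }
      const size_t count =
          std::min<size_t>(header >> 1, num_values - out->size());
      out->insert(out->end(), count, static_cast<int16_t>(value & mask));
    }
  }
  return out->size() == num_values;
}
```

For example, with bit_width 1 the bytes 0x0A 0x01 0x03 0xA5 hold an RLE run of five 1s followed by one bit-packed group of 8 values. The bit-packed branch is where a faster unpack16 would plug in; the header parsing and the run dispatch stay scalar either way.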
We can try different ways to optimize level decoding:
Component(s)
C++, Parquet