apache / paimon

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.
https://paimon.apache.org/
Apache License 2.0

[core] Update writer/reader benchmark for bloom filter #4350

Closed FangYongs closed 1 month ago

FangYongs commented 1 month ago

Purpose

Linked issue: close #xxx

The bloom filter is configured with expected rows = 5,000,000 and fpp (false positive probability) = 0.01. All times are in milliseconds. The writer results are as follows:

| Records | Value Data Size | Without Bloom Filter (ms) | With Bloom Filter (ms) | Relative (with / without) |
|---|---|---|---|---|
| 100000 | 0B | 6 | 9 | 1.5 |
| 100000 | 64B | 11 | 13 | 1.181818182 |
| 100000 | 500B | 37 | 34 | 0.918918919 |
| 100000 | 1000B | 57 | 58 | 1.01754386 |
| 100000 | 2000B | 88 | 95 | 1.079545455 |
| 1000000 | 0B | 32 | 64 | 2 |
| 1000000 | 64B | 74 | 114 | 1.540540541 |
| 1000000 | 500B | 253 | 310 | 1.225296443 |
| 1000000 | 1000B | 384 | 440 | 1.145833333 |
| 1000000 | 2000B | 534 | 625 | 1.170411985 |
| 5000000 | 0B | 164 | 327 | 1.993902439 |
| 5000000 | 64B | 356 | 511 | 1.435393258 |
| 5000000 | 500B | 1072 | 1197 | 1.116604478 |
| 5000000 | 1000B | 1921 | 2338 | 1.21707444 |
| 5000000 | 2000B | 3163 | 3480 | 1.100221309 |
| 10000000 | 0B | 344 | 733 | 2.130813953 |
| 10000000 | 64B | 659 | 1220 | 1.851289833 |
| 10000000 | 500B | 2815 | 2991 | 1.062522202 |
| 10000000 | 1000B | 4914 | 5407 | 1.1003256 |
| 10000000 | 2000B | 8646 | 9170 | 1.060606061 |
| 15000000 | 0B | 517 | 1028 | 1.988394584 |
| 15000000 | 64B | 1169 | 1718 | 1.469632164 |
| 15000000 | 500B | 5435 | 5170 | 0.95124195 |
| 15000000 | 1000B | 9962 | 10255 | 1.029411765 |
| 15000000 | 2000B | 16203 | 18573 | 1.146269209 |
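
For context on the two parameters, here is a minimal sketch of the textbook bloom filter sizing formulas applied to expected rows = 5,000,000 and fpp = 0.01. It assumes the standard formulas and a single filter covering all keys; the PR does not state how Paimon's implementation sizes or partitions its filters, and the function names `bloom_filter_size` and `estimated_fpp` are hypothetical.

```python
import math


def bloom_filter_size(expected_rows, fpp):
    """Return (num_bits, num_hashes) from the textbook sizing formulas.

    num_bits   = -n * ln(p) / (ln 2)^2
    num_hashes = (num_bits / n) * ln 2, rounded to a whole number
    """
    bits_per_row = -math.log(fpp) / (math.log(2) ** 2)
    num_bits = math.ceil(expected_rows * bits_per_row)
    num_hashes = max(1, round(num_bits / expected_rows * math.log(2)))
    return num_bits, num_hashes


def estimated_fpp(num_bits, num_hashes, rows):
    """Approximate false positive probability after inserting `rows` keys."""
    return (1.0 - math.exp(-num_hashes * rows / num_bits)) ** num_hashes


# Parameters taken from the PR description.
num_bits, num_hashes = bloom_filter_size(5_000_000, 0.01)
print(f"bits={num_bits} (~{num_bits / 8 / 1024 / 1024:.1f} MiB), hashes={num_hashes}")

# Assumption: one filter sized for 5,000,000 rows receives every key.
for rows in (5_000_000, 10_000_000, 15_000_000):
    print(rows, f"{estimated_fpp(num_bits, num_hashes, rows):.3f}")
```

Under these assumptions the filter needs roughly 48 million bits with 7 hash functions, and the estimated false positive probability rises well above 0.01 once the number of inserted keys exceeds the expected rows.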

The reader results for queries on keys that are definitely stored are as follows. They indicate that the bloom filter causes essentially no performance degradation: the relative time stays between 0.80 and 1.19, with no consistent trend across record counts or value sizes.

| Records | Value Data Size | Without Bloom Filter (ms) | With Bloom Filter (ms) | Relative (with / without) |
|---|---|---|---|---|
| 100000 | 0B | 799 | 807 | 1.010012516 |
| 100000 | 64B | 205 | 193 | 0.941463415 |
| 100000 | 500B | 171 | 140 | 0.81871345 |
| 100000 | 1000B | 168 | 187 | 1.113095238 |
| 100000 | 2000B | 170 | 173 | 1.017647059 |
| 1000000 | 0B | 791 | 803 | 1.01517067 |
| 1000000 | 64B | 221 | 211 | 0.954751131 |
| 1000000 | 500B | 145 | 163 | 1.124137931 |
| 1000000 | 1000B | 152 | 181 | 1.190789474 |
| 1000000 | 2000B | 162 | 164 | 1.012345679 |
| 5000000 | 0B | 789 | 778 | 0.986058302 |
| 5000000 | 64B | 221 | 224 | 1.013574661 |
| 5000000 | 500B | 175 | 161 | 0.92 |
| 5000000 | 1000B | 178 | 181 | 1.016853933 |
| 5000000 | 2000B | 186 | 186 | 1 |
| 10000000 | 0B | 776 | 797 | 1.027061856 |
| 10000000 | 64B | 191 | 201 | 1.052356021 |
| 10000000 | 500B | 142 | 151 | 1.063380282 |
| 10000000 | 1000B | 142 | 159 | 1.11971831 |
| 10000000 | 2000B | 159 | 166 | 1.044025157 |
| 15000000 | 0B | 820 | 800 | 0.975609756 |
| 15000000 | 64B | 260 | 209 | 0.803846154 |
| 15000000 | 500B | 139 | 151 | 1.086330935 |
| 15000000 | 1000B | 149 | 155 | 1.040268456 |
| 15000000 | 2000B | 157 | 164 | 1.044585987 |

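The last column of the table above is the time with the bloom filter divided by the time without it. As a rough sketch of how the spread of that column can be summarized, the snippet below recomputes the ratios from the (without, with) timings copied from the table for stored keys. The helper names `relative` and `summarize` are hypothetical, and the geometric mean is one reasonable way to average ratios, not something the PR itself reports.

```python
import math

# (without_ms, with_ms) pairs copied from the reader table for stored keys,
# in row order: five value sizes for each of the five record counts.
TIMINGS_MS = [
    (799, 807), (205, 193), (171, 140), (168, 187), (170, 173),
    (791, 803), (221, 211), (145, 163), (152, 181), (162, 164),
    (789, 778), (221, 224), (175, 161), (178, 181), (186, 186),
    (776, 797), (191, 201), (142, 151), (142, 159), (159, 166),
    (820, 800), (260, 209), (139, 151), (149, 155), (157, 164),
]


def relative(without_ms, with_ms):
    """Ratio reported in the last column: values above 1 mean the filter is slower."""
    return with_ms / without_ms


def summarize(pairs):
    """Return (min, max, geometric mean) of the relative times."""
    ratios = [relative(without_ms, with_ms) for without_ms, with_ms in pairs]
    geo_mean = math.exp(sum(math.log(r) for r in ratios) / len(ratios))
    return min(ratios), max(ratios), geo_mean


low, high, geo = summarize(TIMINGS_MS)
print(f"min={low:.3f} max={high:.3f} geometric mean={geo:.3f}")
```

A geometric mean close to 1 is consistent with the claim that the filter adds little overhead when the key exists; a single run per cell cannot separate small differences from measurement noise.
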
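The reader results for queries on keys that are definitely not stored are as follows. They indicate that the bloom filter can greatly improve performance: it never slows a lookup down and cuts the time by up to two thirds (relative time between 0.33 and 1.0), although the gain narrows at 15,000,000 records.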

| Records | Value Data Size | Without Bloom Filter (ms) | With Bloom Filter (ms) | Relative (with / without) |
|---|---|---|---|---|
| 100000 | 0B | 6 | 3 | 0.5 |
| 100000 | 64B | 3 | 2 | 0.666666667 |
| 100000 | 500B | 3 | 1 | 0.333333333 |
| 100000 | 1000B | 4 | 2 | 0.5 |
| 100000 | 2000B | 4 | 2 | 0.5 |
| 1000000 | 0B | 3 | 2 | 0.666666667 |
| 1000000 | 64B | 3 | 2 | 0.666666667 |
| 1000000 | 500B | 3 | 2 | 0.666666667 |
| 1000000 | 1000B | 4 | 2 | 0.5 |
| 1000000 | 2000B | 6 | 3 | 0.5 |
| 5000000 | 0B | 3 | 3 | 1 |
| 5000000 | 64B | 3 | 2 | 0.666666667 |
| 5000000 | 500B | 5 | 3 | 0.6 |
| 5000000 | 1000B | 6 | 3 | 0.5 |
| 5000000 | 2000B | 10 | 5 | 0.5 |
| 10000000 | 0B | 4 | 3 | 0.75 |
| 10000000 | 64B | 5 | 4 | 0.8 |
| 10000000 | 500B | 6 | 4 | 0.666666667 |
| 10000000 | 1000B | 7 | 5 | 0.714285714 |
| 10000000 | 2000B | 11 | 8 | 0.727272727 |
| 15000000 | 0B | 5 | 4 | 0.8 |
| 15000000 | 64B | 5 | 4 | 0.8 |
| 15000000 | 500B | 7 | 6 | 0.857142857 |
| 15000000 | 1000B | 8 | 8 | 1 |
| 15000000 | 2000B | 14 | 13 | 0.928571429 |

FangYongs commented 1 month ago

@JingsongLi Please help to review this PR, thanks

JingsongLi commented 1 month ago

+1