apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0

Benchmark data table serialization logic and pre-allocate byte[] array if need be #6714

Open mqliang opened 3 years ago

mqliang commented 3 years ago

As @siddharthteotia pointed out in https://github.com/apache/incubator-pinot/pull/6710#discussion_r599240463:

The serialization functions first write to a temporary output stream and then convert it to a byte array, which is returned to the caller and written to the main stream. I think the reason for doing that is that we don't know upfront what length of byte[] array to allocate.
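For reference, the temporary-output-stream pattern described above can be sketched as follows. This is only an illustration, not Pinot's actual code: the class name `TempStreamSerializer` and the wire layout (entry count, then length-prefixed UTF-8 keys and values) are assumptions made for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Pinot.
class TempStreamSerializer {
  // Assumed layout: [entry count][key length][key bytes][value length][value bytes]...
  static byte[] serialize(Map<String, String> metadata) {
    // The final size is unknown upfront, so write into a growable temporary buffer.
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(buffer)) {
      out.writeInt(metadata.size());
      for (Map.Entry<String, String> entry : metadata.entrySet()) {
        byte[] keyBytes = entry.getKey().getBytes(StandardCharsets.UTF_8);
        out.writeInt(keyBytes.length);
        out.write(keyBytes);
        byte[] valueBytes = entry.getValue().getBytes(StandardCharsets.UTF_8);
        out.writeInt(valueBytes.length);
        out.write(valueBytes);
      }
    } catch (IOException e) {
      // Not expected for an in-memory stream; rethrown unchecked to keep the API simple.
      throw new UncheckedIOException(e);
    }
    // toByteArray() copies the internal buffer into a right-sized array.
    return buffer.toByteArray();
  }

  public static void main(String[] args) {
    Map<String, String> metadata = new LinkedHashMap<>();
    metadata.put("numDocsScanned", "1000");
    System.out.println("serialized bytes: " + serialize(metadata).length);
  }
}
```

Each key and value is encoded once, but the bytes are copied again when the temporary buffer grows and when `toByteArray()` is called.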

However, we could probably do it differently, and it might be faster:

  • Write a loop that goes over each entry and keeps a running sum of the sizes
  • At the end of the loop, allocate a byte array of that size
  • Start another loop that goes over each entry again and fills the pre-allocated byte array
  • Return the filled byte array
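The steps above can be sketched roughly as follows. This is a minimal illustration under assumptions, not the proposed patch: the class name `TwoPassSerializer` and the layout (entry count, then length-prefixed UTF-8 keys and values) are made up for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Pinot.
class TwoPassSerializer {
  // Assumed layout: [entry count][key length][key bytes][value length][value bytes]...
  static byte[] serialize(Map<String, String> metadata) {
    // Pass 1: keep a running sum of the encoded size.
    int size = Integer.BYTES; // entry count
    for (Map.Entry<String, String> entry : metadata.entrySet()) {
      size += Integer.BYTES + entry.getKey().getBytes(StandardCharsets.UTF_8).length;
      size += Integer.BYTES + entry.getValue().getBytes(StandardCharsets.UTF_8).length;
    }

    // Allocate the result once, at its exact final size.
    byte[] result = new byte[size];
    ByteBuffer buffer = ByteBuffer.wrap(result); // big-endian by default

    // Pass 2: encode every entry again and fill the pre-allocated array.
    buffer.putInt(metadata.size());
    for (Map.Entry<String, String> entry : metadata.entrySet()) {
      byte[] keyBytes = entry.getKey().getBytes(StandardCharsets.UTF_8);
      buffer.putInt(keyBytes.length);
      buffer.put(keyBytes);
      byte[] valueBytes = entry.getValue().getBytes(StandardCharsets.UTF_8);
      buffer.putInt(valueBytes.length);
      buffer.put(valueBytes);
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, String> metadata = new LinkedHashMap<>();
    metadata.put("numDocsScanned", "1000");
    System.out.println("serialized bytes: " + serialize(metadata).length);
  }
}
```

Note that in this form every string is UTF-8 encoded twice, once per pass, so the saving from a single allocation has to outweigh the extra encoding work.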

We need to benchmark these two serialization approaches. If the proposed approach is better, I will send a PR to address it.

mqliang commented 3 years ago

I wrote a benchmark here: https://github.com/mqliang/pinot/commit/7892423579b20dafcb5802a09f20f826377f6c39

The benchmark compares three serialization methods, each serializing a typical metadata map: preAllocateByteArrayNative, preAllocateByteArrayWithBytesCache, and temporaryOutputStream.

Here is the result:

# JMH version: 1.26
# VM version: JDK 1.8.0_282, OpenJDK 64-Bit Server VM, 25.282-b08
# VM invoker: /Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/jre/bin/java
# VM options: -javaagent:/Users/mqliang/Library/Application Support/JetBrains/Toolbox/apps/IDEA-U/ch-0/203.7717.56/IntelliJ IDEA.app/Contents/lib/idea_rt.jar=65146:/Users/mqliang/Library/Application Support/JetBrains/Toolbox/apps/IDEA-U/ch-0/203.7717.56/IntelliJ IDEA.app/Contents/bin -Dfile.encoding=UTF-8
# Warmup: 1 iterations, 10 s each
# Measurement: 5 iterations, 30 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: org.apache.pinot.perf.BenchmarkDataTableSerialization.preAllocateByteArrayNative

# Run progress: 0.00% complete, ETA 00:08:00
# Fork: 1 of 1
# Warmup Iteration   1: 552.178 us/op
Iteration   1: 519.531 us/op
                 ·gc.alloc.rate:                   3270.480 MB/sec
                 ·gc.alloc.rate.norm:              1811608.009 B/op
                 ·gc.churn.PS_Eden_Space:          3275.114 MB/sec
                 ·gc.churn.PS_Eden_Space.norm:     1814175.318 B/op
                 ·gc.churn.PS_Survivor_Space:      0.558 MB/sec
                 ·gc.churn.PS_Survivor_Space.norm: 309.168 B/op
                 ·gc.count:                        525.000 counts
                 ·gc.time:                         261.000 ms

Iteration   2: 524.659 us/op
                 ·gc.alloc.rate:                   3238.871 MB/sec
                 ·gc.alloc.rate.norm:              1811608.011 B/op
                 ·gc.churn.PS_Eden_Space:          3242.901 MB/sec
                 ·gc.churn.PS_Eden_Space.norm:     1813862.347 B/op
                 ·gc.churn.PS_Survivor_Space:      0.563 MB/sec
                 ·gc.churn.PS_Survivor_Space.norm: 314.968 B/op
                 ·gc.count:                        516.000 counts
                 ·gc.time:                         263.000 ms

Iteration   3: 526.323 us/op
                 ·gc.alloc.rate:                   3228.230 MB/sec
                 ·gc.alloc.rate.norm:              1811608.008 B/op
                 ·gc.churn.PS_Eden_Space:          3232.024 MB/sec
                 ·gc.churn.PS_Eden_Space.norm:     1813736.682 B/op
                 ·gc.churn.PS_Survivor_Space:      0.471 MB/sec
                 ·gc.churn.PS_Survivor_Space.norm: 264.539 B/op
                 ·gc.count:                        470.000 counts
                 ·gc.time:                         254.000 ms

Iteration   4: 521.779 us/op
                 ·gc.alloc.rate:                   3256.320 MB/sec
                 ·gc.alloc.rate.norm:              1811608.008 B/op
                 ·gc.churn.PS_Eden_Space:          3261.433 MB/sec
                 ·gc.churn.PS_Eden_Space.norm:     1814452.617 B/op
                 ·gc.churn.PS_Survivor_Space:      0.560 MB/sec
                 ·gc.churn.PS_Survivor_Space.norm: 311.772 B/op
                 ·gc.count:                        534.000 counts
                 ·gc.time:                         270.000 ms

Iteration   5: 524.474 us/op
                 ·gc.alloc.rate:                   3239.855 MB/sec
                 ·gc.alloc.rate.norm:              1811608.008 B/op
                 ·gc.churn.PS_Eden_Space:          3242.045 MB/sec
                 ·gc.churn.PS_Eden_Space.norm:     1812832.659 B/op
                 ·gc.churn.PS_Survivor_Space:      0.547 MB/sec
                 ·gc.churn.PS_Survivor_Space.norm: 305.975 B/op
                 ·gc.count:                        483.000 counts
                 ·gc.time:                         255.000 ms

Result "org.apache.pinot.perf.BenchmarkDataTableSerialization.preAllocateByteArrayNative":
  523.353 ±(99.9%) 10.345 us/op [Average]
  (min, avg, max) = (519.531, 523.353, 526.323), stdev = 2.687
  CI (99.9%): [513.008, 533.698] (assumes normal distribution)

Secondary result "org.apache.pinot.perf.BenchmarkDataTableSerialization.preAllocateByteArrayNative:·gc.alloc.rate":
  3246.751 ±(99.9%) 64.066 MB/sec [Average]
  (min, avg, max) = (3228.230, 3246.751, 3270.480), stdev = 16.638
  CI (99.9%): [3182.685, 3310.818] (assumes normal distribution)

Secondary result "org.apache.pinot.perf.BenchmarkDataTableSerialization.preAllocateByteArrayNative:·gc.alloc.rate.norm":
  1811608.009 ±(99.9%) 0.005 B/op [Average]
  (min, avg, max) = (1811608.008, 1811608.009, 1811608.011), stdev = 0.001
  CI (99.9%): [1811608.003, 1811608.014] (assumes normal distribution)

Secondary result "org.apache.pinot.perf.BenchmarkDataTableSerialization.preAllocateByteArrayNative:·gc.churn.PS_Eden_Space":
  3250.704 ±(99.9%) 66.578 MB/sec [Average]
  (min, avg, max) = (3232.024, 3250.704, 3275.114), stdev = 17.290
  CI (99.9%): [3184.126, 3317.282] (assumes normal distribution)

Secondary result "org.apache.pinot.perf.BenchmarkDataTableSerialization.preAllocateByteArrayNative:·gc.churn.PS_Eden_Space.norm":
  1813811.924 ±(99.9%) 2365.646 B/op [Average]
  (min, avg, max) = (1812832.659, 1813811.924, 1814452.617), stdev = 614.351
  CI (99.9%): [1811446.279, 1816177.570] (assumes normal distribution)

Secondary result "org.apache.pinot.perf.BenchmarkDataTableSerialization.preAllocateByteArrayNative:·gc.churn.PS_Survivor_Space":
  0.540 ±(99.9%) 0.150 MB/sec [Average]
  (min, avg, max) = (0.471, 0.540, 0.563), stdev = 0.039
  CI (99.9%): [0.390, 0.690] (assumes normal distribution)

Secondary result "org.apache.pinot.perf.BenchmarkDataTableSerialization.preAllocateByteArrayNative:·gc.churn.PS_Survivor_Space.norm":
  301.285 ±(99.9%) 80.118 B/op [Average]
  (min, avg, max) = (264.539, 301.285, 314.968), stdev = 20.806
  CI (99.9%): [221.166, 381.403] (assumes normal distribution)

Secondary result "org.apache.pinot.perf.BenchmarkDataTableSerialization.preAllocateByteArrayNative:·gc.count":
  2528.000 ±(99.9%) 0.001 counts [Sum]
  (min, avg, max) = (470.000, 505.600, 534.000), stdev = 27.700
  CI (99.9%): [2528.000, 2528.000] (assumes normal distribution)

Secondary result "org.apache.pinot.perf.BenchmarkDataTableSerialization.preAllocateByteArrayNative:·gc.time":
  1303.000 ±(99.9%) 0.001 ms [Sum]
  (min, avg, max) = (254.000, 260.600, 270.000), stdev = 6.504
  CI (99.9%): [1303.000, 1303.000] (assumes normal distribution)

# JMH version: 1.26
# VM version: JDK 1.8.0_282, OpenJDK 64-Bit Server VM, 25.282-b08
# VM invoker: /Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/jre/bin/java
# VM options: -javaagent:/Users/mqliang/Library/Application Support/JetBrains/Toolbox/apps/IDEA-U/ch-0/203.7717.56/IntelliJ IDEA.app/Contents/lib/idea_rt.jar=65146:/Users/mqliang/Library/Application Support/JetBrains/Toolbox/apps/IDEA-U/ch-0/203.7717.56/IntelliJ IDEA.app/Contents/bin -Dfile.encoding=UTF-8
# Warmup: 1 iterations, 10 s each
# Measurement: 5 iterations, 30 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: org.apache.pinot.perf.BenchmarkDataTableSerialization.preAllocateByteArrayWithBytesCache

# Run progress: 33.33% complete, ETA 00:05:28
# Fork: 1 of 1
# Warmup Iteration   1: 390.616 us/op
Iteration   1: 375.676 us/op
                 ·gc.alloc.rate:                   3524.091 MB/sec
                 ·gc.alloc.rate.norm:              1411608.008 B/op
                 ·gc.churn.PS_Eden_Space:          3532.601 MB/sec
                 ·gc.churn.PS_Eden_Space.norm:     1415016.587 B/op
                 ·gc.churn.PS_Survivor_Space:      0.538 MB/sec
                 ·gc.churn.PS_Survivor_Space.norm: 215.400 B/op
                 ·gc.count:                        458.000 counts
                 ·gc.time:                         248.000 ms

Iteration   2: 375.171 us/op
                 ·gc.alloc.rate:                   3528.907 MB/sec
                 ·gc.alloc.rate.norm:              1411608.006 B/op
                 ·gc.churn.PS_Eden_Space:          3534.356 MB/sec
                 ·gc.churn.PS_Eden_Space.norm:     1413787.624 B/op
                 ·gc.churn.PS_Survivor_Space:      0.494 MB/sec
                 ·gc.churn.PS_Survivor_Space.norm: 197.609 B/op
                 ·gc.count:                        435.000 counts
                 ·gc.time:                         247.000 ms

Iteration   3: 373.233 us/op
                 ·gc.alloc.rate:                   3547.720 MB/sec
                 ·gc.alloc.rate.norm:              1411608.005 B/op
                 ·gc.churn.PS_Eden_Space:          3559.728 MB/sec
                 ·gc.churn.PS_Eden_Space.norm:     1416385.929 B/op
                 ·gc.churn.PS_Survivor_Space:      0.539 MB/sec
                 ·gc.churn.PS_Survivor_Space.norm: 214.343 B/op
                 ·gc.count:                        462.000 counts
                 ·gc.time:                         247.000 ms

Iteration   4: 371.186 us/op
                 ·gc.alloc.rate:                   3567.068 MB/sec
                 ·gc.alloc.rate.norm:              1411608.006 B/op
                 ·gc.churn.PS_Eden_Space:          3566.702 MB/sec
                 ·gc.churn.PS_Eden_Space.norm:     1411463.405 B/op
                 ·gc.churn.PS_Survivor_Space:      0.597 MB/sec
                 ·gc.churn.PS_Survivor_Space.norm: 236.411 B/op
                 ·gc.count:                        520.000 counts
                 ·gc.time:                         271.000 ms

Iteration   5: 370.738 us/op
                 ·gc.alloc.rate:                   3571.354 MB/sec
                 ·gc.alloc.rate.norm:              1411608.005 B/op
                 ·gc.churn.PS_Eden_Space:          3582.874 MB/sec
                 ·gc.churn.PS_Eden_Space.norm:     1416161.234 B/op
                 ·gc.churn.PS_Survivor_Space:      0.588 MB/sec
                 ·gc.churn.PS_Survivor_Space.norm: 232.322 B/op
                 ·gc.count:                        509.000 counts
                 ·gc.time:                         262.000 ms

Result "org.apache.pinot.perf.BenchmarkDataTableSerialization.preAllocateByteArrayWithBytesCache":
  373.201 ±(99.9%) 8.639 us/op [Average]
  (min, avg, max) = (370.738, 373.201, 375.676), stdev = 2.243
  CI (99.9%): [364.562, 381.840] (assumes normal distribution)

Secondary result "org.apache.pinot.perf.BenchmarkDataTableSerialization.preAllocateByteArrayWithBytesCache:·gc.alloc.rate":
  3547.828 ±(99.9%) 82.702 MB/sec [Average]
  (min, avg, max) = (3524.091, 3547.828, 3571.354), stdev = 21.477
  CI (99.9%): [3465.126, 3630.530] (assumes normal distribution)

Secondary result "org.apache.pinot.perf.BenchmarkDataTableSerialization.preAllocateByteArrayWithBytesCache:·gc.alloc.rate.norm":
  1411608.006 ±(99.9%) 0.005 B/op [Average]
  (min, avg, max) = (1411608.005, 1411608.006, 1411608.008), stdev = 0.001
  CI (99.9%): [1411608.001, 1411608.011] (assumes normal distribution)

Secondary result "org.apache.pinot.perf.BenchmarkDataTableSerialization.preAllocateByteArrayWithBytesCache:·gc.churn.PS_Eden_Space":
  3555.252 ±(99.9%) 83.120 MB/sec [Average]
  (min, avg, max) = (3532.601, 3555.252, 3582.874), stdev = 21.586
  CI (99.9%): [3472.132, 3638.373] (assumes normal distribution)

Secondary result "org.apache.pinot.perf.BenchmarkDataTableSerialization.preAllocateByteArrayWithBytesCache:·gc.churn.PS_Eden_Space.norm":
  1414562.956 ±(99.9%) 7771.212 B/op [Average]
  (min, avg, max) = (1411463.405, 1414562.956, 1416385.929), stdev = 2018.159
  CI (99.9%): [1406791.744, 1422334.168] (assumes normal distribution)

Secondary result "org.apache.pinot.perf.BenchmarkDataTableSerialization.preAllocateByteArrayWithBytesCache:·gc.churn.PS_Survivor_Space":
  0.551 ±(99.9%) 0.162 MB/sec [Average]
  (min, avg, max) = (0.494, 0.551, 0.597), stdev = 0.042
  CI (99.9%): [0.389, 0.713] (assumes normal distribution)

Secondary result "org.apache.pinot.perf.BenchmarkDataTableSerialization.preAllocateByteArrayWithBytesCache:·gc.churn.PS_Survivor_Space.norm":
  219.217 ±(99.9%) 60.046 B/op [Average]
  (min, avg, max) = (197.609, 219.217, 236.411), stdev = 15.594
  CI (99.9%): [159.171, 279.263] (assumes normal distribution)

Secondary result "org.apache.pinot.perf.BenchmarkDataTableSerialization.preAllocateByteArrayWithBytesCache:·gc.count":
  2384.000 ±(99.9%) 0.001 counts [Sum]
  (min, avg, max) = (435.000, 476.800, 520.000), stdev = 36.134
  CI (99.9%): [2384.000, 2384.000] (assumes normal distribution)

Secondary result "org.apache.pinot.perf.BenchmarkDataTableSerialization.preAllocateByteArrayWithBytesCache:·gc.time":
  1275.000 ±(99.9%) 0.001 ms [Sum]
  (min, avg, max) = (247.000, 255.000, 271.000), stdev = 10.977
  CI (99.9%): [1275.000, 1275.000] (assumes normal distribution)

# JMH version: 1.26
# VM version: JDK 1.8.0_282, OpenJDK 64-Bit Server VM, 25.282-b08
# VM invoker: /Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/jre/bin/java
# VM options: -javaagent:/Users/mqliang/Library/Application Support/JetBrains/Toolbox/apps/IDEA-U/ch-0/203.7717.56/IntelliJ IDEA.app/Contents/lib/idea_rt.jar=65146:/Users/mqliang/Library/Application Support/JetBrains/Toolbox/apps/IDEA-U/ch-0/203.7717.56/IntelliJ IDEA.app/Contents/bin -Dfile.encoding=UTF-8
# Warmup: 1 iterations, 10 s each
# Measurement: 5 iterations, 30 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: org.apache.pinot.perf.BenchmarkDataTableSerialization.temporaryOutputStream

# Run progress: 66.67% complete, ETA 00:02:44
# Fork: 1 of 1
# Warmup Iteration   1: 483.366 us/op
Iteration   1: 408.351 us/op
                 ·gc.alloc.rate:                   3580.758 MB/sec
                 ·gc.alloc.rate.norm:              1558808.007 B/op
                 ·gc.churn.PS_Eden_Space:          3586.078 MB/sec
                 ·gc.churn.PS_Eden_Space.norm:     1561123.846 B/op
                 ·gc.churn.PS_Survivor_Space:      0.511 MB/sec
                 ·gc.churn.PS_Survivor_Space.norm: 222.603 B/op
                 ·gc.count:                        476.000 counts
                 ·gc.time:                         253.000 ms

Iteration   2: 410.342 us/op
                 ·gc.alloc.rate:                   3563.256 MB/sec
                 ·gc.alloc.rate.norm:              1558808.009 B/op
                 ·gc.churn.PS_Eden_Space:          3569.765 MB/sec
                 ·gc.churn.PS_Eden_Space.norm:     1561655.686 B/op
                 ·gc.churn.PS_Survivor_Space:      0.451 MB/sec
                 ·gc.churn.PS_Survivor_Space.norm: 197.394 B/op
                 ·gc.count:                        409.000 counts
                 ·gc.time:                         244.000 ms

Iteration   3: 407.314 us/op
                 ·gc.alloc.rate:                   3589.291 MB/sec
                 ·gc.alloc.rate.norm:              1558808.006 B/op
                 ·gc.churn.PS_Eden_Space:          3592.335 MB/sec
                 ·gc.churn.PS_Eden_Space.norm:     1560130.076 B/op
                 ·gc.churn.PS_Survivor_Space:      0.557 MB/sec
                 ·gc.churn.PS_Survivor_Space.norm: 241.833 B/op
                 ·gc.count:                        495.000 counts
                 ·gc.time:                         261.000 ms

Iteration   4: 407.294 us/op
                 ·gc.alloc.rate:                   3590.035 MB/sec
                 ·gc.alloc.rate.norm:              1558808.006 B/op
                 ·gc.churn.PS_Eden_Space:          3595.643 MB/sec
                 ·gc.churn.PS_Eden_Space.norm:     1561243.143 B/op
                 ·gc.churn.PS_Survivor_Space:      0.439 MB/sec
                 ·gc.churn.PS_Survivor_Space.norm: 190.513 B/op
                 ·gc.count:                        382.000 counts
                 ·gc.time:                         239.000 ms

Iteration   5: 410.068 us/op
                 ·gc.alloc.rate:                   3565.783 MB/sec
                 ·gc.alloc.rate.norm:              1558808.006 B/op
                 ·gc.churn.PS_Eden_Space:          3576.571 MB/sec
                 ·gc.churn.PS_Eden_Space.norm:     1563524.046 B/op
                 ·gc.churn.PS_Survivor_Space:      0.542 MB/sec
                 ·gc.churn.PS_Survivor_Space.norm: 236.741 B/op
                 ·gc.count:                        460.000 counts
                 ·gc.time:                         252.000 ms

Result "org.apache.pinot.perf.BenchmarkDataTableSerialization.temporaryOutputStream":
  408.674 ±(99.9%) 5.641 us/op [Average]
  (min, avg, max) = (407.294, 408.674, 410.342), stdev = 1.465
  CI (99.9%): [403.033, 414.314] (assumes normal distribution)

Secondary result "org.apache.pinot.perf.BenchmarkDataTableSerialization.temporaryOutputStream:·gc.alloc.rate":
  3577.824 ±(99.9%) 48.952 MB/sec [Average]
  (min, avg, max) = (3563.256, 3577.824, 3590.035), stdev = 12.713
  CI (99.9%): [3528.873, 3626.776] (assumes normal distribution)

Secondary result "org.apache.pinot.perf.BenchmarkDataTableSerialization.temporaryOutputStream:·gc.alloc.rate.norm":
  1558808.007 ±(99.9%) 0.005 B/op [Average]
  (min, avg, max) = (1558808.006, 1558808.007, 1558808.009), stdev = 0.001
  CI (99.9%): [1558808.002, 1558808.011] (assumes normal distribution)

Secondary result "org.apache.pinot.perf.BenchmarkDataTableSerialization.temporaryOutputStream:·gc.churn.PS_Eden_Space":
  3584.078 ±(99.9%) 41.614 MB/sec [Average]
  (min, avg, max) = (3569.765, 3584.078, 3595.643), stdev = 10.807
  CI (99.9%): [3542.465, 3625.692] (assumes normal distribution)

Secondary result "org.apache.pinot.perf.BenchmarkDataTableSerialization.temporaryOutputStream:·gc.churn.PS_Eden_Space.norm":
  1561535.360 ±(99.9%) 4793.590 B/op [Average]
  (min, avg, max) = (1560130.076, 1561535.360, 1563524.046), stdev = 1244.880
  CI (99.9%): [1556741.769, 1566328.950] (assumes normal distribution)

Secondary result "org.apache.pinot.perf.BenchmarkDataTableSerialization.temporaryOutputStream:·gc.churn.PS_Survivor_Space":
  0.500 ±(99.9%) 0.204 MB/sec [Average]
  (min, avg, max) = (0.439, 0.500, 0.557), stdev = 0.053
  CI (99.9%): [0.296, 0.704] (assumes normal distribution)

Secondary result "org.apache.pinot.perf.BenchmarkDataTableSerialization.temporaryOutputStream:·gc.churn.PS_Survivor_Space.norm":
  217.817 ±(99.9%) 88.656 B/op [Average]
  (min, avg, max) = (190.513, 217.817, 241.833), stdev = 23.024
  CI (99.9%): [129.161, 306.473] (assumes normal distribution)

Secondary result "org.apache.pinot.perf.BenchmarkDataTableSerialization.temporaryOutputStream:·gc.count":
  2222.000 ±(99.9%) 0.001 counts [Sum]
  (min, avg, max) = (382.000, 444.400, 495.000), stdev = 47.300
  CI (99.9%): [2222.000, 2222.000] (assumes normal distribution)

Secondary result "org.apache.pinot.perf.BenchmarkDataTableSerialization.temporaryOutputStream:·gc.time":
  1249.000 ±(99.9%) 0.001 ms [Sum]
  (min, avg, max) = (239.000, 249.800, 261.000), stdev = 8.526
  CI (99.9%): [1249.000, 1249.000] (assumes normal distribution)

# Run complete. Total time: 00:08:12

REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
experiments, perform baseline and negative tests that provide experimental control, make sure
the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.

Benchmark                                                                                            Mode  Cnt        Score      Error   Units
BenchmarkDataTableSerialization.preAllocateByteArrayNative                                           avgt    5      523.353 ±   10.345   us/op
BenchmarkDataTableSerialization.preAllocateByteArrayNative:·gc.alloc.rate                            avgt    5     3246.751 ±   64.066  MB/sec
BenchmarkDataTableSerialization.preAllocateByteArrayNative:·gc.alloc.rate.norm                       avgt    5  1811608.009 ±    0.005    B/op
BenchmarkDataTableSerialization.preAllocateByteArrayNative:·gc.churn.PS_Eden_Space                   avgt    5     3250.704 ±   66.578  MB/sec
BenchmarkDataTableSerialization.preAllocateByteArrayNative:·gc.churn.PS_Eden_Space.norm              avgt    5  1813811.924 ± 2365.646    B/op
BenchmarkDataTableSerialization.preAllocateByteArrayNative:·gc.churn.PS_Survivor_Space               avgt    5        0.540 ±    0.150  MB/sec
BenchmarkDataTableSerialization.preAllocateByteArrayNative:·gc.churn.PS_Survivor_Space.norm          avgt    5      301.285 ±   80.118    B/op
BenchmarkDataTableSerialization.preAllocateByteArrayNative:·gc.count                                 avgt    5     2528.000             counts
BenchmarkDataTableSerialization.preAllocateByteArrayNative:·gc.time                                  avgt    5     1303.000                 ms
BenchmarkDataTableSerialization.preAllocateByteArrayWithBytesCache                                   avgt    5      373.201 ±    8.639   us/op
BenchmarkDataTableSerialization.preAllocateByteArrayWithBytesCache:·gc.alloc.rate                    avgt    5     3547.828 ±   82.702  MB/sec
BenchmarkDataTableSerialization.preAllocateByteArrayWithBytesCache:·gc.alloc.rate.norm               avgt    5  1411608.006 ±    0.005    B/op
BenchmarkDataTableSerialization.preAllocateByteArrayWithBytesCache:·gc.churn.PS_Eden_Space           avgt    5     3555.252 ±   83.120  MB/sec
BenchmarkDataTableSerialization.preAllocateByteArrayWithBytesCache:·gc.churn.PS_Eden_Space.norm      avgt    5  1414562.956 ± 7771.212    B/op
BenchmarkDataTableSerialization.preAllocateByteArrayWithBytesCache:·gc.churn.PS_Survivor_Space       avgt    5        0.551 ±    0.162  MB/sec
BenchmarkDataTableSerialization.preAllocateByteArrayWithBytesCache:·gc.churn.PS_Survivor_Space.norm  avgt    5      219.217 ±   60.046    B/op
BenchmarkDataTableSerialization.preAllocateByteArrayWithBytesCache:·gc.count                         avgt    5     2384.000             counts
BenchmarkDataTableSerialization.preAllocateByteArrayWithBytesCache:·gc.time                          avgt    5     1275.000                 ms
BenchmarkDataTableSerialization.temporaryOutputStream                                                avgt    5      408.674 ±    5.641   us/op
BenchmarkDataTableSerialization.temporaryOutputStream:·gc.alloc.rate                                 avgt    5     3577.824 ±   48.952  MB/sec
BenchmarkDataTableSerialization.temporaryOutputStream:·gc.alloc.rate.norm                            avgt    5  1558808.007 ±    0.005    B/op
BenchmarkDataTableSerialization.temporaryOutputStream:·gc.churn.PS_Eden_Space                        avgt    5     3584.078 ±   41.614  MB/sec
BenchmarkDataTableSerialization.temporaryOutputStream:·gc.churn.PS_Eden_Space.norm                   avgt    5  1561535.360 ± 4793.590    B/op
BenchmarkDataTableSerialization.temporaryOutputStream:·gc.churn.PS_Survivor_Space                    avgt    5        0.500 ±    0.204  MB/sec
BenchmarkDataTableSerialization.temporaryOutputStream:·gc.churn.PS_Survivor_Space.norm               avgt    5      217.817 ±   88.656    B/op
BenchmarkDataTableSerialization.temporaryOutputStream:·gc.count                                      avgt    5     2222.000             counts
BenchmarkDataTableSerialization.temporaryOutputStream:·gc.time                                       avgt    5     1249.000                 ms

Process finished with exit code 0

If my implementation is correct, the benchmark shows that pre-allocating the byte array with a bytes cache (preAllocateByteArrayWithBytesCache) is slightly better than the temporary output stream: it is roughly 10% faster (373.201 us/op vs. 408.674 us/op). It uses extra memory to cache the encoded K/V bytes, but GC time is essentially unchanged (1275 ms vs. 1249 ms). It is easy to see why preAllocateByteArrayNative is the slowest of the three (523.353 us/op): it encodes each K/V twice, whereas the other two methods encode each K/V only once.
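To make the "encode once, cache the bytes" idea concrete, here is a minimal sketch. It is not the code from the linked benchmark commit: the class name `CachedBytesSerializer` and the layout (entry count, then length-prefixed UTF-8 keys and values) are assumptions for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Pinot.
class CachedBytesSerializer {
  // Assumed layout: [entry count][key length][key bytes][value length][value bytes]...
  static byte[] serialize(Map<String, String> metadata) {
    // Encode each key and value exactly once and cache the encoded bytes,
    // summing the total size along the way.
    byte[][] encoded = new byte[metadata.size() * 2][];
    int size = Integer.BYTES; // entry count
    int i = 0;
    for (Map.Entry<String, String> entry : metadata.entrySet()) {
      encoded[i] = entry.getKey().getBytes(StandardCharsets.UTF_8);
      size += Integer.BYTES + encoded[i++].length;
      encoded[i] = entry.getValue().getBytes(StandardCharsets.UTF_8);
      size += Integer.BYTES + encoded[i++].length;
    }

    // Allocate the output once at its exact size and copy the cached bytes in.
    ByteBuffer buffer = ByteBuffer.allocate(size); // big-endian by default
    buffer.putInt(metadata.size());
    for (byte[] bytes : encoded) {
      buffer.putInt(bytes.length);
      buffer.put(bytes);
    }
    return buffer.array();
  }

  public static void main(String[] args) {
    Map<String, String> metadata = new LinkedHashMap<>();
    metadata.put("numDocsScanned", "1000");
    System.out.println("serialized bytes: " + serialize(metadata).length);
  }
}
```

The trade-off is that the cached `byte[]` references stay alive until the copy finishes, in exchange for avoiding both the second encoding pass and the growth copies of a temporary stream.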

I'm not sure whether the change is worth making for only a 10% improvement.