Benchmark setColumn() in DataTableBuilder and write all values one by one instead of using rowId and colId to position if need be

mqliang commented 3 years ago

As pointed by a TODO in DataTableBuilder: https://github.com/apache/incubator-pinot/blob/e892cb2942f1f41ac7056d977eeca0e5b22897d0/pinot-core/src/main/java/org/apache/pinot/core/common/datatable/DataTableBuilder.java#L77

Currently, setColumn() use rowId&colId to first locate the offset of ByteBuffer, the write the actual value to the offset. This allowed us to set the value for a given colId, but have a lot of overhead of calling ByteBuffer.position(). In practice, we often build a data table row in a bulk way, which means we can write all values of a row one by one, without calling ByteBuffer.position() for each column.

We need to benchmark those two approaches (it's possible that the overhead of calling ByteBuffer.position() is negligible). If the bulk way is better, will send a PR to address it.

mqliang commented 3 years ago

I write a bench mark here: https://github.com/mqliang/pinot/commit/a32a61aad5dfa6b6c4a09064c75926b00495cd3a

The benchmark compares three ways to build a data table:

BenchmarkDataTableRowIdColIdBuildInOrder: for each row, call dataTableBuilder.setColumn(colId, value); to set the value for each column. However, set value of columns in order of: 1st col, 2nd col, 3nd col...
BenchmarkDataTableRowIdColIdBuildRandomOrder: call dataTableBuilder.setColumn(colId, value); to set the value for each column. However, set value of columns in ramdom order, e.g. in order of : 11st col, 20st col, 3nd col, 5st col...
BenchmarkDataTableRowBulkBuild: for each row, first put values of all column into a Object[] array, then build the row in bulk -- write all values one by one, without calling ByteBuffer.position() at all.

The result of building a table of 100 rows is:

# JMH version: 1.26
# VM version: JDK 1.8.0_282, OpenJDK 64-Bit Server VM, 25.282-b08
# VM invoker: /Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/jre/bin/java
# VM options: -javaagent:/Users/mqliang/Library/Application Support/JetBrains/Toolbox/apps/IDEA-U/ch-0/203.7717.56/IntelliJ IDEA.app/Contents/lib/idea_rt.jar=56968:/Users/mqliang/Library/Application Support/JetBrains/Toolbox/apps/IDEA-U/ch-0/203.7717.56/IntelliJ IDEA.app/Contents/bin -Dfile.encoding=UTF-8
# Warmup: 1 iterations, 10 s each
# Measurement: 5 iterations, 30 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: org.apache.pinot.perf.BenchmarkDataTableBulkBuild.BenchmarkDataTableRowBulkBuild

# Run progress: 0.00% complete, ETA 00:08:00
# Fork: 1 of 1
# Warmup Iteration   1: 198.463 us/op
Iteration   1: 157.345 us/op
Iteration   2: 157.778 us/op
Iteration   3: 155.212 us/op
Iteration   4: 154.870 us/op
Iteration   5: 153.947 us/op

Result "org.apache.pinot.perf.BenchmarkDataTableBulkBuild.BenchmarkDataTableRowBulkBuild":
  155.830 ±(99.9%) 6.368 us/op [Average]
  (min, avg, max) = (153.947, 155.830, 157.778), stdev = 1.654
  CI (99.9%): [149.462, 162.198] (assumes normal distribution)

# JMH version: 1.26
# VM version: JDK 1.8.0_282, OpenJDK 64-Bit Server VM, 25.282-b08
# VM invoker: /Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/jre/bin/java
# VM options: -javaagent:/Users/mqliang/Library/Application Support/JetBrains/Toolbox/apps/IDEA-U/ch-0/203.7717.56/IntelliJ IDEA.app/Contents/lib/idea_rt.jar=56968:/Users/mqliang/Library/Application Support/JetBrains/Toolbox/apps/IDEA-U/ch-0/203.7717.56/IntelliJ IDEA.app/Contents/bin -Dfile.encoding=UTF-8
# Warmup: 1 iterations, 10 s each
# Measurement: 5 iterations, 30 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: org.apache.pinot.perf.BenchmarkDataTableBulkBuild.BenchmarkDataTableRowIdColIdBuildInOrder

# Run progress: 33.33% complete, ETA 00:05:21
# Fork: 1 of 1
# Warmup Iteration   1: 193.779 us/op
Iteration   1: 150.726 us/op
Iteration   2: 150.649 us/op
Iteration   3: 151.587 us/op
Iteration   4: 151.765 us/op
Iteration   5: 151.749 us/op

Result "org.apache.pinot.perf.BenchmarkDataTableBulkBuild.BenchmarkDataTableRowIdColIdBuildInOrder":
  151.295 ±(99.9%) 2.155 us/op [Average]
  (min, avg, max) = (150.649, 151.295, 151.765), stdev = 0.560
  CI (99.9%): [149.140, 153.451] (assumes normal distribution)

# JMH version: 1.26
# VM version: JDK 1.8.0_282, OpenJDK 64-Bit Server VM, 25.282-b08
# VM invoker: /Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/jre/bin/java
# VM options: -javaagent:/Users/mqliang/Library/Application Support/JetBrains/Toolbox/apps/IDEA-U/ch-0/203.7717.56/IntelliJ IDEA.app/Contents/lib/idea_rt.jar=56968:/Users/mqliang/Library/Application Support/JetBrains/Toolbox/apps/IDEA-U/ch-0/203.7717.56/IntelliJ IDEA.app/Contents/bin -Dfile.encoding=UTF-8
# Warmup: 1 iterations, 10 s each
# Measurement: 5 iterations, 30 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: org.apache.pinot.perf.BenchmarkDataTableBulkBuild.BenchmarkDataTableRowIdColIdBuildRandomOrder

# Run progress: 66.67% complete, ETA 00:02:40
# Fork: 1 of 1
# Warmup Iteration   1: 216.635 us/op
Iteration   1: 175.108 us/op
Iteration   2: 174.428 us/op
Iteration   3: 178.706 us/op
Iteration   4: 180.284 us/op
Iteration   5: 178.219 us/op

Result "org.apache.pinot.perf.BenchmarkDataTableBulkBuild.BenchmarkDataTableRowIdColIdBuildRandomOrder":
  177.349 ±(99.9%) 9.581 us/op [Average]
  (min, avg, max) = (174.428, 177.349, 180.284), stdev = 2.488
  CI (99.9%): [167.768, 186.930] (assumes normal distribution)

# Run complete. Total time: 00:08:02

REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
experiments, perform baseline and negative tests that provide experimental control, make sure
the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.

Benchmark                                                                 Mode  Cnt    Score   Error  Units
BenchmarkDataTableBulkBuild.BenchmarkDataTableRowBulkBuild                avgt    5  155.830 ± 6.368  us/op
BenchmarkDataTableBulkBuild.BenchmarkDataTableRowIdColIdBuildInOrder      avgt    5  151.295 ± 2.155  us/op
BenchmarkDataTableBulkBuild.BenchmarkDataTableRowIdColIdBuildRandomOrder  avgt    5  177.349 ± 9.581  us/op

Process finished with exit code 0

The result shows there is not significant difference between BenchmarkDataTableRowBulkBuild and BenchmarkDataTableRowIdColIdBuildInOrder. Which means: even if we use setColumn(colId, value) to build datatable, as long as we set values for columns in increasing order, not randomly, the overhead of calling ByteBuffer.position() is negligible.

Currently, our code base set values for columns in increasing order, so there is no need to address the TODO from the point of view of improving performance. But from the code cleaning point of view, we we can provide such a setColumnValuesInBulk() method, and change all current datatable building code to use setColumnValuesInBulk(). This way, our code is more self-explainable -- setting column value in bulk (in increasing order) is better than setting in random order. However, all setColumn(colId, value) methods should be kept, since there may be some circumstance we need to set a value for an arbitrary row/col.

cc @Jackie-Jiang

mqliang commented 3 years ago

@Jackie-Jiang It turns out there is a bug in the benchmark: for BenchmarkDataTableRowIdColIdBuildRandomOrder(), I also time the function of shuffling column ids. After moving the shuffling logic out of the function (see fix: https://github.com/mqliang/incubator-pinot/commit/c19b51947a5df804a3c819725f037fb1d6388062), the result becomes:

benchmark result of 100 rows table:

Benchmark                                                                 Mode  Cnt    Score   Error  Units
BenchmarkDataTableBulkBuild.BenchmarkDataTableRowBulkBuild                avgt    5  147.196 ± 1.683  us/op
BenchmarkDataTableBulkBuild.BenchmarkDataTableRowIdColIdBuildInOrder      avgt    5  146.157 ± 5.207  us/op
BenchmarkDataTableBulkBuild.BenchmarkDataTableRowIdColIdBuildRandomOrder  avgt    5  145.171 ± 6.464  us/op

benchmark result of 1000 rows table:

Benchmark                                                                 Mode  Cnt     Score    Error  Units
BenchmarkDataTableBulkBuild.BenchmarkDataTableRowBulkBuild                avgt    5  1612.386 ± 51.587  us/op
BenchmarkDataTableBulkBuild.BenchmarkDataTableRowIdColIdBuildInOrder      avgt    5  1622.108 ± 65.567  us/op
BenchmarkDataTableBulkBuild.BenchmarkDataTableRowIdColIdBuildRandomOrder  avgt    5  1610.517 ± 64.187  us/op

So, the overhead of calling ByteBuffer.position() is actually negligible.

@Jackie-Jiang Can you please take a look at my benchmark code? If the benchmark code is correct, we don't need to address the TODO. And I will send a PR to remove the TODO and write comments stating that such a optimization is unnecessary.

cc @siddharthteotia @mcvsubbu

Jackie-Jiang commented 3 years ago

This is awesome findings. Small suggestion to the benchmark is to return a value extracted from the data table to ensure the code is not optimized away by the JIT.

mqliang commented 2 years ago

With "return a value extracted from the data table to ensure the code is not optimized away by the JIT", code: https://github.com/mqliang/pinot/commit/a32a61aad5dfa6b6c4a09064c75926b00495cd3a#diff-ea557c4916a39b7358c4acd1d5f3e6c5677bf454fec3e2a3db9df65230931702

Benchmark                                                                 Mode  Cnt    Score    Error  Units
BenchmarkDataTableBulkBuild.BenchmarkDataTableRowBulkBuild                avgt    5  176.261 ± 14.817  us/op
BenchmarkDataTableBulkBuild.BenchmarkDataTableRowIdColIdBuildInOrder      avgt    5  179.805 ± 11.924  us/op
BenchmarkDataTableBulkBuild.BenchmarkDataTableRowIdColIdBuildRandomOrder  avgt    5  244.589 ± 10.396  us/op

From the benchmark result, the function of ByteBuffer.position() itself is very lightweight. So as long as caller of DataTableBuilder write all values one by one, there is no big difference. However if caller write values in a random order, the performance will decrease ---- the overhead is caused by "write buffer/cache line disruption", not by the ByteBuffer.position() function itself IMO.

I checked our code, all the caller write values one by one (columnId is monotonically increasing by 1 during building a DataTable), there is no use case of filling a DataTable in random order or replacing a value in DataTable with a given columnId. So here is my suggestion: replacing all the public void setColumn(int colId, Type value) functions as public void append(Type value).

The only challenge now is: for String type value, we use dictionary encoding and now DataTableV3 dictionary is per-column wise -- each column has a String->Int dictionary. So to append a String value, we must know the columnId (to lookup the dictionary for that column). This issue can be resolved after https://github.com/apache/pinot/pull/7167, where all columns share a common dictionary to encoding Strinng values.

apache / pinot

Benchmark setColumn() in DataTableBuilder and write all values one by one instead of using rowId and colId to position if need be #6720