apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0
5.3k stars 1.23k forks source link

varLengthStringDictionary outperforms fixedLengthStringDictionary for `dictionary.indexOf()` #10008

Open snleee opened 1 year ago

snleee commented 1 year ago

https://github.com/apache/pinot/pull/10007

From the test above, varLengthStringDictionary outperforms fixedSizeStringDictionary for dictionary.indexOf() calls and this is not intuitive because varLengthDictionary has 1 extra level of indirection. We can investigate and understand why this is the case.

BenchmarkStringVarLengthDictionary.fixedStringDictionaryIndexOf      avgt    5  2802.320 ± 308.715  ms/op
BenchmarkStringVarLengthDictionary.fixedStringDictionaryGet        avgt    5   213.743 ±  15.349  ms/op
BenchmarkStringVarLengthDictionary.varLengthStringDictionaryIndexOf  avgt    5  2079.673 ±  78.193  ms/op
BenchmarkStringVarLengthDictionary.varLengthStringDictionaryGet    avgt    5   227.187 ±  13.771  ms/op
Jackie-Jiang commented 1 year ago

One main difference between current fixed-length and var-length string dictionary is that for fixed-length one we have to scan for the padding bytes, which is not required for the var-length string dictionary. We should run this benchmark on strings with different length distribution, and var-length one should outperform fixed-length when the length is skewed (max length >> min length).

snleee commented 1 year ago

I think that we also need to run the benchmark after simulating the case where the memory is not enough to keep the entire dictionary and need to read data from the disk.