Open snleee opened 1 year ago
One main difference between the current fixed-length and var-length string dictionaries is that the fixed-length one has to scan for padding bytes, which the var-length dictionary does not need to do. We should run this benchmark on strings with different length distributions; the var-length dictionary should outperform the fixed-length one when the lengths are skewed (max length >> min length).
I think we also need to run the benchmark while simulating the case where there is not enough memory to hold the entire dictionary and data has to be read from disk.
https://github.com/apache/pinot/pull/10007
From the test above, varLengthStringDictionary outperforms fixedSizeStringDictionary for dictionary.indexOf() calls. This is not intuitive, because varLengthDictionary has one extra level of indirection. We should investigate and understand why this is the case.
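To make the two costs discussed in this thread concrete, here is a minimal, self-contained sketch. It is not Pinot's actual implementation: the class name DictionaryLayoutSketch, all method names, the zero padding byte, and the in-memory byte[] layout are assumptions chosen for illustration. The fixed-length reader has to scan each slot for padding to find where the value ends, while the var-length reader first consults an offsets array (the extra level of indirection) and then knows the exact length.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.function.IntFunction;

public class DictionaryLayoutSketch {
    // Assumed padding byte for this sketch only.
    static final byte PAD = 0;

    // Fixed-length layout: each value occupies exactly `width` bytes, zero-padded.
    // `width` must be at least the longest value's UTF-8 byte length.
    public static byte[] buildFixed(String[] sorted, int width) {
        byte[] buf = new byte[sorted.length * width];
        for (int i = 0; i < sorted.length; i++) {
            byte[] b = sorted[i].getBytes(StandardCharsets.UTF_8);
            System.arraycopy(b, 0, buf, i * width, b.length);
        }
        return buf;
    }

    // Reading a fixed-length entry requires scanning for the first padding byte.
    public static String readFixed(byte[] buf, int width, int id) {
        int start = id * width;
        int len = 0;
        while (len < width && buf[start + len] != PAD) {
            len++;
        }
        return new String(buf, start, len, StandardCharsets.UTF_8);
    }

    // Var-length layout: offsets[i] is where value i starts; offsets[n] is the end.
    public static int[] buildOffsets(String[] sorted) {
        int[] offsets = new int[sorted.length + 1];
        for (int i = 0; i < sorted.length; i++) {
            offsets[i + 1] = offsets[i] + sorted[i].getBytes(StandardCharsets.UTF_8).length;
        }
        return offsets;
    }

    // Values are stored back to back with no padding.
    public static byte[] buildVarData(String[] sorted) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String s : sorted) {
            byte[] b = s.getBytes(StandardCharsets.UTF_8);
            out.write(b, 0, b.length);
        }
        return out.toByteArray();
    }

    // Reading a var-length entry costs one offsets lookup, but needs no scan.
    public static String readVar(byte[] data, int[] offsets, int id) {
        int start = offsets[id];
        return new String(data, start, offsets[id + 1] - start, StandardCharsets.UTF_8);
    }

    // Binary search over a sorted dictionary; returns the id, or -(insertionPoint + 1).
    public static int indexOf(int size, IntFunction<String> reader, String key) {
        int lo = 0;
        int hi = size - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = reader.apply(mid).compareTo(key);
            if (cmp < 0) {
                lo = mid + 1;
            } else if (cmp > 0) {
                hi = mid - 1;
            } else {
                return mid;
            }
        }
        return -(lo + 1);
    }
}
```

In this sketch every probe of the binary search in the fixed-length layout scans a slot until it finds padding, so short values in wide slots pay a per-read cost that the var-length layout avoids. Whether that accounts for the benchmark result above is not established here; it is only one hypothesis the investigation could check.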