apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0
5.43k stars 1.27k forks source link

Support variable length Offline Dictionary Indexes for bytes, strings and maps to save on storage #4317

Closed buchireddy closed 5 years ago

buchireddy commented 5 years ago

What? Currently, the dictionary index for offline segments for bytes and string types uses Fixed-size storage for each value (by picking the size of the max element and padding the smaller elements with "0"). See org.apache.pinot.core.io.util.FixedByteValueReaderWriter The idea is to avoid padding and support storing byte arrays/strings/maps of different length while not slowing down the lookups much (obviously).

Why? Fixed size based storage is good for fast lookups but it's very inefficient for the storage. For example, if we have a String column and the size of the biggest string value is 100 bytes but the average size is only 10 bytes, there is about 90% padding. The same thing applies for byte[], maps, etc.

How? Currently, FixedByteValueReaderWriter only writes the sorted values in the buffer directly starting from "0" offset and at fixed lengths. So, first Int is at index "0" and the second one at index "4", etc. There is no additional metadata needed in the buffer. The idea is to maintain the index of each element at the beginning of the buffer so that the element sizes needn't be fixed. When looking up an element from the buffer, we first get it's offset and then read the actual element. This means we do two reads from the buffer (first int offset and then the actual element) but the offset read should be fast enough so it shouldn't slow down the overall operation that much.

Few things to note:

Thanks @kishoreg for pointing this problem and brainstorming.

P.S: This was originally tracked in https://github.com/winedepot/pinot/issues/24

buchireddy commented 5 years ago

I've implemented a solution based on the approached discussed in the description and did some benchmarks with String dictionary to see the latency with VariableLength dictionary and the storage improvements it brings compared to the FixedLength dictionary. Here are the results and observations.

Time taken to lookup 10M values:

TimeVsDictionarySizesCharts

Observation: As can be clearly seen in the graph, as strings are getting bigger, the variable lengh dictionary is giving much better lookup latencies. When the dictionary size (cardinality) is >1M and the string sizes are small (<100), FixedLength dictionary has better lookup latencies though.

Storage requirements of VarLength dictionary: Since the variable length dictionary doesn't do any padding, it saves the space for all the cases where the strings in the dictionary aren't of equal length. Hence, this graph plots the % storage savings with VarLength dictionary instead of absolute values.

DictSizeVsStorageSavingsChart

Observation: If the strings in the dictionary are of different lengths, VarLength dictionary saves 40% space compared to the fixed length dictionary.

Again thanks @kishoreg for all the guidance on this.

P.S: All raw values from the benchmarking are available at https://docs.google.com/spreadsheets/d/1iOLyhD4AUZw3JsdOkmH6h36KWYalVeUBIVcty6Pnv0E/edit?usp=sharing so feel free to copy/comment on the results.

kishoreg commented 5 years ago

Comment from @mcvsubbu @buchireddy did you consider the use of MutableOffheapByteArrayStore? I am not able to readily say whether your implementation is based on that or is something parallel.

kishoreg commented 5 years ago

From @buchireddy @mcvsubbu I did look at MutableOffheapByteArrayStore and took some concepts from it while implementing VarLengthBytesValueReaderWriter too. However, I haven't used the former fully because of the following reasons:

In order to support a seamless transition from fixed length to variable length offline dictionary, we need a special magic header to be added at the beginning, which needs some changes to MutableOffheapByteArrayStore. MutableOffheapByteArrayStore is handling the buffer expansions, multiple buffers logic because it needs to be mutable but the immutable one doesn't need all those. So, it's a trade-off b/w simple code vs avoiding the code duplication (though we anyways need a new class for variable length dictionary class implementation). I'm open for debate and for further brainstorming.

P.S: This is my first time going through Pinot code so please let me know if I'm missing any aspects here.

kishoreg commented 5 years ago

From @mcvsubbu Mutable byte array store can be modified to have a header, no problem. It is currently used as a volatile store for comsuming segments, so changing it to include a header is not backward incompatible.

Why not modify it to have the header and also derive an unmutable class from it to be used during segment generation?

kishoreg commented 5 years ago

@mcvsubbu I had not looked into the details of MutableOffheapByteArrayStore. I do see some similarity between the two implementations. I also see some similarity with VarByteChunkWriter and VarByteChunkReader.

Looks like we have 3 classes of use cases

  1. Multiple Buffers, maintain offset index for each entry in the buffer (MutableByteArrayStore)
  2. Multiple Buffers, maintain offset index only at chunk level (Scan within the chunk) - VarByteChunkReader/Writer
  3. Single Buffer, maintain offset for each entry in the buffer (This is what @buchireddy has implemented)

Ideally, we should have a VarByte Indexed/Unindexed Reader/Writer with no notion of expansion. Multiple buffer implementations should just be wrappers on top of these things.

My suggestion is to implement the VarByteIndexedReaderWriter since we don't have such a primitive (This should look similar to the Buffer class within MutableOffHeapByteStore). In another PR, we can change MutableOffHeapByteStore to use the VarByteIndexedReaderWriter. Similarly, VarByteChunkReaderWriter should use VarByteUnIndexedReaderWriter implementation.

This will also allow us to support NoDictionary for BYTES data-type in real-time which is not currently supported.

Thoughts?

mcvsubbu commented 5 years ago

I am sure I am missing something, but I don't see why we should try to combine VarByteChunkReader into this logic. If we can take bay steps, at least we can avoid duplication between MutableOffheapByteArrayStore and whatever @buchireddy has implemented (haven't seen it yet, please point me to a PR).

We can then look into how other classes can fold into a single base class.

kishoreg commented 5 years ago

Take a look at the code within MutableOffHeapByteArrayStore.Buffer. That's exactly along the lines of what we need for a Var length Dictionary. But that's tightly coupled with MutableOffHeapByteArrayStore (e.g. it has the startIndex, which should have been managed within MutableOffHeapByteArrayStore).

There are few more things such as in real-time we don't know the number of entries upfront but in case of the immutable dictionary, we know the exact number of entries, size of each entry and total size. There might be some additional optimizations we can do based on these things. This is a classic case for cuckoo hashing as well.

One possibility is to extract out the MutableOffHeapByteArrayStore.Buffer inner class as VarByteReaderWriter.

Let's get this feature in without touching any of the existing functionalities. I can do the cleanup in another PR.

buchireddy commented 5 years ago

@mcvsubbu Here is the PR for this issue. https://github.com/apache/incubator-pinot/pull/4321

Ideally, we should have a VarByte Indexed/Unindexed Reader/Writer with no notion of expansion. Multiple buffer implementations should just be wrappers on top of these things.

@kishoreg For my own understanding, where do we need Unindexed Reader/Writer? Also, if the entries are variable length, you have to store the length for sure, if not index. And, the storage footprint might be on same lines whether we store index or length field because they're both going to be integers.

mcvsubbu commented 5 years ago

@kishoreg once you get the feature in, if you move to use the existing Buffer class, won't it be incompatbile? Isn't it better to just make the Buffer class public, do all the development using Buffer, and let the IDE pull out the Buffer class for free? That way you keep compatiblity across checkins

kishoreg commented 5 years ago

@mcvsubbu Here is the PR for this issue. #4321

Ideally, we should have a VarByte Indexed/Unindexed Reader/Writer with no notion of expansion. Multiple buffer implementations should just be wrappers on top of these things.

@kishoreg For my own understanding, where do we need Unindexed Reader/Writer? Also, if the entries are variable length, you have to store the length for sure, if not index. And, the storage footprint might be on same lines whether we store index or length field because they're both going to be integers.

You are right, we dont need UnIndexedReader/Writer. I was under the impression that we store (length, byte[]) (length, byte[]) in VarByteChunkWriter but it's similar what you have done for VarByteDictionary.

kishoreg commented 5 years ago

@kishoreg once you get the feature in, if you move to use the existing Buffer class, won't it be incompatbile? Isn't it better to just make the Buffer class public, do all the development using Buffer, and let the IDE pull out the Buffer class for free? That way you keep compatiblity across checkins

@mcvsubbu yes that's a possibility. Is that something you can work on? Might be easier for you to pull it out since you know the context. Once you pull it out @buchireddy can update this PR.

buchireddy commented 5 years ago

Merged this feature as part of https://github.com/apache/incubator-pinot/pull/4321 and hence closing the issue.