apache/pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0

Reduce heap usage when building realtime segments #4036

Open mcvsubbu opened 5 years ago

mcvsubbu commented 5 years ago

Reduce heap usage while building completed segments. Currently, the segment builder reads incoming data row by row and builds dictionaries in a hash table before translating them to the on-disk dictionary format. We can bypass these steps, since we already have the segment in columnar format (realtime consumers ingest rows but store them in columnar format for serving queries). An initial prototype has shown a significant reduction in heap usage during segment builds. If we reduce heap usage (better yet, move completely to off-heap segment completion), more segments can be packed into a single host, saving hardware cost. If higher latency can be tolerated, these hosts could use SSDs and map off-heap memory from files (Pinot already provides primitives for doing this).

Prototype: https://github.com/mcvsubbu/incubator-pinot/commit/c866d9130a5ceddb7a25c3235d605e503e91e13c
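
The idea, as a minimal sketch (all names below, e.g. RealtimeColumn, are hypothetical stand-ins rather than Pinot's actual classes, and it assumes the consuming segment keeps each column as dictionary ids plus a per-column dictionary): the builder can sort the existing dictionary and remap ids in one pass per column, never iterating rows and never re-hashing values into a HashMap.

```java
import java.util.Arrays;

// Hypothetical sketch of the columnar build described above; not Pinot's API.
public class ColumnarBuildSketch {

  // Stand-in for one column of the consuming (mutable) segment: values are
  // already stored column-wise as dict ids plus an (unsorted) dictionary.
  static final class RealtimeColumn {
    final String name;
    final int[] dictIds;        // forward index: one dict id per doc
    final String[] dictionary;  // unique values, indexed by dict id

    RealtimeColumn(String name, int[] dictIds, String[] dictionary) {
      this.name = name;
      this.dictIds = dictIds;
      this.dictionary = dictionary;
    }
  }

  // Columnar path: no row iteration and no hash table. Sort the existing
  // dictionary once, remap old dict ids to sorted positions, and rewrite
  // the forward index in a single pass over primitive arrays.
  static void buildColumn(RealtimeColumn col) {
    int cardinality = col.dictionary.length;

    // Sort dict ids by their values (the on-disk dictionary is sorted).
    Integer[] order = new Integer[cardinality];
    for (int i = 0; i < cardinality; i++) {
      order[i] = i;
    }
    Arrays.sort(order, (a, b) -> col.dictionary[a].compareTo(col.dictionary[b]));

    // oldDictId -> position in the sorted (on-disk) dictionary.
    int[] remap = new int[cardinality];
    for (int sortedPos = 0; sortedPos < cardinality; sortedPos++) {
      remap[order[sortedPos]] = sortedPos;
    }

    // One pass over the forward index; with off-heap or mmap'd buffers
    // this loop need not allocate anything on the Java heap.
    int[] onDiskForwardIndex = new int[col.dictIds.length];
    for (int docId = 0; docId < col.dictIds.length; docId++) {
      onDiskForwardIndex[docId] = remap[col.dictIds[docId]];
    }
    // ... write the sorted dictionary and onDiskForwardIndex to disk ...
  }

  public static void main(String[] args) {
    RealtimeColumn city = new RealtimeColumn("city",
        new int[]{0, 1, 0, 2}, new String[]{"oslo", "lima", "cairo"});
    buildColumn(city);  // builds the column without materializing any rows
  }
}
```

The per-column working set here is two int arrays sized by cardinality and one sized by doc count, versus a hash map of boxed values per column in the row-by-row path.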

kishoreg commented 5 years ago

Very nice. Which data structures contributed to the heap usage? Is it mostly the hashmap dictionary?

mcvsubbu commented 5 years ago

I ran the prototype code on sample segments from four use cases in our cluster. In two of the use cases there was about a 30% reduction in heap usage. In the third use case there was a 7% increase in heap usage, but I believe we should be able to find ways to fix this. In the fourth use case, one round of GC was triggered during the segment build with the current code, but no GC happened when the segment was built using columnar storage. So we know there is a win for this use case, just not how much and where.

The measurements were made by writing data from a realtime segment into a MutableSegment, and then running the segment builder with and without the columnar segment build prototype.

The commands 'jstat -gcutil' and 'jmap -histo' were used to measure the heap usage. The JVM that built the segment was started with -Xmx32G -Xms32G -XX:NewSize=24g so that no GC would run during the build.
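
Concretely, the setup amounts to something like this (the `<pid>` placeholder and the elided application arguments are mine; the flags and commands are the ones named above):

```
# JVM for the segment build, sized so that no GC runs during the build
java -Xmx32G -Xms32G -XX:NewSize=24g ...

# GC counters, to confirm no collection happened mid-build
jstat -gcutil <pid>

# Per-class heap histogram, diffed with and without the columnar build
jmap -histo <pid>
```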

Here are the results for the three use cases:

Savings (absolute; negative values denote a heap usage increase):

                  Case 1    Case 2    Case 3
[I (int[])          173M       82M      -39M
[C (char[])          24M      147M      109M
[B (byte[])          23M      118M      183M
[S (short[])          6k        4k        5k
[J (long[])           5M       72M        7M
[Z (boolean[])        6M         -         -
Integer               3M         -         -
String               22M      129M       77M
Long                -12M     -988M       96M
Object               14M         -         -
KeyIterator           1M         -         -

Total Saved         260M     -434M      433M
Percentage           34%      -13%       25%

These are the savings for one segment build. Typically a server has a (configurable) number of parallel segment builds going on.

The segment build time also improved, although the improvement was not quantified.

Description of use cases:

Case 1: 30 dimension columns, no metric columns.
Case 2: 19 dimension columns, 5 metric columns, 11 no-dictionary columns of Long type.
Case 3: 3 dimension columns.

mcvsubbu commented 5 years ago

@kishoreg I could not format the table correctly, but please take a look.

kishoreg commented 5 years ago

I am curious about the heap usage if we use the off-heap dictionary you built for realtime in segmentGenerator. Is that easy to try?

mcvsubbu commented 5 years ago

The prototype that I have is from before we moved to Apache, so the changes are against com.linkedin classes. I am not actively working on this right now, but I created the issue so that anyone interested can pick it up. We will get to it at some point.

That aside, I think the off-heap dictionary for realtime will not help reduce heap usage here. Most of the heap usage is in reading (from a row-oriented format) and writing to a segment (column format), IIRC.
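
To make that concrete, here is a rough sketch of the row-oriented read path (hypothetical names again, not Pinot's real reader interface): each document is materialized as a boxed row object and every value is re-hashed into a per-column hash map, and that transient garbage is what the columnar path above avoids.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the row-oriented read path referred to above.
public class RowPathSketch {

  // Stand-in for a source that can only hand out whole rows.
  interface RowReader {
    int numDocs();
    Map<String, Object> getRow(int docId);  // materializes one document
  }

  // Row-by-row build: one boxed row per document, every value re-hashed
  // into per-column hash maps -- the kind of transient int[]/char[]/String
  // garbage that dominates the per-class savings in the table above.
  static void buildRowByRow(RowReader reader, String[] columnNames) {
    Map<String, Map<Object, Integer>> dictionaries = new HashMap<>();
    for (String col : columnNames) {
      dictionaries.put(col, new HashMap<>());
    }
    for (int docId = 0; docId < reader.numDocs(); docId++) {
      Map<String, Object> row = reader.getRow(docId);   // per-doc allocation
      for (String col : columnNames) {
        Map<Object, Integer> dict = dictionaries.get(col);
        dict.putIfAbsent(row.get(col), dict.size());    // boxing + hashing
      }
    }
    // ... sort each dictionary and write the on-disk formats ...
  }
}
```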