apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0

chunk compression type is hardcoded to passthrough for metric columns #7973

Open ashishkf opened 2 years ago

ashishkf commented 2 years ago

There doesn't seem to be a way to use LZ4 compression for metric columns.

https://github.com/apache/pinot/blob/f2f8e38f9424bcacf3946197c9afcd50ef1d58fa/pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/converter/RealtimeSegmentConverter.java#L100

richardstartin commented 2 years ago

This makes sense the way it is for a couple of reasons: general-purpose compression (LZ4, Snappy) achieves very little on typical numeric metric data, and decompression adds overhead at query time.

These two factors combine to make a less than compelling case for general-purpose compression of metric columns.

There are numerous encoding techniques which could be explored for metric columns in the future, which tend to produce better space reductions and are faster to decode.

If you have a metric column which you expect to be compressible because it has lots of duplicates, it would be worth experimenting with using a dictionary column instead.
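For reference, dictionary encoding can be requested per column via the table config's `fieldConfigList`. A minimal sketch (the column name `metricValue` is hypothetical; `fieldConfigList` and `encodingType` are the relevant Pinot config keys):

```json
{
  "fieldConfigList": [
    {
      "name": "metricValue",
      "encodingType": "DICTIONARY"
    }
  ]
}
```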

Jackie-Jiang commented 2 years ago

IMO this is a bug. For metrics, we use PASS_THROUGH by default, but should allow overriding it if it is explicitly configured in the FieldConfig

richardstartin commented 2 years ago

@Jackie-Jiang let's make it configurable when there are encoding modes which make sense for numeric data. LZ4 and Snappy aren't good options for numeric data, and are dominated by dictionary encoding.

richardstartin commented 2 years ago

Here are the compressed sizes of 8KB (1024 doubles) of different distributions/patterns under Snappy and LZ4. There are encodings which can be introduced to reduce the size of metric columns (e.g. xor or delta encoding), but making it possible to compress metric columns with general-purpose compression algorithms isn't in the user's interest.

| Compression | Distribution | Compressed Size (KB) |
|---|---|---|
| Uncompressed | integer increments | 8.00 |
| LZ4 | integer increments | 4.09 |
| Snappy | integer increments | 4.02 |
| Uncompressed | noisy increments | 8.00 |
| LZ4 | noisy increments | 8.03 |
| Snappy | noisy increments | 8.00 |
| Uncompressed | sinusoidal | 8.00 |
| LZ4 | sinusoidal | 8.03 |
| Snappy | sinusoidal | 8.00 |
| Uncompressed | normal(0,1) | 8.00 |
| LZ4 | normal(0,1) | 8.03 |
| Snappy | normal(0,1) | 8.00 |
| Uncompressed | exp(0.999) | 8.00 |
| LZ4 | exp(0.999) | 7.23 |
| Snappy | exp(0.999) | 7.16 |
ashishkf commented 2 years ago

I think in many cases the metric values don't change much - for example, a CPU usage gauge will show only slight variations for a given metric series over a small interval (say, 10 minutes). We have seen good compression ratios: in our data (storing Kubernetes metrics), we are getting 1 byte per row instead of the 8 allocated for the `double` column. As a workaround, we marked the value column as a dimension so that LZ4 compression applies and we get the savings.
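The workaround described above corresponds to something like the following `fieldConfigList` entry, once the column is declared as a dimension (a sketch, not a verified config; the column name `value` is assumed):

```json
{
  "fieldConfigList": [
    {
      "name": "value",
      "encodingType": "RAW",
      "compressionCodec": "LZ4"
    }
  ]
}
```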

richardstartin commented 2 years ago

The problem is you won't get much in the way of savings from LZ4: those CPU readings can be almost identical, but with a little bit of noise the data is difficult for a text-oriented algorithm like LZ4 to compress. The XOR of any two adjacent values will typically have very few set bits, so it can yield high compression ratios, perhaps even 8x. Implementing codecs such as xor or delta encoding is a feature that has been discussed before, would not be very difficult, and would solve your problem in a way that making metric columns compressible would not.
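To illustrate the XOR point: for two nearly identical doubles, the XOR of their bit patterns has a long run of leading zeros and only a few set bits, which is what a Gorilla-style XOR codec exploits by storing just the short non-zero tail. A minimal sketch (not Pinot code; the sample values are made up):

```java
public class XorEncodingDemo {

    // Number of leading zero bits in the XOR of two doubles' bit patterns.
    static int leadingZerosOfXor(double prev, double curr) {
        long xor = Double.doubleToLongBits(prev) ^ Double.doubleToLongBits(curr);
        return Long.numberOfLeadingZeros(xor);
    }

    public static void main(String[] args) {
        double prev = 72.5;         // e.g. a CPU usage reading
        double curr = 72.5 + 1e-6;  // nearly identical: tiny noise
        long xor = Double.doubleToLongBits(prev) ^ Double.doubleToLongBits(curr);
        // Only the low mantissa bits differ, so most of the 64 bits are zero;
        // an XOR codec need only store the short non-zero suffix per value.
        System.out.println("leading zeros: " + Long.numberOfLeadingZeros(xor));
        System.out.println("set bits:      " + Long.bitCount(xor));
    }
}
```

LZ4 operates on byte sequences and finds repeated substrings, so it gains nothing from values whose bit patterns are close but not byte-identical; a bit-level XOR codec captures exactly that similarity.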