apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0

chunk compression type is hardcoded to passthrough for metric columns #7973

Open ashishkf opened 2 years ago

ashishkf commented 2 years ago

There doesn't seem to be a way to use LZ4 compression for metric columns.

https://github.com/apache/pinot/blob/f2f8e38f9424bcacf3946197c9afcd50ef1d58fa/pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/converter/RealtimeSegmentConverter.java#L100

richardstartin commented 2 years ago

This makes sense the way it is for a couple of reasons: general-purpose compression (LZ4, Snappy) achieves very little on typical numeric metric data, and decompression adds overhead at query time.

These two factors combine to make a less than compelling case for general-purpose compression of metric columns.

There are numerous encoding techniques which could be explored for metric columns in the future, which tend to produce better space reductions and are faster to decode.

If you have a metric column which you expect to be compressible because it has lots of duplicates, it would be worth experimenting with using a dictionary column instead.
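For reference, dictionary encoding can be requested per column via the table config's `fieldConfigList`. A minimal sketch (the column name `metricValue` is hypothetical; `fieldConfigList` and `encodingType` are the relevant Pinot config keys):

```json
{
  "fieldConfigList": [
    {
      "name": "metricValue",
      "encodingType": "DICTIONARY"
    }
  ]
}
```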

Jackie-Jiang commented 2 years ago

IMO this is a bug. For metrics, we use PASS_THROUGH by default, but should allow overriding it if it is explicitly configured in the FieldConfig

richardstartin commented 2 years ago

@Jackie-Jiang let's make it configurable when there are encoding modes which make sense for numeric data. LZ4 and Snappy aren't good options for numeric data, and are dominated by dictionary encoding.

richardstartin commented 2 years ago

Here are the compressed sizes of 8KB (1024 doubles) of different distributions/patterns under Snappy and LZ4. There are encodings which can be introduced to reduce the size of metric columns (e.g. xor or delta encoding), but making it possible to compress metric columns with general-purpose compression algorithms isn't in the user's interest.

| Compression | Distribution | Compressed Size (KB) |
|---|---|---|
| Uncompressed | integer increments | 8.00 |
| LZ4 | integer increments | 4.09 |
| Snappy | integer increments | 4.02 |
| Uncompressed | noisy increments | 8.00 |
| LZ4 | noisy increments | 8.03 |
| Snappy | noisy increments | 8.00 |
| Uncompressed | sinusoidal | 8.00 |
| LZ4 | sinusoidal | 8.03 |
| Snappy | sinusoidal | 8.00 |
| Uncompressed | normal(0,1) | 8.00 |
| LZ4 | normal(0,1) | 8.03 |
| Snappy | normal(0,1) | 8.00 |
| Uncompressed | exp(0.999) | 8.00 |
| LZ4 | exp(0.999) | 7.23 |
| Snappy | exp(0.999) | 7.16 |
ashishkf commented 2 years ago

I think in many cases the metric values don't change much - for example, a CPU usage gauge will show only slight variations for a given metric series over a small interval (say, 10 minutes). We have seen good compression ratios: in our data (storing Kubernetes metrics), we are getting 1 byte per row instead of the 8 allocated for the `double` column. As a workaround, we marked the value column as a dimension so that LZ4 compression applies and we get the savings.
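The workaround described above corresponds to something like the following `fieldConfigList` entry, once the column is declared as a dimension (a sketch, not a verified config; the column name `value` is assumed):

```json
{
  "fieldConfigList": [
    {
      "name": "value",
      "encodingType": "RAW",
      "compressionCodec": "LZ4"
    }
  ]
}
```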

richardstartin commented 2 years ago

The problem is you won't get much in the way of savings from LZ4: those CPU readings can be almost identical, but with a little bit of noise the data is difficult for a text-oriented algorithm like LZ4 to compress. The XOR of any two adjacent values will typically have very few set bits, so it can yield high compression ratios, perhaps even 8x. Implementing codecs such as xor or delta encoding is a feature that has been discussed before, would not be very difficult, and would solve your problem in a way that making metric columns compressible would not.
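To illustrate the XOR point: for two nearly identical doubles, the XOR of their bit patterns has a long run of leading zeros and only a few set bits, which is what a Gorilla-style XOR codec exploits by storing just the short non-zero tail. A minimal sketch (not Pinot code; the sample values are made up):

```java
public class XorEncodingDemo {

    // Number of leading zero bits in the XOR of two doubles' bit patterns.
    static int leadingZerosOfXor(double prev, double curr) {
        long xor = Double.doubleToLongBits(prev) ^ Double.doubleToLongBits(curr);
        return Long.numberOfLeadingZeros(xor);
    }

    public static void main(String[] args) {
        double prev = 72.5;         // e.g. a CPU usage reading
        double curr = 72.5 + 1e-6;  // nearly identical: tiny noise
        long xor = Double.doubleToLongBits(prev) ^ Double.doubleToLongBits(curr);
        // Only the low mantissa bits differ, so most of the 64 bits are zero;
        // an XOR codec need only store the short non-zero suffix per value.
        System.out.println("leading zeros: " + Long.numberOfLeadingZeros(xor));
        System.out.println("set bits:      " + Long.bitCount(xor));
    }
}
```

LZ4 operates on byte sequences and finds repeated substrings, so it gains nothing from values whose bit patterns are close but not byte-identical; a bit-level XOR codec captures exactly that similarity.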