apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0

Support LZO, LZ4, ZSTD, DEFLATE, GZIP compression codecs for raw index #6804

Open · GSharayu opened this issue 3 years ago

GSharayu commented 3 years ago

When the forward index is not dictionary encoded (i.e. a raw index), we have 2 choices: store the values uncompressed (pass-through) or compress them with Snappy.

In addition to Snappy, we should add support for other compression codecs, subject to their availability in Java libraries.

Currently we use Snappy compression by default. However, it doesn't give a good compression ratio for free-text data. LZO is known to provide a better compression ratio and speed for larger char/varchar data.

So, we should explore other options.

First, we should write a simple test case that compresses and decompresses a direct byte buffer, and run some functional and performance tests.

See the ZSTD library for Java: https://github.com/luben/zstd-jni
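For example, a minimal round-trip test over a direct byte buffer might look like this (a sketch assuming zstd-jni's direct-ByteBuffer overloads; exact buffer handling may vary by version):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

import com.github.luben.zstd.Zstd;

public class ZstdDirectBufferTest {
  public static void main(String[] args) {
    byte[] input = "free-text value free-text value free-text value"
        .getBytes(StandardCharsets.UTF_8);

    // Raw index chunks live off-heap, so exercise the direct-buffer path.
    ByteBuffer src = ByteBuffer.allocateDirect(input.length);
    src.put(input).flip();

    ByteBuffer dst = ByteBuffer.allocateDirect((int) Zstd.compressBound(input.length));
    int compressedSize = Zstd.compress(dst, src, 3); // level 3 as a starting point
    dst.flip();

    ByteBuffer out = ByteBuffer.allocateDirect(input.length);
    int decompressedSize = Zstd.decompress(out, dst);

    System.out.println("original=" + input.length
        + " compressed=" + compressedSize
        + " decompressed=" + decompressedSize);
  }
}
```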

Any new ideas/suggestions?

xiangfu0 commented 3 years ago

there is some previous discussion here: https://github.com/apache/incubator-pinot/issues/5407

yupeng9 commented 3 years ago

I'd also suggest adding Gorilla TSZ compression to the list, which was proposed by Facebook's Gorilla project. This algorithm is adopted by Uber's m3TSZ, which showed a 40% improvement over standard TSZ on Uber's production workloads.
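The core idea behind Gorilla (sketched below, with the actual bit-packing omitted) is to XOR each value with its predecessor; consecutive time-series values are usually close, so the XOR has long runs of zero bits that pack compactly:

```java
public class GorillaXorSketch {
  public static void main(String[] args) {
    // Sketch of the Gorilla idea only: XOR consecutive doubles. The real
    // codec then encodes the leading-zero count plus the meaningful bits.
    double[] series = {100.0, 100.0, 100.5, 100.5, 101.0};
    long prevBits = Double.doubleToRawLongBits(series[0]);
    for (int i = 1; i < series.length; i++) {
      long bits = Double.doubleToRawLongBits(series[i]);
      long xor = bits ^ prevBits;
      // Identical values XOR to 0 (64 leading zeros) and cost ~1 bit to store.
      System.out.printf("xor=%016x leadingZeros=%d%n", xor, Long.numberOfLeadingZeros(xor));
      prevBits = bits;
    }
  }
}
```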

xiangfu0 commented 3 years ago

I feel we should separate encoding and compression - maybe add two new fields to the schema. For raw index encoding, we can support multiple options for different types, e.g.:

INT/LONG: Delta / DoubleDelta / Gorilla
FLOAT/DOUBLE: Gorilla

For compression, we can support LZO, LZ4, ZSTD, DEFLATE, GZIP, SNAPPY, etc.
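To make the distinction concrete, a purely hypothetical shape for the two settings (invented names, not actual Pinot classes):

```java
// Hypothetical sketch: encoding and compression as two orthogonal
// per-column settings.
enum RawEncoding { NONE, DELTA, DOUBLE_DELTA, GORILLA }
enum ChunkCompression { PASS_THROUGH, SNAPPY, LZO, LZ4, ZSTD, DEFLATE, GZIP }

final class RawIndexConfig {
  final RawEncoding encoding;          // type-aware transform, e.g. GORILLA for FLOAT/DOUBLE
  final ChunkCompression compression;  // type-agnostic byte-level codec applied afterwards

  RawIndexConfig(RawEncoding encoding, ChunkCompression compression) {
    this.encoding = encoding;
    this.compression = compression;
  }
}
```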

siddharthteotia commented 3 years ago

In the columnar database world, "encoding" commonly refers to column-level compression techniques that play nicely with columnar query execution - dictionary encoding, RLE, delta - where the true benefit is that query processing can happen faster on compressed columnar data (e.g. dictionary encoding), with the storage saving as an obvious bonus.
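As a toy illustration (not Pinot's actual code) of querying on encoded data: an equality filter can be evaluated on dictionary ids without ever materializing the strings.

```java
import java.util.Arrays;

public class DictionaryFilterSketch {
  public static void main(String[] args) {
    // Sorted dictionary of distinct column values; rows store ids, not strings.
    String[] dictionary = {"BR", "IN", "US"};
    int[] encodedColumn = {2, 0, 2, 1, 2};

    // WHERE country = 'US': resolve the literal to a dictionary id once...
    int targetId = Arrays.binarySearch(dictionary, "US");

    // ...then the scan is pure integer comparison on compact data.
    int matches = 0;
    for (int id : encodedColumn) {
      if (id == targetId) {
        matches++;
      }
    }
    System.out.println(matches); // 3
  }
}
```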

The purpose of this issue is not to add any new column-level encodings; I was thinking of filing a separate issue to enhance column-level encoding support with RLE, DELTA, PFORDELTA, etc.

This issue is for supporting additional data compression codecs for raw data, which is currently Snappy-compressed.

xiangfu0 commented 3 years ago

> This issue is for supporting additional data compression codecs for raw data, which is currently Snappy-compressed.

Got it, so we should still make it columnar and do it at the block level? Then we still need to add this to the schema, right?

siddharthteotia commented 3 years ago

> Got it, so we should still make it columnar and do it at the block level? Then we still need to add this to the schema, right?

Yes, this will also be columnar and at the block level, although LZ4 supports some form of streaming/framing. I am not sure why we need to add it to the schema - do you mean configuring it via the table config?
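A rough sketch of block-level compression using lz4-java (a hypothetical helper, not Pinot's chunk writer; a real implementation would also persist each block's uncompressed length for random access):

```java
import net.jpountz.lz4.LZ4Compressor;
import net.jpountz.lz4.LZ4Factory;
import net.jpountz.lz4.LZ4FastDecompressor;

// Each chunk is compressed independently, so a reader can decompress just
// the block holding the docIds it needs - no streaming state across blocks.
public class BlockLevelLz4Sketch {
  private static final int BLOCK_SIZE = 4096;
  private static final LZ4Factory FACTORY = LZ4Factory.fastestInstance();

  static byte[][] compressBlocks(byte[] data) {
    LZ4Compressor compressor = FACTORY.fastCompressor();
    int numBlocks = (data.length + BLOCK_SIZE - 1) / BLOCK_SIZE;
    byte[][] blocks = new byte[numBlocks][];
    for (int i = 0; i < numBlocks; i++) {
      int from = i * BLOCK_SIZE;
      int len = Math.min(BLOCK_SIZE, data.length - from);
      byte[] chunk = new byte[len];
      System.arraycopy(data, from, chunk, 0, len);
      blocks[i] = compressor.compress(chunk); // self-contained block
    }
    return blocks;
  }

  static byte[] decompressBlock(byte[] block, int originalLen) {
    LZ4FastDecompressor decompressor = FACTORY.fastDecompressor();
    return decompressor.decompress(block, originalLen); // random access per block
  }
}
```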

xiangfu0 commented 3 years ago

> I am not sure why we need to add it to the schema - do you mean configuring it via the table config?

Because we want to allow tuning compression on a per-column basis, e.g. column1 in Snappy and column2 in LZ4, right?

Where should this info be stored?

siddharthteotia commented 3 years ago

Right. Since all the per-column index/encoding/compression and tuning info is in the table config, maybe we can continue to keep it there. The table config already has a field to capture this.
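For illustration, the per-column lookup could be as simple as this (invented names, not the actual table-config schema):

```java
import java.util.Map;

public class CompressionConfigSketch {
  private static final String DEFAULT_CODEC = "SNAPPY";

  // Resolve a column's codec from per-column config, falling back to the default.
  static String codecFor(String column, Map<String, String> perColumnCodecs) {
    return perColumnCodecs.getOrDefault(column, DEFAULT_CODEC);
  }

  public static void main(String[] args) {
    Map<String, String> configured = Map.of("column1", "SNAPPY", "column2", "LZ4");
    System.out.println(codecFor("column2", configured)); // LZ4
    System.out.println(codecFor("column3", configured)); // SNAPPY (default)
  }
}
```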

xiangfu0 commented 3 years ago

> Since all the per-column index/encoding/compression and tuning info is in the table config, maybe we can continue to keep it there.

Got it. For this part, if a user changes a column's compression type in the config, do we consider rebuilding that column's data?