GSharayu opened this issue 3 years ago
There is some previous discussion here: https://github.com/apache/incubator-pinot/issues/5407
I'd also suggest adding Gorilla TSZ compression to the list, which was proposed in the Facebook Gorilla paper. This compression algorithm was adopted by Uber's m3TSZ, which showed a 40% improvement over standard TSZ as observed from Uber's production workloads.
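As a rough illustration (not Pinot code; the class name and output are made up), a minimal Java sketch of the XOR-of-doubles idea behind Gorilla/TSZ could look like this. A real encoder bit-packs the control bits and the meaningful XOR bits instead of just printing a summary:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of Gorilla-style XOR encoding for a stream of doubles.
// A real implementation (e.g. M3TSZ) writes control bits plus only the
// "meaningful" bits of each XOR; here we only summarize what would be written.
public class GorillaXorSketch {
  public static void main(String[] args) {
    double[] values = {12.0, 12.0, 12.5, 13.0, 13.0};
    long prev = Double.doubleToLongBits(values[0]);
    List<String> encoded = new ArrayList<>();
    encoded.add("first value stored verbatim: " + values[0]);

    for (int i = 1; i < values.length; i++) {
      long bits = Double.doubleToLongBits(values[i]);
      long xor = bits ^ prev;
      if (xor == 0) {
        // Identical consecutive values cost a single '0' control bit in Gorilla.
        encoded.add("repeat (1 control bit)");
      } else {
        // Otherwise only the bits between the leading and trailing zero runs
        // of the XOR need to be written.
        int leading = Long.numberOfLeadingZeros(xor);
        int trailing = Long.numberOfTrailingZeros(xor);
        int meaningful = 64 - leading - trailing;
        encoded.add("xor block: " + meaningful + " meaningful bits");
      }
      prev = bits;
    }
    encoded.forEach(System.out::println);
  }
}
```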
I feel we should separate encoding and compression. Maybe add two new fields into the schema. For raw index encoding, we can support multiple options per data type, e.g.:
- INT/LONG: Delta / DoubleDelta / Gorilla
- FLOAT/DOUBLE: Gorilla
For compression, we can support LZO, LZ4, ZSTD, DEFLATE, GZIP, SNAPPY, etc.
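To make the proposal concrete, here is a hypothetical sketch of what two separated per-column fields could look like; the class and enum names are illustrative only, not existing Pinot config classes:

```java
// Hypothetical shape of the two proposed per-column fields: one for the
// value encoding of the raw index and one for the block compression codec.
public class RawColumnCodecConfig {
  enum RawEncoding { NONE, DELTA, DOUBLE_DELTA, GORILLA }
  enum Compression { SNAPPY, LZO, LZ4, ZSTD, DEFLATE, GZIP }

  private final RawEncoding encoding;
  private final Compression compression;

  public RawColumnCodecConfig(RawEncoding encoding, Compression compression) {
    this.encoding = encoding;
    this.compression = compression;
  }

  public RawEncoding getEncoding() { return encoding; }
  public Compression getCompression() { return compression; }
}
```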
In the columnar database world, "encoding" commonly refers to column-level compression techniques that play nicely with columnar query execution - dictionary encoding, RLE, delta - where the true benefit is that query processing can happen faster on compressed columnar data (e.g. dictionary encoding), and obviously there is storage saving as well.
The purpose of this issue is not to add any new column-level encoding. I was thinking of having a separate issue to enhance column-level encoding support with RLE, DELTA, PFORDELTA, etc.
This issue is for supporting additional data compression codecs for raw data, which is currently Snappy compressed.
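As a toy example of the distinction (not Pinot code), the sketch below dictionary-encodes a string column and evaluates a filter directly on the encoded ids, which is the kind of query-time benefit that generic block codecs like Snappy/LZ4 do not give:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration of column-level encoding vs. block compression:
// with dictionary encoding, a filter can be evaluated directly on the
// encoded ids without decompressing the column values.
public class DictionaryEncodingSketch {
  public static void main(String[] args) {
    String[] column = {"US", "IN", "US", "FR", "IN", "US"};

    // Build the dictionary and the encoded forward index (string -> int id).
    Map<String, Integer> dictionary = new HashMap<>();
    int[] encoded = new int[column.length];
    for (int i = 0; i < column.length; i++) {
      Integer id = dictionary.get(column[i]);
      if (id == null) {
        id = dictionary.size();
        dictionary.put(column[i], id);
      }
      encoded[i] = id;
    }

    // Predicate "country = 'US'" becomes a single integer comparison per row.
    int targetId = dictionary.get("US");
    List<Integer> matchingRows = new ArrayList<>();
    for (int i = 0; i < encoded.length; i++) {
      if (encoded[i] == targetId) {
        matchingRows.add(i);
      }
    }
    System.out.println("rows matching US: " + matchingRows);
  }
}
```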
Got it, so we should still make it columnar and do it at the block level? Then we still need to add this to the schema, right?
Yes, this will also be columnar and will be at the block level, although LZ4 supports some form of streaming/frames. I am not sure why we need to add it to the schema, though. Do you mean configuring it via table config?
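For block-level compression, a minimal round-trip with the lz4-java block API might look like the sketch below (assuming the lz4-java library is on the classpath; in a real chunk format the uncompressed length would live in the chunk header, and the LZ4 frame/streaming format would be a separate option):

```java
import java.nio.charset.StandardCharsets;
import net.jpountz.lz4.LZ4Compressor;
import net.jpountz.lz4.LZ4Factory;
import net.jpountz.lz4.LZ4SafeDecompressor;

public class Lz4BlockRoundTrip {
  public static void main(String[] args) {
    byte[] block = "raw forward index chunk raw forward index chunk".getBytes(StandardCharsets.UTF_8);

    LZ4Factory factory = LZ4Factory.fastestInstance();
    LZ4Compressor compressor = factory.fastCompressor();

    // Compress one block/chunk of the raw column data.
    byte[] compressed = compressor.compress(block);

    // The safe decompressor only needs an upper bound on the original size.
    LZ4SafeDecompressor decompressor = factory.safeDecompressor();
    byte[] restored = decompressor.decompress(compressed, block.length);

    System.out.println("block=" + block.length + " compressed=" + compressed.length
        + " match=" + java.util.Arrays.equals(block, restored));
  }
}
```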
Because we want to allow tuning compression on a per-column basis, e.g. column1 in Snappy and column2 in LZ4, right?
This info can be stored:
Right. Since all the per-column index/encoding/compression and any tuning info is in the table config, maybe we can continue to have it in the config. The table config already has a field to capture this.
Got it. For this part, if users change the compression type config for a column, do we consider rebuilding the column data?
When the forward index is not dictionary encoded, we have 2 choices:
In addition to snappy, we should add support for other compression codecs subject to their availability in Java libraries.
Currently, we use Snappy compression by default. However, this doesn't give a particularly good compression ratio for free-text data. LZO is known to provide a better compression ratio and speed for larger char/varchar data.
So, we should explore other options.
Firstly, we should start with a simple test case that compresses and uncompresses a direct byte buffer, and do some functional and performance tests.
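As a starting point, a round-trip over direct byte buffers with the current Snappy codec (snappy-java, assumed to be on the classpath) could look roughly like this; exact buffer position/limit handling may differ slightly across library versions:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import org.xerial.snappy.Snappy;

public class DirectBufferSnappyTest {
  public static void main(String[] args) throws Exception {
    byte[] input = "some free-text column value, repeated repeated repeated".getBytes(StandardCharsets.UTF_8);

    // Snappy's ByteBuffer API requires direct buffers on both sides.
    ByteBuffer uncompressed = ByteBuffer.allocateDirect(input.length);
    uncompressed.put(input).flip();

    ByteBuffer compressed = ByteBuffer.allocateDirect(Snappy.maxCompressedLength(input.length));
    int compressedSize = Snappy.compress(uncompressed, compressed);

    ByteBuffer restored = ByteBuffer.allocateDirect(input.length);
    int restoredSize = Snappy.uncompress(compressed, restored);

    byte[] out = new byte[restoredSize];
    restored.get(out);
    System.out.println("compressed " + input.length + " -> " + compressedSize
        + " bytes, round-trip ok: " + new String(out, StandardCharsets.UTF_8)
            .equals(new String(input, StandardCharsets.UTF_8)));
  }
}
```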
See the ZSTD library for Java: https://github.com/luben/zstd-jni
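A corresponding ZSTD round-trip with zstd-jni, using its simple byte[] API, might look like this (compression level and input data are arbitrary):

```java
import java.nio.charset.StandardCharsets;
import com.github.luben.zstd.Zstd;

public class ZstdRoundTrip {
  public static void main(String[] args) {
    byte[] input = "free-text log line free-text log line free-text log line".getBytes(StandardCharsets.UTF_8);

    // Default compression level; zstd-jni also exposes Zstd.compress(byte[], int level).
    byte[] compressed = Zstd.compress(input);

    // Decompression into a byte[] needs the original size (or an upper bound).
    byte[] restored = Zstd.decompress(compressed, input.length);

    System.out.println("original=" + input.length
        + " compressed=" + compressed.length
        + " match=" + java.util.Arrays.equals(input, restored));
  }
}
```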
Any new ideas/suggestions?