performance: compressing for short strings #9001

Open BohuTANG opened 1 year ago

BohuTANG commented 1 year ago

Summary: From my test, hits Q22 is slow and reads more data than Snowflake. SQL:

SELECT SearchPhrase, MIN(URL), COUNT(*) AS c FROM hits WHERE URL LIKE '%google%' AND SearchPhrase <> '' GROUP BY SearchPhrase ORDER BY c DESC LIMIT 10;

Databend: Scan: 2.6G


Snowflake: Scan: 1.6G


ethzx commented 1 year ago

Hi BohuTANG, I'd like to try to optimize this issue.

I deployed a standalone Databend, created the hits table, and loaded the data with: copy into hits from 'https://datasets.clickhouse.com/hits_compatible/hits.tsv.gz' FILE_FORMAT=(type='TSV' compression=AUTO);

But I got an error. How can I solve this problem?

sundy-li commented 1 year ago

@ethzx Try loading the data using streaming load; I did not test COPY INTO from GZ-compressed files.

Also, the original dataset is quite large; you can use this guide: https://databend.rs/doc/use-cases/analyze-hits-dataset-with-databend
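If streaming load is inconvenient, another possible workaround is to download the file first and load it from a named stage. This is only a sketch under assumptions: the stage name hits_stage and the exact FILE_FORMAT options are illustrative, not verified against the failing setup.

```sql
-- Sketch only: load the dataset from a named internal stage instead of the remote URL.
-- The stage name and options below are assumptions; adjust to your environment.
CREATE STAGE hits_stage;

-- After uploading hits.tsv.gz to @hits_stage (e.g. via a presigned URL):
COPY INTO hits
FROM @hits_stage
FILES = ('hits.tsv.gz')
FILE_FORMAT = (TYPE = 'TSV' COMPRESSION = 'GZIP');
```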

sundy-li commented 1 year ago

BTW, I do not think it's an easy task.

BohuTANG commented 1 year ago

@ethzx

This error is due to your network and the hits dataset server. You can try running the load from an Aliyun ECS instance (ap-south region) or an AWS EC2 instance.

ncuwaln commented 6 months ago

Hello, is this still a problem after btrblocks?

sundy-li commented 6 months ago

> Hello, is this still a problem after btrblocks?

We haven't made any improvements to the parquet format.

ncuwaln commented 6 months ago

> Hello, is this still a problem after btrblocks?

> We haven't made any improvements to the parquet format.

Hi, I found that parquet also supports lightweight codecs (like RLE, delta, ...), but Databend only uses Encoding::PLAIN in blocks_to_parquet because of performance issues (https://github.com/datafuselabs/databend/pull/9412).

I'm not sure whether the same performance issue exists with the native format. If the native format has no such problem, maybe we should also use the btrblocks approach in the parquet format to choose a more suitable codec for different data distributions?

sundy-li commented 6 months ago

Hi.

> Databend only uses Encoding::PLAIN in blocks_to_parquet because of performance issues (https://github.com/datafuselabs/databend/pull/9412)

Yes, we found that plain encoding is enough on S3 because we already apply a general-purpose compressor (lz4, zstd); plain encoding saves the extra CPU cost.

> I'm not sure whether the same performance issue exists with the native format.

The native format is an experimental format, so we can keep improving it. Currently, the native format's deserialization is faster than parquet's. On S3 storage, an encoder with a slightly higher compression ratio does not seem to be the first thing to optimize: S3 has high I/O throughput, so reading a bit more data does not matter much, and we already merge many small I/O requests into larger ones with merge I/O.
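For anyone who wants to experiment with the native format, here is a minimal sketch of a table definition; it assumes the storage_format and compression table options, so the option names and values should be checked against the current documentation.

```sql
-- Sketch: a table using the experimental native storage format.
-- storage_format / compression option names are assumptions, not verified here.
CREATE TABLE hits_native (
    SearchPhrase STRING,
    URL STRING
) storage_format = 'native' compression = 'zstd';
```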

The parquet format has its own standard; we can't modify it the way btrblocks does without breaking compatibility, otherwise we would end up with a new format. As the paper describes:

> Update the standard or create a new format? Yet, improving existing widespread formats such as Parquet is more desirable than creating a new data format: For users, there would be no costly data migration, no breaking changes and fast decompression just by updating a library version. Unfortunately, our experiments indicate that low-level improvements are not enough, and integrating larger parts of BtrBlocks – such as new encodings and cascading compression – into Parquet will cause version incompatibilities. Such a "Parquet v3" would not share much with the original besides the name, with no actual benefit to existing users of Parquet. Instead, we have open-sourced BtrBlocks and hope that compatible improvements will find their way into Parquet, while also building a new format based on BtrBlocks that is independent of Parquet.

ncuwaln commented 6 months ago

Thanks for your reply!

> The native format is an experimental format, so we can keep improving it. Currently, the native format's deserialization is faster than parquet's. On S3 storage, an encoder with a slightly higher compression ratio does not seem to be the first thing to optimize: S3 has high I/O throughput, so reading a bit more data does not matter much, and we already merge many small I/O requests into larger ones with merge I/O.

> The parquet format has its own standard; we can't modify it the way btrblocks does without breaking compatibility, otherwise we would end up with a new format.

So we don't need to pay too much attention to the parquet format issue and can just wait for the native format to be ready for production, do I understand correctly?