performance: compressing for short strings #9001

Open BohuTANG opened 1 year ago

BohuTANG commented 1 year ago

Summary: From my test, hits Q22 is slow and reads more data than Snowflake. SQL:

SELECT SearchPhrase, MIN(URL), COUNT(*) AS c FROM hits WHERE URL LIKE '%google%' AND SearchPhrase <> '' GROUP BY SearchPhrase ORDER BY c DESC LIMIT 10;

Databend: Scan: 2.6G


Snowflake: Scan: 1.6G


ethzx commented 1 year ago

Hi BohuTANG, I'd like to try to optimize this issue.

I deployed a standalone Databend, created the hits table, and loaded the data with: copy into hits from 'https://datasets.clickhouse.com/hits_compatible/hits.tsv.gz' FILE_FORMAT=(type='TSV' compression=AUTO);

But I got an error. How can I solve this problem?

sundy-li commented 1 year ago

@ethzx Try loading the data using streaming load; I did not test COPY INTO from GZ-compressed files.

Also, the original dataset is quite large; you can use this guide: https://databend.rs/doc/use-cases/analyze-hits-dataset-with-databend
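If streaming load is inconvenient, another possible workaround is to download the file first and load it from a named stage. This is only a sketch under assumptions: the stage name hits_stage and the exact FILE_FORMAT options are illustrative, not verified against the failing setup.

```sql
-- Sketch only: load the dataset from a named internal stage instead of the remote URL.
-- The stage name and options below are assumptions; adjust to your environment.
CREATE STAGE hits_stage;

-- After uploading hits.tsv.gz to @hits_stage (e.g. via a presigned URL):
COPY INTO hits
FROM @hits_stage
FILES = ('hits.tsv.gz')
FILE_FORMAT = (TYPE = 'TSV' COMPRESSION = 'GZIP');
```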

sundy-li commented 1 year ago

BTW, I do not think it's an easy task.

BohuTANG commented 1 year ago

@ethzx

This error is due to your network and the hits dataset server. You can try running the load from an Aliyun ECS instance (ap-south region) or an AWS EC2 instance.

ncuwaln commented 6 months ago

Hello, is this still a problem after btrblocks?

sundy-li commented 6 months ago

> Hello, is this still a problem after btrblocks?

We haven't made any improvements to the parquet format.

ncuwaln commented 6 months ago

> Hello, is this still a problem after btrblocks?

> We haven't made any improvements to the parquet format.

Hi, I found that parquet also supports lightweight codecs (like RLE, delta, ...), but Databend only uses Encoding::PLAIN in blocks_to_parquet because of performance issues (https://github.com/datafuselabs/databend/pull/9412).

I'm not sure whether the same performance issue exists with the native format. If the native format has no such problem, maybe we should also use the btrblocks approach in the parquet format to choose a more suitable codec for different data distributions?

sundy-li commented 6 months ago

Hi.

> Databend only uses Encoding::PLAIN in blocks_to_parquet because of performance issues (https://github.com/datafuselabs/databend/pull/9412)

Yes, we found that plain encoding is enough on S3 because we already apply a general-purpose compressor (lz4, zstd); plain encoding saves the extra CPU cost.

> I'm not sure whether the same performance issue exists with the native format.

The native format is an experimental format, so we can keep improving it. Currently, the native format's deserialization is faster than parquet's. On S3 storage, an encoder with a slightly higher compression ratio does not seem to be the first thing to optimize: S3 has high I/O throughput, so reading a bit more data does not matter much, and we already merge many small I/O requests into larger ones with merge I/O.
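For anyone who wants to experiment with the native format, here is a minimal sketch of a table definition; it assumes the storage_format and compression table options, so the option names and values should be checked against the current documentation.

```sql
-- Sketch: a table using the experimental native storage format.
-- storage_format / compression option names are assumptions, not verified here.
CREATE TABLE hits_native (
    SearchPhrase STRING,
    URL STRING
) storage_format = 'native' compression = 'zstd';
```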

The parquet format has its own standard; we can't modify it the way btrblocks does without breaking compatibility, otherwise we would end up with a new format. As the paper describes:

> Update the standard or create a new format? Yet, improving existing widespread formats such as Parquet is more desirable than creating a new data format: For users, there would be no costly data migration, no breaking changes and fast decompression just by updating a library version. Unfortunately, our experiments indicate that low-level improvements are not enough, and integrating larger parts of BtrBlocks – such as new encodings and cascading compression – into Parquet will cause version incompatibilities. Such a "Parquet v3" would not share much with the original besides the name, with no actual benefit to existing users of Parquet. Instead, we have open-sourced BtrBlocks and hope that compatible improvements will find their way into Parquet, while also building a new format based on BtrBlocks that is independent of Parquet.

ncuwaln commented 6 months ago

Thanks for your reply!

> The native format is an experimental format, so we can keep improving it. Currently, the native format's deserialization is faster than parquet's. On S3 storage, an encoder with a slightly higher compression ratio does not seem to be the first thing to optimize: S3 has high I/O throughput, so reading a bit more data does not matter much, and we already merge many small I/O requests into larger ones with merge I/O.

> The parquet format has its own standard; we can't modify it the way btrblocks does without breaking compatibility, otherwise we would end up with a new format.

So we don't need to pay too much attention to the parquet format issue and can just wait for the native format to be ready for production, do I understand correctly?