BohuTANG opened 1 year ago
Hi BohuTANG, I'd like to try optimizing this issue.
I deployed a standalone Databend, created the hits table, and loaded the data with `copy into hits from 'https://datasets.clickhouse.com/hits_compatible/hits.tsv.gz' FILE_FORMAT=(type='TSV' compression=AUTO);`
But I got an error. How can I solve this problem?
@ethzx Try loading the data using streaming load; I have not tested COPY INTO from GZ-compressed files.
Also, the original dataset is quite large; you can follow this guide: https://databend.rs/doc/use-cases/analyze-hits-dataset-with-databend
BTW, I do not think it's an easy task.
@ethzx
This error is due to your network and the hits dataset server.
You can try the following:
- Aliyun ECS (ap-south region)
- AWS EC2
Hello, is this still a problem after btrblocks?
> Hello, is this still a problem after btrblocks?

We haven't made any improvements to the Parquet format.
> Hello, is this still a problem after btrblocks?
>
> We haven't made any improvements to the Parquet format.
Hi, I found that Parquet also supports lightweight codecs (like RLE, delta, ...), but Databend only uses `Encoding::PLAIN` at `blocks_to_parquet`, because of performance issues: https://github.com/datafuselabs/databend/pull/9412
I'm not sure whether the same performance issue exists with the native format. If no problems are found in the native format, maybe we should also use btrblocks in the Parquet format to choose a more suitable codec for different data distributions?
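To make the lightweight-codec idea above concrete, here is a minimal, self-contained Rust sketch of RLE and delta encoding. It is illustrative only, not Databend's or Parquet's actual implementation; the point is how these codecs shrink repetitive or sorted columns before a general-purpose compressor (lz4/zstd) runs over the bytes.

```rust
/// Run-length encode a column into (value, run_length) pairs.
/// Repetitive columns (e.g. low-cardinality dimensions) collapse well.
fn rle_encode(input: &[i64]) -> Vec<(i64, usize)> {
    let mut out: Vec<(i64, usize)> = Vec::new();
    for &v in input {
        match out.last_mut() {
            Some((last, n)) if *last == v => *n += 1,
            _ => out.push((v, 1)),
        }
    }
    out
}

/// Delta encode: keep the first value, then successive differences.
/// Sorted columns (timestamps, auto-increment ids) become tiny deltas.
fn delta_encode(input: &[i64]) -> Vec<i64> {
    input
        .iter()
        .scan(0i64, |prev, &v| {
            let d = v - *prev;
            *prev = v;
            Some(d)
        })
        .collect()
}

fn main() {
    // Six values collapse to three (value, run) pairs.
    let repetitive = [7, 7, 7, 3, 3, 9];
    assert_eq!(rle_encode(&repetitive), vec![(7, 3), (3, 2), (9, 1)]);

    // Large absolute values become small deltas, which compress well.
    let sorted = [1000, 1001, 1003, 1006];
    assert_eq!(delta_encode(&sorted), vec![1000, 1, 2, 3]);
}
```

A btrblocks-style scheme would sample each column and pick between codecs like these (or plain) per block, rather than fixing one encoding globally.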
Hi.
> databend only uses Encoding::PLAIN at blocks_to_parquet. Because of performance issues https://github.com/datafuselabs/databend/pull/9412
Yes, we found plain encoding is enough on S3 because we already apply a general-purpose compressor (lz4, zstd); plain encoding saves the extra CPU cost.
> I'm not sure if the same performance issue exists with the native format?
The native format is experimental, so we can keep improving it. Currently, the native format's deserialization is faster than Parquet's. On S3 storage, an encoder with a slightly higher compression ratio does not seem to be the first thing to optimize: S3 has high I/O throughput, so reading a bit more data does not matter much, and we already merge many small I/Os into a large one with merge I/O.
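The merge-I/O idea mentioned above can be sketched in a few lines. This is an illustrative sketch, not Databend's actual merge-io code: byte ranges whose gap is below a threshold are coalesced, trading a few extra bytes read for fewer S3 requests.

```rust
/// Coalesce small (start, end) read ranges into fewer large ones.
/// Ranges separated by at most `max_gap` bytes are merged, since on
/// high-throughput object storage one larger read beats many small ones.
fn merge_ranges(mut ranges: Vec<(u64, u64)>, max_gap: u64) -> Vec<(u64, u64)> {
    ranges.sort_by_key(|r| r.0);
    let mut out: Vec<(u64, u64)> = Vec::new();
    for (start, end) in ranges {
        match out.last_mut() {
            // Gap to the previous range is small enough: extend it.
            Some((_, prev_end)) if start <= *prev_end + max_gap => {
                *prev_end = (*prev_end).max(end);
            }
            _ => out.push((start, end)),
        }
    }
    out
}

fn main() {
    // Three small column-chunk reads; gaps of <= 16 bytes are absorbed,
    // so the first two reads become one request.
    let reads = vec![(0, 100), (110, 200), (400, 500)];
    assert_eq!(merge_ranges(reads, 16), vec![(0, 200), (400, 500)]);
}
```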
Parquet is a standard format; we can't modify it with btrblocks-style incompatible changes, otherwise we would end up with a new format.
As the paper describes:
> Update the standard or create a new format? Yet, improving existing widespread formats such as Parquet is more desirable than creating a new data format: For users, there would be no costly data migration, no breaking changes and fast decompression just by updating a library version. Unfortunately, our experiments indicate that low-level improvements are not enough, and integrating larger parts of BtrBlocks (such as new encodings and cascading compression) into Parquet will cause version incompatibilities. Such a "Parquet v3" would not share much with the original besides the name, with no actual benefit to existing users of Parquet. Instead, we have open-sourced BtrBlocks and hope that compatible improvements will find their way into Parquet, while also building a new format based on BtrBlocks that is independent of Parquet.
Thanks for your reply!
> The native format is experimental, so we can keep improving it. Currently, the native format's deserialization is faster than Parquet's. On S3 storage, an encoder with a slightly higher compression ratio does not seem to be the first thing to optimize: S3 has high I/O throughput, so reading a bit more data does not matter much, and we already merge many small I/Os into a large one with merge I/O.
>
> Parquet is a standard format; we can't modify it with btrblocks-style incompatible changes, otherwise we would end up with a new format. As the paper describes.
So we don't need to pay too much attention to the Parquet format, and should just wait for the native format to be production-ready. Do I understand correctly?
Summary: from my test, hits Q22 is slow and reads more data than Snowflake. SQL:
- Databend scan: 2.6G
- Snowflake scan: 1.6G
References:
- For short URL strings, there are some improvements in DuckDB: Lightweight Compression in DuckDB.
- Parquet delta strings (DELTA_BYTE_ARRAY = 7)
- smaz is worth a try: https://docs.rs/smaz/latest/smaz/
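To illustrate why DELTA_BYTE_ARRAY helps on URL columns, here is a hedged Rust sketch of the underlying technique (incremental/front coding): each string stores only the length of the prefix it shares with the previous string plus its new suffix. This is not Parquet's actual on-disk layout, just the core idea; it assumes ASCII input, since the prefix length is counted in bytes.

```rust
/// Length of the common byte prefix of two strings (assumes ASCII).
fn shared_prefix_len(a: &str, b: &str) -> usize {
    a.bytes().zip(b.bytes()).take_while(|(x, y)| x == y).count()
}

/// Front-code a sorted string column into (shared_prefix_len, suffix) pairs.
fn front_code(strings: &[&str]) -> Vec<(usize, String)> {
    let mut out = Vec::new();
    let mut prev = "";
    for &s in strings {
        let p = shared_prefix_len(prev, s);
        out.push((p, s[p..].to_string()));
        prev = s;
    }
    out
}

/// Decode (shared_prefix_len, suffix) pairs back to the original strings.
fn front_decode(encoded: &[(usize, String)]) -> Vec<String> {
    let mut out: Vec<String> = Vec::new();
    for (p, suffix) in encoded {
        let mut s = out.last().map(|l| l[..*p].to_string()).unwrap_or_default();
        s.push_str(suffix);
        out.push(s);
    }
    out
}

fn main() {
    let urls = [
        "https://example.com/page/1",
        "https://example.com/page/2",
        "https://example.com/profile",
    ];
    let encoded = front_code(&urls);
    // The common prefix "https://example.com/page/" (25 bytes) is stored
    // once; the second URL is reduced to a single suffix byte.
    assert_eq!(encoded[1], (25, "2".to_string()));
    // Round-trips losslessly.
    assert_eq!(front_decode(&encoded), urls);
}
```

For the Q22 scan-size gap above, a codec like this on the sorted URL column is exactly the kind of lightweight encoding that could narrow the difference with Snowflake.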