
Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

S3 compression Issue with Iceberg #8713

Open swat1234 opened 12 months ago

swat1234 commented 12 months ago

Iceberg tables are not compressing Parquet files in S3. When the table properties below are used for compression, the file size increases compared to the uncompressed output. Can someone please assist with this?

  1. File with UNCOMPRESSED codec (682 bytes):

     00000-0-0129ba78-17f6-466f-b57b-695c678d64d5-00001.parquet
     "properties" : { "codec" : "UNCOMPRESSED" }

  2. File with GZIP codec (733 bytes):

     00000-0-e6f22c0e-2e16-43aa-8a5f-efabee995876-00001.parquet
     "properties" : { "codec" : "GZIP" }

  3. File with SNAPPY codec (686 bytes):

     00000-0-36fd4aad-8c38-40f5-8241-78ffe4f0a032-00001.parquet
     "properties" : { "codec" : "SNAPPY" }


Table Properties:

"parquet.compression": "SNAPPY" "read.parquet.vectorization.batch-size": "5000" "read.split.target-size": "134217728" "read.parquet.vectorization.enabled": "true" "write.parquet.page-size-bytes": "1048576" "write.parquet.row-group-size-bytes": "134217728" "write_compression": "SNAPPY" "write.parquet.compression-codec": "snappy" "write.metadata.metrics.max-inferred-column-defaults": "100" "write.parquet.compression-level": "4" "write.target-file-size-bytes": "536870912" "write.delete.target-file-size-bytes": "67108864" "write.parquet.page-row-limit": "20000" "write.format.default": "parquet" "write.metadata.compression-codec": "gzip" "write.compression": "SNAPPY"

Thanks in advance!!

nastra commented 12 months ago

I see that you configured "write.metadata.compression-codec": "gzip", but that property compresses table metadata files, not individual data files. Also, is there any particular reason to set parquet.compression / write.compression / ... and all the others? The setting that controls data file compression is write.parquet.compression-codec, which defaults to gzip.
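
For data files, setting just that one property should be enough. As a minimal sketch (PySpark, assuming an existing SparkSession named spark with the Iceberg extensions configured; prod.db.sample is a placeholder table name):

spark.sql("""
    ALTER TABLE prod.db.sample
    SET TBLPROPERTIES ('write.parquet.compression-codec' = 'snappy')
""")

Note that only data files written after the change use the new codec; files already in the table are left as they were.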

swat1234 commented 12 months ago

I am trying to reduce the storage space of the files by applying Snappy or Gzip compression. I can see the metadata is getting compressed with gzip, but not the data files. Could you guide me on how to do it?

nastra commented 12 months ago

I would start by reducing the number of unrelated table properties being set. As I mentioned earlier, the one that matters in your case is write.parquet.compression-codec, which defaults to gzip but can also be set to snappy or zstd. The other setting you can experiment with is write.parquet.compression-level.
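
As a rough sketch of that cleanup (PySpark; prod.db.sample is a placeholder, and the exact list of properties to drop depends on what was set on your table):

spark.sql("""
    ALTER TABLE prod.db.sample UNSET TBLPROPERTIES (
        'parquet.compression', 'write.compression', 'write_compression'
    )
""")
spark.sql("""
    ALTER TABLE prod.db.sample SET TBLPROPERTIES (
        'write.parquet.compression-codec' = 'zstd',
        'write.parquet.compression-level' = '9'
    )
""")

The level property only matters for codecs that support levels, such as zstd and gzip; snappy ignores it.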

swat1234 commented 12 months ago

We tried with only the write.parquet.compression-codec parameter, set to snappy and then gzip, but it is not working. Instead of compressing, the file size increases.

RussellSpitzer commented 12 months ago

If you are only testing with sub-kilobyte files, the results will be bad. There is a fixed, amortized cost in every Parquet file, and most of such a small file (the footer) is not compressed. Try with larger files.

amogh-jahagirdar commented 12 months ago

+1 to @RussellSpitzer's point. These files are far too small for compression to have a meaningful effect. Compression is most noticeable on significant amounts of "similar" data.
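
As an illustration, here is a sketch of generating enough repetitive rows for a codec comparison to be meaningful (PySpark, assuming a SparkSession named spark; prod.db.compression_test is a placeholder table name):

# ~10 million rows of highly repetitive values; compression differences
# between codecs should be clearly visible at this scale.
df = spark.range(10_000_000).selectExpr(
    "id",
    "id % 100 AS category",
    "repeat('abc', 50) AS payload",
)
df.writeTo("prod.db.compression_test").createOrReplace()

Switching write.parquet.compression-codec and appending the same data again then gives file sizes that can be compared like for like.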

swat1234 commented 12 months ago

I have tried with a large amount of data. Below are the outcomes.

  1. File with UNCOMPRESSED codec - 1.8 GB

  2. File with gzip codec - 1.8 GB

  3. File with snappy codec - 2.8 GB

swat1234 commented 12 months ago

Can someone please advise.

jhchee commented 12 months ago

@swat1234 The result with the UNCOMPRESSED codec looks unusual (it shouldn't be smaller than the snappy one). Are you sure your experiment is using this config? It should look like:

write.parquet.compression-codec=uncompressed

ZSTD may compress better than GZIP when using a higher level. As a suggestion:

write.parquet.compression-codec=zstd
write.parquet.compression-level=9
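
To check what actually got written, the per-file sizes can be pulled from the table's files metadata table, for example (PySpark sketch; prod.db.sample is a placeholder table name):

spark.sql("""
    SELECT file_path, file_size_in_bytes, record_count
    FROM prod.db.sample.files
""").show(truncate=False)
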
swat1234 commented 11 months ago

Hi @jhchee, thanks for your response. We are mainly looking at compression using SNAPPY.

But snappy is increasing the file size.

github-actions[bot] commented 1 week ago

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.