Open swat1234 opened 12 months ago
I see that you configured "write.metadata.compression-codec": "gzip", but that controls compression of the table metadata files, not the individual data files. Also, is there any particular reason to set parquet.compression / write.compression / all the others? The setting that controls data file compression is write.parquet.compression-codec, which defaults to gzip.
I am trying to reduce the storage space of the files by applying Snappy or Gzip compression. I can see the metadata is getting compressed to gzip, but not the data files. Could you guide me on how to do it?
I would probably start by reducing the number of random table properties being set. As I mentioned earlier, the one that matters in your case is write.parquet.compression-codec, which defaults to gzip but can also be set to snappy or zstd. The other setting you can experiment with is write.parquet.compression-level.
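Snappy and zstd bindings aren't in the Python standard library, but the stdlib's zlib (the DEFLATE codec behind gzip) and lzma can sketch the trade-off a codec choice makes on repetitive, column-like data. This is a generic illustration of "stronger codec, smaller output, more CPU", not an Iceberg or Parquet benchmark:

```python
import zlib
import lzma

# Repetitive, column-like data; a rough stand-in for Parquet pages.
data = b"2024-01-01,store_42,SKU-1001,19.99\n" * 20_000

gz = zlib.compress(data, level=6)  # gzip's default level
xz = lzma.compress(data)           # a stronger codec, analogous to zstd at high levels

# The stronger codec yields smaller output on this data,
# at the cost of more CPU time.
print(len(data), len(gz), len(xz))
```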
We tried with only the write.parquet.compression-codec parameter set to snappy and then gzip, but it is not working. Instead of shrinking, the file size is increasing.
If you are only trying with sub-kilobyte files, the results will be bad. There are amortized per-file costs, and most of the file (the footer) will not be compressed. Try with larger files.
+1 to @RussellSpitzer's point. These files seem way too small for compression to play a significant role and be meaningful. Compression is most noticeable on significant amounts of "similar" data.
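That fixed-overhead point can be sketched with plain gzip from the Python standard library (generic compression behavior, not Parquet-specific): a payload smaller than the codec's framing gets bigger, while the same bytes repeated many times shrink dramatically.

```python
import gzip

tiny = b'{"id": 1}'        # a sub-kilobyte "file"
large = tiny * 100_000     # lots of similar data

tiny_gz = gzip.compress(tiny)
large_gz = gzip.compress(large)

# The tiny payload grows: gzip adds an 18-byte header/trailer
# on top of the (barely compressible) data.
print(len(tiny), len(tiny_gz))

# The large, repetitive payload shrinks dramatically.
print(len(large), len(large_gz))
```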
I have tried with huge data. Below are the outcomes.
File with UNCOMPRESSED codec: 1.8 GB
File with gzip codec: 1.8 GB
File with snappy codec: 2.8 GB
Can someone please advise?
@swat1234 The result with the UNCOMPRESSED codec looks unusual (it shouldn't be smaller than the snappy result). Are you sure you are using this config in your experiment? It should look like:
write.parquet.compression-codec=uncompressed
ZSTD may compress better than GZIP when using a higher level, e.g.:
write.parquet.compression-codec=zstd
write.parquet.compression-level=9
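The effect of the level knob can be sketched with the stdlib's zlib (gzip's underlying codec, used here as a stand-in since zstd is not in the standard library): higher levels spend more CPU to produce smaller output.

```python
import zlib

# Semi-repetitive data, standing in for a column of similar values.
data = b"".join(b"event=%d,status=OK\n" % (i % 50) for i in range(100_000))

fast = zlib.compress(data, level=1)  # fastest, larger output
best = zlib.compress(data, level=9)  # slowest, smallest output

print(len(data), len(fast), len(best))
```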
Hi @jhchee, thanks for your response. We are mainly looking to compress with SNAPPY, but snappy is increasing the file size.
Iceberg tables are not compressing parquet files in S3. When the below table parameters are used for compression, the file size increases compared with uncompressed. Can someone please assist with the same?
00000-0-0129ba78-17f6-466f-b57b-695c678d64d5-00001.parquet (682 bytes): "codec" : "UNCOMPRESSED"
00000-0-e6f22c0e-2e16-43aa-8a5f-efabee995876-00001.parquet: "codec" : "GZIP"
00000-0-36fd4aad-8c38-40f5-8241-78ffe4f0a032-00001.parquet: "codec" : "SNAPPY"
Table Properties:
"parquet.compression": "SNAPPY"
"read.parquet.vectorization.batch-size": "5000"
"read.split.target-size": "134217728"
"read.parquet.vectorization.enabled": "true"
"write.parquet.page-size-bytes": "1048576"
"write.parquet.row-group-size-bytes": "134217728"
"write_compression": "SNAPPY"
"write.parquet.compression-codec": "snappy"
"write.metadata.metrics.max-inferred-column-defaults": "100"
"write.parquet.compression-level": "4"
"write.target-file-size-bytes": "536870912"
"write.delete.target-file-size-bytes": "67108864"
"write.parquet.page-row-limit": "20000"
"write.format.default": "parquet"
"write.metadata.compression-codec": "gzip"
"write.compression": "SNAPPY"
Thanks in advance!!