apache / orc

Apache ORC - the smallest, fastest columnar storage for Hadoop workloads
https://orc.apache.org/
Apache License 2.0

ORC-1593: Set `orc.compression.zstd.level` to 3 by default #1760

Closed: dongjoon-hyun closed this 9 months ago

dongjoon-hyun commented 9 months ago

What changes were proposed in this pull request?

This PR aims to set orc.compression.zstd.level to 3 by default.

Why are the changes needed?

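To prevent a file-size regression from ORC 1.9.x. On the taxi benchmark data, ORC 2.0 at zstd level 1 writes a 334M file versus 299M for ORC 1.9, while level 3 matches the ORC 1.9 size.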

ORC 1.9

data/generated//taxi:
total 2196176
drwxr-xr-x  5 dongjoon  staff   160B Jan 17 08:02 .
drwxr-xr-x  5 dongjoon  staff   160B Jan 17 08:07 ..
-rw-r--r--  1 dongjoon  staff   299M Jan 17 08:03 orc.zstd

ORC 2.0

-rw-r--r--  1 dongjoon  staff   334M Jan 17 07:56 orc.zstd (level 1)
-rw-r--r--  1 dongjoon  staff   299M Jan 17 08:16 orc.zstd (level 3)
-rw-r--r--  1 dongjoon  staff   302M Jan 17 08:21 orc.zstd (level 4)
-rw-r--r--  1 dongjoon  staff   300M Jan 17 08:27 orc.zstd (level 5)
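
To make the comparison explicit, here is a minimal sketch that turns the file sizes listed above into a percentage change against the ORC 1.9 result. The function and variable names are made up for illustration and are not part of ORC. The sizes are the rounded `M` values printed by `ls`, so the percentages are approximate.

```python
# File sizes in MB, copied from the directory listings in this PR description.
# These are rounded values as printed by `ls -lh`, so results are approximate.
BASELINE_ORC_1_9_MB = 299

# zstd level -> file size in MB for ORC 2.0
ORC_2_0_SIZES_MB = {1: 334, 3: 299, 4: 302, 5: 300}


def relative_change_pct(size_mb, baseline_mb=BASELINE_ORC_1_9_MB):
    """Percent size change versus the baseline (positive means a larger file)."""
    return (size_mb - baseline_mb) / baseline_mb * 100.0


def summarize(sizes_mb):
    """Map each zstd level to its size change in percent, rounded to one decimal."""
    return {level: round(relative_change_pct(size), 1) for level, size in sizes_mb.items()}


if __name__ == "__main__":
    for level, pct in summarize(ORC_2_0_SIZES_MB).items():
        print(f"level {level}: {pct:+.1f}% vs ORC 1.9")
```

On this one dataset, level 1 is the only setting that is clearly larger than the ORC 1.9 file, and level 3 is the smallest of the levels tried.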

How was this patch tested?

Pass the CIs.

Was this patch authored or co-authored using generative AI tooling?

No.

cxzl25 commented 9 months ago

Thanks @dongjoon-hyun .

Both aircompressor (used by ORC) and zstd-jni (used by Parquet) default to compression level 3. I verified in the online environment that zstd-jni at level 3 is no worse than aircompressor at level 3.

https://github.com/airlift/aircompressor/blob/ca561c8214100b1e646a395c2683212419719dc8/src/main/java/io/airlift/compress/zstd/CompressionParameters.java#L26

https://github.com/apache/parquet-mr/blob/c82d5b471a558124b03e67759038661a046f5938/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/codec/ZstandardCodec.java#L52

dongjoon-hyun commented 9 months ago

Ya, thank you for checking.