cxzl25 closed this pull request 9 months ago
I changed the default level to 1 and ran a quick comparison with the generate benchmark. Level 1 still produces a smaller file.
JAVA (Aircompressor)
$ java -Dorc.compression.zstd.impl=java -jar core/target/orc-benchmarks-core-*-uber.jar generate data -f orc -d sales -s 100000
$ ls -alR data | tail -n3
-rw-r--r-- 1 dongjoon staff 10746324 Jan 16 14:23 orc.gz
-rw-r--r-- 1 dongjoon staff 12133885 Jan 16 14:23 orc.snappy
-rw-r--r-- 1 dongjoon staff 10642346 Jan 16 14:23 orc.zstd
ZSTD-JNI
$ java -jar core/target/orc-benchmarks-core-*-uber.jar generate data -f orc -d sales -s 100000
$ ls -alR data | tail -n3
-rw-r--r-- 1 dongjoon staff 10746324 Jan 16 14:23 orc.gz
-rw-r--r-- 1 dongjoon staff 12133885 Jan 16 14:23 orc.snappy
-rw-r--r-- 1 dongjoon staff 10543260 Jan 16 14:23 orc.zstd
Thank you, @cxzl25 and all!
Thanks for all the help!
After migrating a table from zlib to zstd, we get a compression ratio of 35% with aircompressor. By tuning a few zstd-jni parameters, we reach a compression ratio of 44%.
My bad. It seems that I introduced a regression in the Taxi data compression.
ORC 1.9
data/generated//taxi:
total 2196176
drwxr-xr-x 5 dongjoon staff 160B Jan 17 08:02 .
drwxr-xr-x 5 dongjoon staff 160B Jan 17 08:07 ..
-rw-r--r-- 1 dongjoon staff 299M Jan 17 08:03 orc.zstd
ORC 2.0
-rw-r--r-- 1 dongjoon staff 334M Jan 17 07:56 orc.zstd (level 1)
-rw-r--r-- 1 dongjoon staff 299M Jan 17 08:16 orc.zstd (level 3)
-rw-r--r-- 1 dongjoon staff 302M Jan 17 08:21 orc.zstd (level 4)
-rw-r--r-- 1 dongjoon staff 300M Jan 17 08:27 orc.zstd (level 5)
The zstd compression level behavior looks inconsistent on this dataset, so let me change the default zstd level back to 3, as in the original proposal.
What changes were proposed in this pull request?
Original PR: https://github.com/apache/orc/pull/988 Original author: @dchristle
This PR will support the use of zstd-jni library as the implementation of ORC zstd, with better performance than aircompressor. (https://github.com/apache/orc/pull/988#issuecomment-1884443205)
This PR also exposes the compression level and "long mode" settings to ORC users. These settings allow the user to select different speed/compression trade-offs that were not supported by the original aircompressor.
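As a sketch of how a user might set these new knobs: the benchmark commands above switch implementations with `-Dorc.compression.zstd.impl`, and the level and long-mode window settings presumably follow the same `orc.compression.zstd.*` property pattern. The exact key names and default values below are assumptions; check the `OrcConf` changes in this PR for the authoritative keys.

```
# Hadoop/ORC configuration sketch (key names assumed, see OrcConf in this PR)
orc.compress=ZSTD
orc.compression.zstd.impl=zstd-jni   # or "java" for the aircompressor path
orc.compression.zstd.level=3         # default proposed here; higher = smaller/slower
orc.compression.zstd.windowlog=27    # enables "long mode" matching over a larger window
```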
Why are the changes needed?
These changes make sense for a few reasons:
ORC users will gain all the improvements from the main zstd library. It is under active development and receives regular speed and compression improvements. In contrast, aircompressor's zstd implementation is older and no longer tracks upstream.
ORC users will be able to use the entire speed/compression tradeoff space. Today, aircompressor implements only one of zstd's eight compression strategies, so only a narrow range of faster but less compressive settings can be exposed to ORC users. ORC storage with high compression (e.g. for large-but-infrequently-used data) is a clear use case that this PR would unlock.
It will harmonize the Java ORC implementation with other projects in the Hadoop ecosystem. Parquet, Spark, and even the C++ ORC reader/writers all rely on the official zstd implementation either via zstd-jni or directly. In this way, the Java reader/writer code is an outlier.
Detecting and fixing bugs or regressions will generally happen much faster, given the larger user base and active developer communities of zstd and zstd-jni.
The largest tradeoff is that zstd-jni wraps compiled native code. That said, binaries for many processor architectures are already bundled into zstd-jni, so this should rarely be a hurdle.
How was this patch tested?
Was this patch authored or co-authored using generative AI tooling?
No