apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[Java] Slow LZ4 compression using Java Arrow 12.0.0 #35824

Open ebremer opened 1 year ago

ebremer commented 1 year ago

Describe the bug, including details regarding any error messages, version, and platform.

Compressing 500 MB of Arrow data took a few hours. A 5 GB selection ran for a couple of days and did not complete. The 500 MB file worked fine once it was written: I was able to read it back in and use it. Without compression, both datasets write out fine in a much shorter time. I am using the following code to set up my writer:

ArrowFileWriter writer = new ArrowFileWriter(root, null, Channels.newChannel(fos), new HashMap<>(), IpcOption.DEFAULT, CommonsCompressionFactory.INSTANCE, CompressionUtil.CodecType.LZ4_FRAME);

Running code with:

java -version
openjdk version "17.0.7" 2023-04-18
OpenJDK Runtime Environment GraalVM CE 22.3.2 (build 17.0.7+7-jvmci-22.3-b18)
OpenJDK 64-Bit Server VM GraalVM CE 22.3.2 (build 17.0.7+7-jvmci-22.3-b18, mixed mode, sharing)

Component(s)

Java

pitrou commented 1 year ago

@lidavidm @davisusanibar

emkornfield commented 1 year ago

This is a known issue. To my knowledge, there still isn't a canonical, performant framed LZ4 compressor/decompressor for Java.
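
One way to confirm that the codec, rather than file I/O, is the bottleneck is to time it in isolation on an in-memory buffer. The sketch below is only an illustration under stated assumptions: the class and method names (CodecThroughput, megabytesPerSecond, sampleData) are hypothetical, and because the JDK ships no LZ4 implementation it uses DEFLATE from java.util.zip as a stand-in codec. For comparison, 500 MB in a few hours works out to well under 1 MB/s.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Random;
import java.util.function.Function;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

// Hypothetical helper for timing a byte[] -> byte[] compressor in isolation.
class CodecThroughput {

    // Stand-in codec (assumption): JDK DEFLATE at its fastest level, NOT LZ4.
    static byte[] deflate(byte[] input) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream(input.length / 2 + 64);
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        try (DeflaterOutputStream out = new DeflaterOutputStream(sink, deflater)) {
            out.write(input);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            deflater.end(); // release native memory held by the deflater
        }
        return sink.toByteArray();
    }

    // Runs the codec once over the input and returns throughput in MB/s.
    static double megabytesPerSecond(Function<byte[], byte[]> codec, byte[] input) {
        long start = System.nanoTime();
        byte[] compressed = codec.apply(input);
        long elapsedNanos = Math.max(1, System.nanoTime() - start);
        if (compressed.length == 0) {
            throw new IllegalStateException("codec produced no output");
        }
        return (input.length / 1e6) / (elapsedNanos / 1e9);
    }

    // Builds compressible test data: random bytes drawn from an 8-symbol alphabet.
    static byte[] sampleData(int size) {
        Random random = new Random(42);
        byte[] data = new byte[size];
        for (int i = 0; i < size; i++) {
            data[i] = (byte) ('a' + random.nextInt(8));
        }
        return data;
    }

    public static void main(String[] args) {
        byte[] data = sampleData(16 * 1024 * 1024);
        megabytesPerSecond(CodecThroughput::deflate, data); // warm-up pass for the JIT
        double rate = megabytesPerSecond(CodecThroughput::deflate, data);
        System.out.printf("throughput: %.1f MB/s%n", rate);
    }
}
```

To measure the codec actually in question, the deflate method reference could be swapped for a function that writes through a framed LZ4 stream, for example Commons Compress's FramedLZ4CompressorOutputStream, assuming that library is on the classpath. A rate far below the disk's write speed would point at the compressor rather than the writer.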