Polepoint opened this issue 3 months ago (status: Open)
cc @pitrou @wgtmac is this covered by the spec, or have we hit a similar problem before? 🤔
We see here the problem with the "lz4-hadoop" codec: its spec only exists as Java source code. That's why it's superseded by LZ4_RAW.
IIRC, LZ4_HADOOP has been deprecated in the Parquet spec. We should prefer LZ4_RAW over LZ4_HADOOP at all times.
This could be handled like the code below, if you don't mind. If you want, you could add it here and also add a test file to parquet-testing? @Polepoint
:ok_hand: I will try.
@pitrou But which format corresponds to LZ4_RAW? It seems there is only LZ4, LZ4_FRAME, LZ4_HADOOP: https://github.com/apache/arrow/blob/9b27f42e02d9c4208698a324357cafaaa3e308ce/cpp/src/arrow/util/type_fwd.h#L46
Ah, sorry. LZ4_RAW is simply the same as LZ4 here.
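For concreteness, a minimal sketch of what that mapping means when writing from C++, using the standard parquet-cpp WriterProperties builder (the helper name is hypothetical):

```cpp
#include <memory>

#include <parquet/properties.h>

// Per the comment above, arrow::Compression::LZ4 is what the Parquet format
// calls LZ4_RAW; the legacy Hadoop framing is Compression::LZ4_HADOOP.
std::shared_ptr<parquet::WriterProperties> MakeLz4Properties() {
  return parquet::WriterProperties::Builder()
      .compression(parquet::Compression::LZ4)  // stored as LZ4_RAW in the file
      ->build();
}
```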
Describe the bug, including details regarding any error messages, version, and platform.
platform: Ubuntu 22.04, x86_64
arrow: release-17.0.0-rc2
According to https://github.com/apache/hadoop/blob/release-3.4.1-RC1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/compress/BlockCompressorStream.java#L82 and https://github.com/apache/hadoop/blob/release-3.4.1-RC1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/compress/Lz4Codec.java#L92, lz4-hadoop is implemented as a block compressor stream: the input may be split into blocks, and each block is compressed with LZ4 independently. The output looks like this (see the decoder sketch after the list):
- 4-byte big-endian uncompressed_size of all blocks
- 4-byte big-endian compressed_size of the following block
< lz4 compressed block >
- 4-byte big-endian compressed_size of the following block
< lz4 compressed block >
- 4-byte big-endian compressed_size of the following block
< lz4 compressed block >
... repeated until the uncompressed_size from the outer header is consumed ...
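For illustration, here is a minimal decoder sketch for this framing, written against the plain lz4 C API (LZ4_decompress_safe). It is not Arrow's implementation; the function names and error handling are placeholders, but the loop over inner blocks is what the format requires:

```cpp
#include <lz4.h>

#include <cstdint>
#include <stdexcept>
#include <vector>

// Read a 4-byte big-endian integer.
static uint32_t ReadBE32(const uint8_t* p) {
  return (uint32_t(p[0]) << 24) | (uint32_t(p[1]) << 16) |
         (uint32_t(p[2]) << 8) | uint32_t(p[3]);
}

// Decompress one Hadoop LZ4 frame: a 4-byte total uncompressed size followed
// by one or more (4-byte compressed size, lz4 block) pairs.
static std::vector<uint8_t> DecompressHadoopLz4(const uint8_t* in, size_t in_len) {
  if (in_len < 8) throw std::runtime_error("truncated header");
  const uint32_t total_uncompressed = ReadBE32(in);
  std::vector<uint8_t> out(total_uncompressed);

  size_t in_pos = 4;
  size_t out_pos = 0;
  // Keep consuming inner blocks until the declared uncompressed size is reached.
  while (out_pos < total_uncompressed) {
    if (in_pos + 4 > in_len) throw std::runtime_error("truncated block header");
    const uint32_t block_compressed = ReadBE32(in + in_pos);
    in_pos += 4;
    if (in_pos + block_compressed > in_len) throw std::runtime_error("truncated block");

    const int written = LZ4_decompress_safe(
        reinterpret_cast<const char*>(in + in_pos),
        reinterpret_cast<char*>(out.data() + out_pos),
        static_cast<int>(block_compressed),
        static_cast<int>(total_uncompressed - out_pos));
    if (written < 0) throw std::runtime_error("corrupt lz4 block");
    in_pos += block_compressed;
    out_pos += static_cast<size_t>(written);
  }
  return out;
}
```

The key point is that a single outer frame can contain several inner blocks, so a decoder that stops after the first (compressed_size, block) pair only handles the single-block case.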
The implementation of lz4-hadoop decompression in Arrow seems to accept only a single block: it returns kNotHadoop immediately if the maybe_decompressed_size of the first block is not equal to expected_decompressed_size (which is actually the total decompressed size of all blocks). https://github.com/apache/arrow/blob/release-17.0.0-rc2/cpp/src/arrow/util/compression_lz4.cc#L509
Code Example
Java write
C++ read
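A minimal sketch of the C++ read side, assuming a Parquet file produced on the Java side with the legacy Hadoop LZ4 codec (the file name is illustrative):

```cpp
#include <arrow/io/file.h>
#include <arrow/memory_pool.h>
#include <arrow/table.h>
#include <parquet/arrow/reader.h>
#include <parquet/exception.h>

#include <iostream>
#include <memory>

int main() {
  std::shared_ptr<arrow::io::ReadableFile> infile;
  PARQUET_ASSIGN_OR_THROW(infile,
                          arrow::io::ReadableFile::Open("lz4_hadoop.parquet"));

  std::unique_ptr<parquet::arrow::FileReader> reader;
  PARQUET_THROW_NOT_OK(
      parquet::arrow::OpenFile(infile, arrow::default_memory_pool(), &reader));

  std::shared_ptr<arrow::Table> table;
  // Fails with "Corrupt Lz4 compressed data." when the Hadoop frame
  // contains more than one inner LZ4 block.
  PARQUET_THROW_NOT_OK(reader->ReadTable(&table));
  std::cout << "rows: " << table->num_rows() << std::endl;
  return 0;
}
```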
Creating the parquet file with Java and reading it with C++ produces the error
Corrupt Lz4 compressed data.
from the Arrow status message.
Component(s)
C++