Closed TechnophobicLampshade closed 6 months ago
Judging by how things are done in compression.cc
[1], I think the fix is just a matter of adding all the lz4 codecs in the switch protected by an #ifdef ARROW_WITH_LZ4
.
[1] https://github.com/apache/arrow/blob/main/cpp/src/arrow/util/compression.cc#L187
Thanks @felipecrv . As it stands it looks like LZ4 (not LZ4_FRAME) is unconditionally supported (i.e. no ifdef) in IsSupportedCodec
right?
Thanks @felipecrv . As it stands it looks like LZ4 (not LZ4_FRAME) is unconditionally supported (i.e. no ifdef) in
IsSupportedCodec
right?
Yes, my bet is that is wrong and adding all LZ4*
formats protected by the ifdef
would be fine. Could you test it?
I'm not familiar with the Parquet code, but from what I saw, this should work fine.
Hmm, no. The compression formats supported by the Parquet format are listed here: https://github.com/apache/parquet-format/blob/master/Compression.md
LZ4_FRAME is not part of them. It is used for buffer compression in the Arrow IPC format: https://github.com/apache/arrow/blob/b8fff043c6cb351b1fad87fa0eeaf8dbc550e37c/format/Message.fbs#L45-L79
Agree with pitrou. The encoding is part of the spec. Why do you want LZ4_FRAME?
Ah, thanks all - I had confused LZ4_FRAME and LZ4_RAW. I read that 'LZ4' encoding was deprecated in Parquet, and that I should use LZ4_RAW. What enum value should I use with ParquetFileWriter
to write LZ4_RAW?
Just Compression::LZ4
. The legacy LZ4 format is obtained using Compression::LZ4_HADOOP
.
Doh. Thanks! I had assumed that Compression::LZ4
was the old one, given it matched the enum name in the thrift file.
So this issue can be closed, right?
Yes, thanks all!
Describe the bug, including details regarding any error messages, version, and platform.
I'm using the low-level parquet API to write data. When I run the above I get:
because it seems like LZ4_FRAME is not present in this code: https://github.com/apache/arrow/blob/5f75dbf13207b1ec6f4d922c37b9f8bffbd9c3fe/cpp/src/parquet/types.cc#L38
Despite that, I am writing data from pyarrow using the LZ4_FRAME encoding and c++ reader code using the
parquet::ParquetFileReader
API seems to handle it OK. I'm not familiar with the internals and why read/write would differ here. I wonder if there is anything preventing LZ4_FRAME from being added toIsSupportedCodec
?Component(s)
C++, Parquet