apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0
14.31k stars 3.48k forks source link

[Parquet][C++] Support writing LZ4_FRAME in parquet C++ low-level API #40213

Closed TechnophobicLampshade closed 6 months ago

TechnophobicLampshade commented 7 months ago

Describe the bug, including details regarding any error messages, version, and platform.

            parquet::WriterProperties::Builder builder;
            builder.compression(parquet::Compression::LZ4_FRAME);

            using FileClass = ::arrow::io::FileOutputStream;
            std::shared_ptr<FileClass> out_file = FileClass::Open(filename_).ValueOrDie();
            std::shared_ptr<parquet::WriterProperties> props = builder.build();
            parquetFile_ = parquet::ParquetFileWriter::Open(out_file, schema, props);

I'm using the low-level parquet API to write data. When I run the above I get:

terminate called after throwing an instance of 'parquet::ParquetException'
  what():  Codec type lz4 not supported in Parquet format

because it seems like LZ4_FRAME is not present in this code: https://github.com/apache/arrow/blob/5f75dbf13207b1ec6f4d922c37b9f8bffbd9c3fe/cpp/src/parquet/types.cc#L38

Despite that, I am writing data from pyarrow using the LZ4_FRAME encoding and c++ reader code using the parquet::ParquetFileReader API seems to handle it OK. I'm not familiar with the internals and why read/write would differ here. I wonder if there is anything preventing LZ4_FRAME from being added to IsSupportedCodec?

Component(s)

C++, Parquet

felipecrv commented 7 months ago

Judging by how things are done in compression.cc [1], I think the fix is just a matter of adding all the lz4 codecs in the switch protected by an #ifdef ARROW_WITH_LZ4.

[1] https://github.com/apache/arrow/blob/main/cpp/src/arrow/util/compression.cc#L187

TechnophobicLampshade commented 7 months ago

Thanks @felipecrv . As it stands it looks like LZ4 (not LZ4_FRAME) is unconditionally supported (i.e. no ifdef) in IsSupportedCodec right?

felipecrv commented 7 months ago

Thanks @felipecrv . As it stands it looks like LZ4 (not LZ4_FRAME) is unconditionally supported (i.e. no ifdef) in IsSupportedCodec right?

Yes, my bet is that is wrong and adding all LZ4* formats protected by the ifdef would be fine. Could you test it?

I'm not familiar with the Parquet code, but from what I saw, this should work fine.

pitrou commented 6 months ago

Hmm, no. The compression formats supported by the Parquet format are listed here: https://github.com/apache/parquet-format/blob/master/Compression.md

LZ4_FRAME is not part of them. It is used for buffer compression in the Arrow IPC format: https://github.com/apache/arrow/blob/b8fff043c6cb351b1fad87fa0eeaf8dbc550e37c/format/Message.fbs#L45-L79

mapleFU commented 6 months ago

Agree with pitrou. The encoding is part of the spec. Why do you want LZ4_FRAME?

TechnophobicLampshade commented 6 months ago

Ah, thanks all - I had confused LZ4_FRAME and LZ4_RAW. I read that 'LZ4' encoding was deprecated in Parquet, and that I should use LZ4_RAW. What enum value should I use with ParquetFileWriter to write LZ4_RAW?

pitrou commented 6 months ago

Just Compression::LZ4. The legacy LZ4 format is obtained using Compression::LZ4_HADOOP.

TechnophobicLampshade commented 6 months ago

Doh. Thanks! I had assumed that Compression::LZ4 was the old one, given it matched the enum name in the thrift file.

felipecrv commented 6 months ago

So this issue can be closed, right?

TechnophobicLampshade commented 6 months ago

Yes, thanks all!