apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[Format] Add body buffer compression option to IPC message protocol using LZ4 or ZSTD #15325

Closed. asfimport closed this issue 4 years ago

asfimport commented 7 years ago

If data is to be sent over the wire, it may be useful to compress the data buffers themselves as they're being written in the file layout.

I would propose that we keep this extremely simple with a global buffer compression setting in the file Footer. Probably the only two compressors worth supporting out of the box would be zlib (higher compression ratios) and lz4 (better performance).

What does everyone think?
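To make the idea concrete, here is a minimal sketch of compressing and restoring a single Arrow buffer from Python. It assumes pyarrow's standalone `pa.compress` / `pa.decompress` codec helpers and only illustrates per-buffer compression, not the IPC-level option proposed here.

```python
import pyarrow as pa

# An int64 array; buffers() returns [validity bitmap, values buffer].
arr = pa.array(range(1_000_000), type=pa.int64())
values_buf = arr.buffers()[1]

# Compress the raw values buffer with LZ4 (raw format, no framing).
compressed = pa.compress(values_buf, codec="lz4")

# The raw format carries no length header, so the uncompressed size
# has to be tracked alongside the compressed bytes.
restored = pa.decompress(compressed,
                         decompressed_size=len(values_buf),
                         codec="lz4")

assert restored.equals(values_buf)
print(f"{len(values_buf)} -> {len(compressed)} bytes")
```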

Reporter: Wes McKinney / @wesm Assignee: Wes McKinney / @wesm

Note: This issue was originally created as ARROW-300. Please see the migration documentation for further details.

asfimport commented 7 years ago

Wes McKinney / @wesm: It may make sense to limit to compressors designed for fast decompression performance: snappy, zstd, lz4. High compression ratios might be less interesting, but I'm interested in more feedback on use cases.

asfimport commented 7 years ago

Uwe Korn / @xhochy: +1 Compression makes sense to me and also the list of initial algorithms. High compression ratios probably only make sense once you have cross-datacenter traffic.

asfimport commented 7 years ago

Julien Le Dem / @julienledem: I'm thinking that we don't really need to compress each buffer independently; compression could just be an encapsulation at the transport level. It sounds like we don't want to exchange compressed buffers in memory, only when sending them over the wire or to disk.

In Parquet, columns can be decompressed independently because they can be retrieved independently. In Arrow, the entire RecordBatch corresponds to a request and will be compressed and decompressed in its entirety every time, which means we can just compress the entire batch together.

For simplicity I'd vote to not have compression in the Schema metadata. https://github.com/apache/arrow/blob/2f84493371bd8fae30b8e042984c9d6ba5419c5f/format/Message.fbs#L186 That's one less thing to worry about for implementors.

We can have compression at the transport level (RPC, file format, ...). As for the supported compressors, I would vote for SNAPPY and GZIP (zlib) to start with, as they provide the two options you describe (higher compression or higher throughput), and SNAPPY is easier to use from Java than LZO (lz4).
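For contrast, here is a rough sketch of the transport-level alternative Julien describes: an ordinary, uncompressed IPC stream is simply funneled through a compressed file stream, so nothing in the Arrow metadata changes. This assumes pyarrow's generic stream helpers; the file name is illustrative.

```python
import pyarrow as pa

batch = pa.record_batch([pa.array(range(10_000))], names=["x"])

# Write a plain IPC stream into a gzip-compressed file: the compression
# lives entirely in the transport layer, not in the Arrow format.
sink = pa.output_stream("batches.arrows.gz", compression="gzip")
writer = pa.ipc.new_stream(sink, batch.schema)
writer.write_batch(batch)
writer.close()
sink.close()

# Reading reverses the layers: decompress the transport, then parse the stream.
source = pa.input_stream("batches.arrows.gz", compression="gzip")
reader = pa.ipc.open_stream(source)
for b in reader:
    print(b.num_rows)
source.close()
```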

asfimport commented 7 years ago

Wes McKinney / @wesm: One issue with doing compression only at the transport level is if people use the Arrow memory layout and metadata to create file formats for storing larger amounts of data. For example, I would like to deprecate the Feather metadata https://github.com/wesm/feather/blob/master/cpp/src/feather/metadata.fbs and use only the Arrow metadata. Unless you support column/buffer-level compression, it would be expensive to read only a subset of the file. You could argue that such data should be stored as Parquet instead, but it does offer a flexibility that's really appealing (particularly since random access on memory-mapped Arrow-like data would be possible).

asfimport commented 7 years ago

Uwe Korn / @xhochy: I'm not so sure about the benefit of a compressed Arrow file format. For me the main distinction is that Parquet provides efficient storage (with the tradeoff of not being able to randomly access a single row) and Arrow provides random access, both for columnar data.

The one point where I see an Arrow file format as beneficial is where you need random access to the data but cannot load it fully into RAM, and instead use a memory-mapped file. If you add compression (either column-wise or at the whole-file level), you cannot memory-map it anymore.

The only point where I can see per-column compression for Arrow batches being better than compressing the whole batch is that it may actually produce better compression behaviour. Compression on a per-column basis can be parallelised independently of the underlying algorithm, leading to better CPU usage. Furthermore, compression may be more effective at the column level (with a sufficient number of rows), as the data inside a column is very similar, leading to smaller compression dictionaries and better compression ratios in the end. Both points are just assumptions that should be tested before being implemented.
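As a reference point for the memory-mapping argument, here is a small sketch (file name illustrative) of the zero-copy random access that an uncompressed Arrow file allows today, which is exactly what whole-file or per-column compression would give up:

```python
import pyarrow as pa

table = pa.table({"x": pa.array(range(1_000_000))})

# Write an ordinary, uncompressed Arrow IPC file in several record batches.
with pa.OSFile("data.arrow", "wb") as sink:
    writer = pa.ipc.new_file(sink, table.schema)
    for batch in table.to_batches(max_chunksize=100_000):
        writer.write_batch(batch)
    writer.close()

# Memory-map it: batches are read lazily and zero-copy, so touching one
# batch does not materialise the whole file in RAM.
mmap = pa.memory_map("data.arrow")
reader = pa.ipc.open_file(mmap)
print(reader.num_record_batches)
print(reader.get_batch(3).num_rows)  # random access to a single batch
```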

asfimport commented 7 years ago

Wes McKinney / @wesm: I'd like to add very simple LZ4/ZSTD compression support (at the buffer level only) in the stream and file binary formats. I can make a proposal, but I want to see if anyone has more opinions before I do.

asfimport commented 7 years ago

Kazuaki Ishizaki / @kiszk: Current Apache Spark supports the following compression schemes for in-memory columnar storage. Currently, compressed in-memory columnar storage is used when the DataFrame.cache or Dataset.cache method is executed. Would it be possible to support these schemes in addition to LZ4/(current) DictionaryEncoding?

asfimport commented 7 years ago

Uwe Korn / @xhochy: Adding methods like RLE- or Delta-encoding brings us very much into the space of Parquet. Given that some of these methods are really fast, it might make sense to support them for IPC. But then I fear that we will end up in a region where there is no longer a clear distinction between Arrow and Parquet.

asfimport commented 7 years ago

Wes McKinney / @wesm: @kiszk I agree that having in-memory compression schemes like in Spark is a good idea, in addition to simpler snappy/lz4/zlib buffer compression. Would you like to make a proposal for improvements to the Arrow metadata to support these compression schemes? We should indicate that Arrow implementations are not required to implement these in general, so for now they can be marked as experimental and optional for implementations (e.g. we wouldn't necessarily integration test them). For scan-based in-memory columnar workloads, these encodings can yield better scan throughput because of better cache efficiency, and many column-oriented databases rely on this to be able to achieve high performance, so having it natively in the Arrow libraries seems useful.
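As a point of reference for the kind of lightweight encodings being discussed, dictionary encoding already exists as a first-class Arrow type; a minimal pyarrow illustration (output shown in comments is approximate):

```python
import pyarrow as pa

# A low-cardinality string column: a natural candidate for dictionary encoding.
cities = pa.array(["tokyo", "osaka", "tokyo", "tokyo", "kyoto", "osaka"])

encoded = cities.dictionary_encode()
print(encoded.type)        # dictionary<values=string, indices=int32, ...>
print(encoded.dictionary)  # unique values: ["tokyo", "osaka", "kyoto"]
print(encoded.indices)     # small integer codes, cheap to scan

# The decoded values round-trip to the original dense representation.
assert encoded.to_pylist() == cities.to_pylist()
```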

asfimport commented 7 years ago

Kazuaki Ishizaki / @kiszk: @wesm Thank you for your kind and positive comment. I will work on preparing a proposal (it may take some time since I also have to prepare a presentation for GTC). @xhochy IIUC, Parquet is used as a persistent file format, while Arrow is used as an in-memory format.

What level of proposal do you expect? For example,

Also, will that proposal be posted into another JIRA entry or a comment in this JIRA entry?

asfimport commented 7 years ago

Wes McKinney / @wesm: I'm sorry for the delay. With the 0.3 Arrow release done, it would be good to make a push on compression and encoding.

How about we start a Google Document that supports public comments, and you can give edit access to whomever you like? Once we agree on the design, one of us can make a pull request containing the Flatbuffer metadata for the compression / encoding details. Does that sound good?

asfimport commented 7 years ago

Kazuaki Ishizaki / @kiszk: Thank you for your response. I was also busy preparing materials for GTC. It is a good time to make a document now.

It sounds good to prepare a Google document for collecting public comments. I will start creating a document covering purpose, scope, and design.

asfimport commented 6 years ago

Wes McKinney / @wesm: We have had all of the pieces we need for this in place in C++ since 0.6.0. I will propose metadata extensions to support compressed record batches and a trial implementation.

asfimport commented 6 years ago

Wes McKinney / @wesm: Moving to 0.9.0. I don't think I can get to this in the next 2 weeks. Help would be appreciated.

asfimport commented 6 years ago

Lawrence Chan / @llchan: What did we decide with this? Imho there's still a use case for compressed Arrow files due to the limited storage types in Parquet. I don't really love the idea of storing 8-bit or 16-bit ints in an INT32 and hand-waving it away with compression. I tried to hack it up with FixedLenByteArray, but there is a slew of complications with that, not to mention alignment concerns etc.

Anyway, I'm happy to help with this, but I'm not familiar enough with the code base to place it in the right spot. If we make a branch with some TODOs/placeholders, I can probably plug in more easily.

asfimport commented 6 years ago

Wes McKinney / @wesm: We haven't done any work on this yet. I think the first step would be to propose additional metadata (in the Flatbuffers files) for record batches to indicate the style of compression.

asfimport commented 5 years ago

Wes McKinney / @wesm: Moving this to 0.12. I will make a proposal for compressed record batches after the 0.11 release goes out.

My gut instinct on this would be to create a CompressedBuffer metadata type and a CompressedRecordBatch message. Some reasons:

asfimport commented 4 years ago

Yuan Zhou / @zhouyuan: Hi @wesm, thanks for providing the general idea; I'm quite interested in this feature. Do you happen to have any updates on the detailed proposal?

Cheers, -yuan

asfimport commented 4 years ago

Wes McKinney / @wesm: There are some ongoing discussions at https://lists.apache.org/thread.html/a99124e57c14c3c9ef9d98f3c80cfe1dd25496bf3ff7046778add937@%3Cdev.arrow.apache.org%3E

asfimport commented 4 years ago

Wes McKinney / @wesm: Issue resolved by pull request 6707 https://github.com/apache/arrow/pull/6707
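For anyone landing here later, a short sketch of how the option added by that pull request is exposed in pyarrow. Parameter spellings are as in recent releases and should be treated as illustrative; the file name is made up.

```python
import pyarrow as pa

batch = pa.record_batch([pa.array(range(100_000))], names=["x"])

# Request ZSTD compression of the body buffers of each written record batch.
options = pa.ipc.IpcWriteOptions(compression="zstd")

with pa.OSFile("compressed.arrow", "wb") as sink:
    writer = pa.ipc.new_file(sink, batch.schema, options=options)
    writer.write_batch(batch)
    writer.close()

# Readers detect the compression from the record batch metadata and
# decompress transparently; no extra read options are needed.
with pa.OSFile("compressed.arrow", "rb") as source:
    print(pa.ipc.open_file(source).read_all().num_rows)
```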