apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[JS] Implement IPC RecordBatch body buffer compression from ARROW-300 #24833

Open asfimport opened 4 years ago

asfimport commented 4 years ago

This may not be a hard requirement for JS, because it would require pulling in implementations of LZ4 and ZSTD, which not all users may want.

Reporter: Wes McKinney / @wesm

PRs and other links:

Note: This issue was originally created as ARROW-8674. Please see the migration documentation for further details.

asfimport commented 3 years ago

eric mauviere: I strongly support this because, while library size is a concern, it is a small one: loading a 10 MB LZ4-compressed Arrow file rather than a 100 MB one is much more crucial!

asfimport commented 3 years ago

Dominik Moritz / @domoritz: Yes, but at the same time someone might want to use Arrow with a small file. I don't think we want to increase the bundle size for everyone. I would prefer an optional (external) module instead if the bundle size increases significantly. I think I would like to see some numbers (file sizes) before making the call one way or another.

asfimport commented 2 years ago

Kyle Barron: Hello! I'd like to revisit this issue and potentially submit a PR for this.

I think there are various reasons why we might not want to pull in LZ4 and ZSTD implementations by default:

asfimport commented 2 years ago

Dominik Moritz / @domoritz: Looking at lz4js, it's so small (https://cdn.jsdelivr.net/npm/lz4js@0.2.0/lz4.min.js) that it's probably okay to pull in a dependency by default. I agree that having some system to register a different decompress function could be nice. lz4js is a bit old so we would want to carefully look at the available libraries. It would be nice to have some out of the box support. To avoid increasing bundle sizes, we can decide which functions actually use the decompression library.

Could you look at the available js libraries and see what their sizes are? Also, is lz4 or zstd much more common than the other?

We also should look into how much benefit we actually get from compression since most servers already support transparent gzip compression and so compressing an already compressed file will just incur overhead.

If the libraries are too heavy, we can think about a plugin system. We could make our registry be synchronous.

I definitely don't want to pull in wasm into the library as it will break people's workflows.
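
For illustration only, here is a minimal sketch of what such a synchronous registry could look like; none of these names (CompressionType, Codec, registerCodec, getCodec) exist in Arrow JS today:

// Hypothetical sketch of a synchronous codec registry for Arrow JS.
// None of these names exist in the library today; they are illustrative only.

export type CompressionType = 'lz4_frame' | 'zstd';

export interface Codec {
  // Both functions are synchronous so the IPC reader/writer can stay synchronous.
  compress(data: Uint8Array): Uint8Array;
  decompress(data: Uint8Array): Uint8Array;
}

const codecs = new Map<CompressionType, Codec>();

// Register an implementation, e.g. one backed by lz4js or a user-supplied wasm build.
export function registerCodec(type: CompressionType, codec: Codec): void {
  codecs.set(type, codec);
}

// Looked up by the IPC machinery when a RecordBatch declares body compression.
export function getCodec(type: CompressionType): Codec {
  const codec = codecs.get(type);
  if (!codec) {
    throw new Error(`No codec registered for ${type}; call registerCodec() first`);
  }
  return codec;
}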

asfimport commented 2 years ago

Kyle Barron: Thanks for the feedback!

We also should look into how much benefit we actually get from compression since most servers already support transparent gzip compression and so compressing an already compressed file will just incur overhead.

I think there are several reasons why it's important to support compressed files:

  • Popular tools in the ecosystem write data with compression turned on by default. I'm specifically looking at PyArrow/Pandas, which write LZ4-compressed files by default. If a web app wants to display Arrow data from unknown sources, having some way to load all files is ideal.
  • It's true that servers usually offer transparent gzip compression, but there are reasons why a user wouldn't want that. For one, gzip compression is much slower than LZ4 or ZSTD compression. In the example below, using this file, writing a 753 MB Arrow table to a memory buffer uncompressed and then compressing it with the standard library's gzip.compress took 2m46s. The Python interface is slower than the gzip command line, but time gzip -c uncompressed_table.arrow > /dev/null still took 36s. Meanwhile, LZ4 output took only 1.48s and ZSTD output took only 1.63s. In this example, the LZ4 file was 75% larger than the gzip file, but the ZSTD one was 6% smaller than the gzip one. Of course this is just one example, but it at least gives credence to cases where a developer would prefer LZ4 or ZSTD over gzip.
  • I think supporting compression in tableToIPC would be quite valuable for any use case where an app wants to push Arrow data to a server.

Looking at lz4js, it's so small that it's probably okay to pull in a dependency by default.

Wow, that is impressively small. It might make sense to pull that in by default. The issue tracker is mostly empty, though there is one report of data compressed by lz4js not being readable by other tools.

I definitely don't want to pull in wasm into the library as it will break people's workflows.

I agree. I'm fine with not pulling in a wasm library by default.

Could you look at the available js libraries and see what their sizes are? Also, is lz4 or zstd much more common than the other?

None of the ZSTD libraries I came across were pure JS. The only pure-JS LZ4 implementation was lz4js. Aside from something like trying to transpile wasm to JS, which I think would be too complex for Arrow JS, the only workable default I see is using lz4js while also supporting a registry. I don't know whether LZ4 or ZSTD is more common; LZ4 is the default for PyArrow when writing a table.

If the libraries are too heavy, we can think about a plugin system. We could make our registry be synchronous.

I think it would be possible to force the compress and decompress functions in the plugin system to be synchronous. That would just require the user to finish any async initialization before trying to read or write a file, since wasm bundles can't be instantiated synchronously, I think.
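
To make that constraint concrete, here is a small sketch (building on the hypothetical registry above) of finishing a codec's async initialization, e.g. wasm instantiation, before registering its synchronous functions; registerAsyncInitializedCodec and the module path are made up for illustration:

// Hypothetical: a wasm-backed codec must finish its async instantiation before
// it can be registered with a synchronous registry like the one sketched above.
import { Codec, CompressionType, registerCodec } from './compression-registry'; // hypothetical module

export async function registerAsyncInitializedCodec(
  type: CompressionType,
  init: () => Promise<Codec>, // e.g. instantiates a wasm bundle and wraps its exports
): Promise<void> {
  const codec = await init(); // the async work happens up front, once...
  registerCodec(type, codec); // ...so compress/decompress stay synchronous afterwards
}

// Usage sketch: await registerAsyncInitializedCodec('zstd', loadWasmZstd) at startup,
// where loadWasmZstd resolves to a { compress, decompress } pair; only then read/write IPC.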

 


Example of writing the table to a buffer uncompressed, then compressing it with gzip.compress from the Python standard library (gzip and pyarrow as pa are assumed to be imported, with table already defined):


In [37]: %%time
    ...: options = pa.ipc.IpcWriteOptions(compression=None)
    ...: with pa.BufferOutputStream() as buf:
    ...:     with pa.ipc.new_stream(buf, table.schema, options=options) as writer:
    ...:         writer.write_table(table)
    ...:
    ...:     reader = pa.BufferReader(buf.getvalue())
    ...:     reader.seek(0)
    ...:     out = gzip.compress(reader.read())
    ...:     print(len(out))
    ...:
175807183
CPU times: user 2min 41s, sys: 1.74 s, total: 2min 43s
Wall time: 2min 46s

Example of writing table to buffer with lz4 compression:


In [40]: %%time
    ...: options = pa.ipc.IpcWriteOptions(compression='lz4')
    ...: with pa.BufferOutputStream() as buf:
    ...:     with pa.ipc.new_stream(buf, table.schema, options=options) as writer:
    ...:         writer.write_table(table)
    ...:
    ...:     print(buf.tell())
313078576
CPU times: user 1.48 s, sys: 322 ms, total: 1.81 s
Wall time: 1.48 s

Example of writing table to buffer with zstd compression:


In [41]: %%time
    ...: options = pa.ipc.IpcWriteOptions(compression='zstd')
    ...: with pa.BufferOutputStream() as buf:
    ...:     with pa.ipc.new_stream(buf, table.schema, options=options) as writer:
    ...:         writer.write_table(table)
    ...:
    ...:     print(buf.tell())
166563176
CPU times: user 2.28 s, sys: 178 ms, total: 2.45 s
Wall time: 1.63 s
asfimport commented 2 years ago

Dominik Moritz / @domoritz:

For one, gzip compression is much slower than LZ4 or ZSTD compression.

Maybe. Let's make sure to compare native gzip compression that a web server uses with js lz4/zstd compression.

I think it would be possible to force the compress and decompress functions in the plugin system to be synchronous. That would just force the user to finish any async initialization before trying to read/write a file, since wasm bundles can't be instantiated synchronously I think.

It would unfortunately also preclude people from putting decompression into a worker. Maybe we can make the relevant IPC methods return promises when the compression/decompression method is async (returns a promise).
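
For context on why worker-based decompression forces an async API, here is a rough sketch of a Promise-returning decompress that delegates to a Web Worker; the worker script and message shape are assumptions, not part of Arrow JS:

// Hypothetical: decompression delegated to a Web Worker is necessarily async,
// so it only fits a registry/API that accepts Promise-returning codecs.
const worker = new Worker(new URL('./lz4-worker.js', import.meta.url)); // assumed worker script

let nextId = 0;
const pending = new Map<number, (result: Uint8Array) => void>();

worker.onmessage = (event: MessageEvent<{ id: number; data: Uint8Array }>) => {
  pending.get(event.data.id)?.(event.data.data);
  pending.delete(event.data.id);
};

// Async decompress: posts the buffer to the worker and resolves with the result.
export function decompressInWorker(buffer: Uint8Array): Promise<Uint8Array> {
  return new Promise((resolve) => {
    const id = nextId++;
    pending.set(id, resolve);
    // Transfer the underlying ArrayBuffer to avoid copying large record batch bodies.
    worker.postMessage({ id, data: buffer }, [buffer.buffer as ArrayBuffer]);
  });
}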

None of the ZSTD libraries I came across were pure JS. The only LZ4 one that was pure JS was lz4js.

We could consider inlining the wasm code with base64 if it's tiny but I suspect it will not. Worth considering, though.
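
For reference, inlining wasm as base64 typically amounts to decoding the string back to bytes and instantiating it; a hedged sketch with a placeholder payload (WASM_BASE64, loadInlinedWasm are illustrative names):

// Hypothetical sketch of shipping a wasm codec inlined as a base64 string.
// WASM_BASE64 stands in for the actual module bytes.
const WASM_BASE64 = '...';

function decodeBase64(b64: string): Uint8Array {
  const B = (globalThis as any).Buffer; // Node path
  if (B) return new Uint8Array(B.from(b64, 'base64'));
  const binary = atob(b64); // browser path
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) bytes[i] = binary.charCodeAt(i);
  return bytes;
}

// Instantiation is still async even when the bytes are inlined,
// which is part of why a purely synchronous default codec is hard with wasm.
export async function loadInlinedWasm(): Promise<WebAssembly.Instance> {
  const { instance } = await WebAssembly.instantiate(decodeBase64(WASM_BASE64));
  return instance;
}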

Anyway, I think it makes sense to work on this and send a pull request. We should definitely have a way to pass in/register compression algorithms. Then let's look into whether we want to bundle any algorithms. Let's start with lz4 and try a few libraries (e.g. https://github.com/gorhill/lz4-wasm, https://github.com/Benzinga/lz4js, https://github.com/pierrec/node-lz4). If they are small enough, I would consider including a default lz4 implementation. Sounds good?

asfimport commented 2 years ago

Dominik Moritz / @domoritz: https://github.com/manzt/numcodecs.js looks interesting as well. It uses wasm-inlined lz4.

asfimport commented 2 years ago

Kyle Barron:

 Maybe. Let's make sure to compare native gzip compression that a web server uses with js lz4/zstd compression.

I'm most familiar with FastAPI, which is probably the third most popular Python web framework after Django and Flask. Its suggested gzip middleware uses the standard library's gzip implementation, so I don't think my example above was completely out of place. The lzbench native benchmarks still show lz4 and zstd as 4-6x faster than zlib.

But I think these performance discussions are more of a side discussion; given that the Arrow IPC format allows for compression, I'd love to find a way for Arrow JS to support these files.

It would unfortunately also preclude people from putting decompression into a worker. Maybe we can make the relevant IPC methods return return promises when the compression/decompression method is async (returns a promise).

That's a very good point. If we implement a registry of some sort, we could consider allowing both sync and async compression. Then the RecordBatchReader could use the sync codecs and the AsyncRecordBatchReader could use the async ones, so a user who wants to run de/compression on a worker could use the AsyncRecordBatchReader. Not sure if that's a great idea, but having a synchronous tableFromIPC option is nice.
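
A rough sketch of that split, distinct from the simpler sync-only registry sketched earlier; all names are illustrative and mirror nothing in the current Arrow JS API:

// Hypothetical sketch: register sync and async codecs separately, so the
// synchronous reader path requires a sync codec while the async path accepts either.
export interface SyncCodec {
  decompress(data: Uint8Array): Uint8Array;
}

export interface AsyncCodec {
  decompress(data: Uint8Array): Promise<Uint8Array>;
}

const syncCodecs = new Map<string, SyncCodec>();
const asyncCodecs = new Map<string, AsyncCodec>();

export function registerSyncCodec(type: string, codec: SyncCodec): void {
  syncCodecs.set(type, codec);
}

export function registerAsyncCodec(type: string, codec: AsyncCodec): void {
  asyncCodecs.set(type, codec);
}

// The synchronous RecordBatchReader path would call this: only a sync codec will do.
export function requireSyncCodec(type: string): SyncCodec {
  const codec = syncCodecs.get(type);
  if (!codec) throw new Error(`Synchronous reads need a sync ${type} codec`);
  return codec;
}

// The AsyncRecordBatchReader path could prefer an async codec (e.g. worker-backed)
// and fall back to a sync one.
export function requireAnyCodec(type: string): AsyncCodec | SyncCodec {
  const codec = asyncCodecs.get(type) ?? syncCodecs.get(type);
  if (!codec) throw new Error(`No ${type} codec registered`);
  return codec;
}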

If they are small enough, I would consider including a default lz4 implementation. Sounds good?

Sounds good! I'll try to find time soon to put up a draft.

vivek1729 commented 6 months ago

Hey folks! Great to see the discussion on this issue. We've been looking at using Arrow to send some data over the wire from C# to JS. Right now, when we try to read record batches from the stream, we see the error "Record batch compression not implemented".

It seems like the JS library is the only one missing buffer compression support compared to the other implementations. This essentially blocks interop when passing Arrow data with compression enabled from another language to JS.

What's the extent of compression support in the JS library right now? I did see a file called body-compression.ts, but I'm not sure if it's more of a stub at this point. Is this support planned to be added any time soon? It would complete the IPC compatibility matrix across the language implementations. Alternatively, if we wanted to support decompressing buffers in JS, what's the best way to approach that?
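
As a starting point while the feature is missing: per the Arrow IPC format, when a RecordBatch declares body compression each body buffer is compressed individually and prefixed with an 8-byte little-endian uncompressed length, with -1 meaning the bytes were left uncompressed. A hedged sketch of that per-buffer step, with the actual LZ4-frame or ZSTD implementation passed in by the caller:

// Sketch of decompressing a single Arrow IPC body buffer, following the format spec:
// each buffer is prefixed with an int64 little-endian uncompressed length,
// and a prefix of -1 means the remaining bytes were not actually compressed.
export function decompressBodyBuffer(
  buffer: Uint8Array,
  decompress: (data: Uint8Array) => Uint8Array, // your LZ4-frame or ZSTD implementation
): Uint8Array {
  const view = new DataView(buffer.buffer, buffer.byteOffset, buffer.byteLength);
  const uncompressedLength = view.getBigInt64(0, /* littleEndian */ true);
  const body = buffer.subarray(8);
  if (uncompressedLength === -1n) {
    return body; // stored uncompressed despite the batch-level compression flag
  }
  const out = decompress(body);
  if (BigInt(out.byteLength) !== uncompressedLength) {
    throw new Error('Decompressed length does not match the buffer prefix');
  }
  return out;
}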