Open asfimport opened 4 years ago
eric mauviere: I strongly support this. Library size may be a concern, but a small one: loading a 10 MB LZ4-compressed Arrow file rather than a 100 MB uncompressed one is much more crucial!
Dominik Moritz / @domoritz: Yes, but at the same time someone might want to use Arrow with a small file. I don't think we want to increase the bundle size for everyone. I would prefer an optional (external) module instead if the bundle size increases significantly. I think I would like to see some numbers (file sizes) before making the call one way or another.
Kyle Barron: Hello! I'd like to revisit this issue and potentially submit a PR for this.
I think there are various reasons why we might not want to pull in LZ4 and ZSTD implementations by default. At least one LZ4 implementation is pure JS, with no WASM components, and some users may prefer a pure-JS library for simplicity.
How would others feel about a codec registry system? Something like what Zarr.js allows, where you can dynamically register codecs on demand.
The `arrow.tableFromIPC` function is currently synchronous, so unless we changed that function to be async, we wouldn't be able to import the codec after seeing that a data file uses a given compression, because a dynamic import would have to be async.
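A registry along those lines could be as simple as a map from codec name to a synchronous decode function. The sketch below is hypothetical (none of these names are actual Arrow JS API) and assumes codecs are registered ahead of time, which is what keeps `tableFromIPC` synchronous:

```typescript
// Hypothetical codec registry sketch; not the actual Arrow JS API.
type DecompressFn = (data: Uint8Array) => Uint8Array;

const codecRegistry = new Map<string, DecompressFn>();

// Users register a codec up front, before reading any IPC data.
function registerCodec(name: string, decompress: DecompressFn): void {
  codecRegistry.set(name, decompress);
}

// The IPC reader looks up the codec named in the file's metadata.
function getCodec(name: string): DecompressFn {
  const codec = codecRegistry.get(name);
  if (!codec) {
    throw new Error(`Record batch compression not implemented: ${name}`);
  }
  return codec;
}

// Example: register an identity placeholder. A real app would register
// an actual LZ4 decompress function (e.g. from lz4js) here.
registerCodec('LZ4_FRAME', (data) => data);
```

Because registration happens before any file is read, the lookup inside the reader stays synchronous and no dynamic import is needed at decode time.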
In terms of implementation, I'd expect it to be relatively straightforward: presumably we'd update `decodeBuffers` here: https://github.com/apache/arrow/blob/b67e3c8ef1e173e1840c4fa897b7c6c493932e10/js/src/ipc/metadata/message.ts#L303.
References:
- LZ4 implementations:
  - https://github.com/Benzinga/lz4js
- ZSTD implementations:
Dominik Moritz / @domoritz: Looking at lz4js, it's so small (https://cdn.jsdelivr.net/npm/lz4js@0.2.0/lz4.min.js) that it's probably okay to pull in a dependency by default. I agree that having some system to register a different decompress function could be nice. lz4js is a bit old so we would want to carefully look at the available libraries. It would be nice to have some out of the box support. To avoid increasing bundle sizes, we can decide which functions actually use the decompression library.
Could you look at the available js libraries and see what their sizes are? Also, is lz4 or zstd much more common than the other?
We also should look into how much benefit we actually get from compression since most servers already support transparent gzip compression and so compressing an already compressed file will just incur overhead.
If the libraries are too heavy, we can think about a plugin system. We could make our registry be synchronous.
I definitely don't want to pull in wasm into the library as it will break people's workflows.
Kyle Barron: Thanks for the feedback!
> We also should look into how much benefit we actually get from compression since most servers already support transparent gzip compression and so compressing an already compressed file will just incur overhead.

I think there are several reasons why it's important to support compressed files:
- Popular tools in the ecosystem write data with compression turned on by default. I'm specifically looking at Pyarrow/Pandas, which writes LZ4-compressed files by default. If a web app wants to display arrow data from unknown sources, having some way to load all files is ideal.
- It's true that servers usually offer transparent gzip compression, but there are reasons why a user wouldn't want that. For one, gzip compression is much slower than LZ4 or ZSTD compression. In the example below using this file, writing a 753 MB Arrow table to a memory buffer uncompressed and then compressing it with the standard library's `gzip.compress` took 2m46s. The Python interface is slower than the gzip command line, but `time gzip -c uncompressed_table.arrow > /dev/null` still took 36s. Meanwhile, LZ4 output took only 1.48s and ZSTD output took only 1.63s. In this example, the LZ4 file was 75% larger than the gzip file, but the ZSTD one was 6% smaller than the gzip one. Of course this is just one example, but it at least gives credence to times when a developer would prefer lz4 or zstd over gzip.
- I think supporting compression in `tableToIPC` would be quite valuable for any use case where an app wants to push Arrow data to a server.

> Looking at lz4js, it's so small that it's probably okay to pull in a dependency by default.

Wow, that is impressively small. It might make sense to pull it in by default. The issue tracker is mostly empty, though there is one report of data compressed by lz4js not being readable by other tools.

> I definitely don't want to pull in wasm into the library as it will break people's workflows.

I agree. I'm fine with not pulling in a wasm library by default.

> Could you look at the available js libraries and see what their sizes are? Also, is lz4 or zstd much more common than the other?

None of the ZSTD libraries I came across were pure JS. The only pure-JS LZ4 library was lz4js. Aside from something like trying to transpile wasm to JS, which I think would be too complex for Arrow JS, the only possible default I see is lz4js, alongside a registry. I don't know whether LZ4 or ZSTD is more common; LZ4 is the default for Pyarrow when writing a table.

> If the libraries are too heavy, we can think about a plugin system. We could make our registry be synchronous.

I think it would be possible to force the `compress` and `decompress` functions in the plugin system to be synchronous. That would just force the user to finish any async initialization before trying to read/write a file, since wasm bundles can't be instantiated synchronously, I think.
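The "finish async initialization first" pattern could be sketched roughly as follows (all names here are hypothetical; the stub stands in for a real WASM LZ4 module's instantiation):

```typescript
// Hypothetical sketch: a codec backed by async (e.g. WASM) initialization
// that exposes synchronous compress/decompress functions once ready.
type SyncCodec = {
  compress: (data: Uint8Array) => Uint8Array;
  decompress: (data: Uint8Array) => Uint8Array;
};

// Filled in once async initialization completes.
let lz4Codec: SyncCodec | null = null;

// Stand-in for awaiting a real WASM module's instantiation.
async function initLz4(): Promise<void> {
  lz4Codec = {
    compress: (d) => d,    // placeholder: a real library's sync compress
    decompress: (d) => d,  // placeholder: a real library's sync decompress
  };
}

// Synchronous entry point the IPC reader could call.
function decompressSync(data: Uint8Array): Uint8Array {
  if (lz4Codec === null) {
    throw new Error('LZ4 codec not initialized; await initLz4() first');
  }
  return lz4Codec.decompress(data);
}
```

The app awaits `initLz4()` once at startup, after which every read path stays synchronous.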
Example of writing table to buffer uncompressed, then using `gzip.compress` from the Python standard library:
In [37]: %%time
...: options = pa.ipc.IpcWriteOptions(compression=None)
...: with pa.BufferOutputStream() as buf:
...: with pa.ipc.new_stream(buf, table.schema, options=options) as writer:
...: writer.write_table(table)
...:
...: reader = pa.BufferReader(buf.getvalue())
...: reader.seek(0)
...: out = gzip.compress(reader.read())
...: print(len(out))
...:
175807183
CPU times: user 2min 41s, sys: 1.74 s, total: 2min 43s
Wall time: 2min 46s
Example of writing table to buffer with lz4 compression:
In [40]: %%time
...: options = pa.ipc.IpcWriteOptions(compression='lz4')
...: with pa.BufferOutputStream() as buf:
...: with pa.ipc.new_stream(buf, table.schema, options=options) as writer:
...: writer.write_table(table)
...:
...: print(buf.tell())
313078576
CPU times: user 1.48 s, sys: 322 ms, total: 1.81 s
Wall time: 1.48 s
Example of writing table to buffer with zstd compression:
In [41]: %%time
...: options = pa.ipc.IpcWriteOptions(compression='zstd')
...: with pa.BufferOutputStream() as buf:
...: with pa.ipc.new_stream(buf, table.schema, options=options) as writer:
...: writer.write_table(table)
...:
...: print(buf.tell())
166563176
CPU times: user 2.28 s, sys: 178 ms, total: 2.45 s
Wall time: 1.63 s
Dominik Moritz / @domoritz:
> For one, gzip compression is much slower than LZ4 or ZSTD compression.
Maybe. Let's make sure to compare native gzip compression that a web server uses with js lz4/zstd compression.
> I think it would be possible to force the `compress` and `decompress` functions in the plugin system to be synchronous. That would just force the user to finish any async initialization before trying to read/write a file, since wasm bundles can't be instantiated synchronously I think.
It would unfortunately also preclude people from putting decompression into a worker. Maybe we can make the relevant IPC methods return promises when the compression/decompression method is async (returns a promise).
> None of the ZSTD libraries I came across were pure JS. The only LZ4 one that was pure JS was lz4js.
We could consider inlining the wasm code with base64 if it's tiny but I suspect it will not. Worth considering, though.
Anyway, I think it makes sense to work on this and send a pull request. We should definitely have a way to pass in/register compression algorithms. Then let's look into whether we want to bundle any algorithms. Let's start with lz4 and try a few libraries (e.g. https://github.com/gorhill/lz4-wasm, https://github.com/Benzinga/lz4js, https://github.com/pierrec/node-lz4). If they are small enough, I would consider including a default lz4 implementation. Sounds good?
Dominik Moritz / @domoritz: https://github.com/manzt/numcodecs.js looks interesting as well. It used wasm inlined lz4.
> Maybe. Let's make sure to compare native gzip compression that a web server uses with js lz4/zstd compression.
I'm most familiar with fastapi, which is probably the third most-popular Python web server framework after Django and Flask. Its suggested gzip middleware uses the standard library's gzip implementation so I don't think my example above was completely out of place. The lzbench native benchmarks still have lz4 and zstd as 4-6x faster than zlib.
But I think these performance discussions are more of a side discussion; given that the Arrow IPC format allows for compression, I'd love to find a way for Arrow JS to support these files.
> It would unfortunately also preclude people from putting decompression into a worker. Maybe we can make the relevant IPC methods return promises when the compression/decompression method is async (returns a promise).
That's a very good point. If we implement a registry of some sort, we could consider allowing both sync and async compression. Then `RecordBatchReader` could use sync codecs and `AsyncRecordBatchReader` could use async ones, so a user who wants to run de/compression in a worker could use the `AsyncRecordBatchReader`. Not sure if that's a great idea, but having a synchronous `tableFromIPC` option is nice.
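That split could be sketched roughly like this (hypothetical names, not the actual Arrow JS API): the registry accepts either kind of decompress function, the sync reader path rejects async codecs, and the async reader path awaits either kind:

```typescript
// Hypothetical sketch of a registry supporting sync and async codecs.
type SyncDecompress = (data: Uint8Array) => Uint8Array;
type AsyncDecompress = (data: Uint8Array) => Promise<Uint8Array>;

const codecs = new Map<string, SyncDecompress | AsyncDecompress>();

function registerDecompress(
  name: string,
  fn: SyncDecompress | AsyncDecompress,
): void {
  codecs.set(name, fn);
}

// What a synchronous RecordBatchReader could call.
function decodeSync(name: string, data: Uint8Array): Uint8Array {
  const fn = codecs.get(name);
  if (!fn) throw new Error(`no codec registered for ${name}`);
  const result = fn(data);
  if (result instanceof Promise) {
    throw new Error(`codec ${name} is async; use the async reader`);
  }
  return result;
}

// What an AsyncRecordBatchReader could call; accepts either kind.
async function decodeAsync(
  name: string,
  data: Uint8Array,
): Promise<Uint8Array> {
  const fn = codecs.get(name);
  if (!fn) throw new Error(`no codec registered for ${name}`);
  return await fn(data); // await is a no-op for sync codecs
}

// Example registrations with identity placeholders; real apps would
// register actual LZ4/ZSTD implementations (e.g. worker-backed for async).
registerDecompress('LZ4_FRAME', (d) => d);
registerDecompress('ZSTD', async (d) => d);
```

With this shape, a worker-backed codec simply registers an async function, and only the async reader can use it.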
> If they are small enough, I would consider including a default lz4 implementation. Sounds good?
Sounds good! I'll try to find time soon to put up a draft.
Hey folks! Great to see the discussion on this issue. We've been looking at using Arrow to send some data over the wire from C# to JS. Right now when we try to read record batches from the stream, we see an error: `Record batch compression not implemented`.
It seems like the js library is the only one missing buffer compression support compared to other libraries. This can essentially block interop when passing arrow data from another language to js, with compression enabled.
What's the extent of compression support in the js library right now? I did see a file called `body-compression.ts`, but I am not sure if it's more of a stub at this point.
Is this support planned to be added any time soon? It'll complete the ipc compatibility matrix for the libraries in different languages. Alternatively, if we wanted to support de-compressing buffers in JS, what's the best way to approach that?
> This may not be a hard requirement for JS because this would require pulling in implementations of LZ4 and ZSTD which not all users may want
Reporter: Wes McKinney / @wesm
PRs and other links:
Note: This issue was originally created as ARROW-8674. Please see the migration documentation for further details.