Closed Skielex closed 4 years ago
Yes. The reason is that the limit is established as part of the Snappy spec:
> The stream starts with the uncompressed length (up to a maximum of 2^32 - 1), stored as a little-endian varint.
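For concreteness, here is a little sketch of what that header looks like. This is the standard little-endian base-128 varint encoding (the same scheme protobuf uses); it's my own illustration, not code from the Snappy sources:

```python
def uvarint_le(n: int) -> bytes:
    """Encode n as a little-endian varint: 7 bits per byte,
    low-order group first, high bit set on every byte but the last."""
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

# The raw-format header for a 100,000-byte input:
header = uvarint_le(100_000)  # → b'\xa0\x8d\x06' (3 bytes)

# The largest length the spec allows fits in 5 header bytes:
max_header = uvarint_le(2**32 - 1)
```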
Looks that way to me: https://github.com/google/snappy/blob/f16eda3466633b88d0a55199deb00aa5429c6219/snappy.cc#L1045
It's explicitly only encoding a 32-bit integer, although it does look like `N` in that context could be bigger than that, so they might have a bug there. Not sure. There might be an invariant elsewhere that guarantees that `reader->Available()` never returns anything that can't fit into a `uint32_t`.
And on its decompression side, it clearly only supports compressed lengths up to 32-bits: https://github.com/google/snappy/blob/f16eda3466633b88d0a55199deb00aa5429c6219/snappy.cc#L1741-L1747
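In pseudocode terms, the decompression side's check amounts to something like the following. This is a hedged Python sketch of the idea (parse the varint length header, reject anything a 32-bit length can't represent), not a transcription of the C++ code:

```python
MAX_UNCOMPRESSED_LEN = 2**32 - 1  # limit from the Snappy raw-format spec

def read_uncompressed_length(buf: bytes) -> tuple[int, int]:
    """Parse the varint length header of a raw Snappy block.
    Returns (length, header_size), or raises if the length
    exceeds what the 32-bit format can represent."""
    n = shift = 0
    for i, b in enumerate(buf):
        n |= (b & 0x7F) << shift
        if not b & 0x80:           # high bit clear: last varint byte
            if n > MAX_UNCOMPRESSED_LEN:
                raise ValueError("length exceeds 2^32 - 1")
            return n, i + 1
        shift += 7
        if shift > 32:             # more than 5 bytes can't be a 32-bit length
            raise ValueError("varint too long for a 32-bit length")
    raise ValueError("truncated varint header")
```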
So I'm not really sure why you're not seeing the same limitation, but I see it in the code. Where are you seeing the absence of this limitation? Do you have an example program that demonstrates that the reference C++ Snappy implementation can produce raw compressed blocks from more than 2^32 bytes of data?
> for compressing some large arrays and ran into this limitation
It's not clear to me why you aren't using the Snappy frame format. The Snappy raw format should generally be avoided unless your inputs are small: you have to load everything into memory, and you're subject to the 2^32 size limit. The Snappy frame format can compress arbitrarily large data with constant memory. The `cramjam` library you're using even exposes both `snappy_compress` and `snappy_compress_raw`. Its API does require you to load your entire data into memory, but that's a failing of `cramjam`, not of this crate, which works on streams.
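The reason the frame format gets away with constant memory is that it is just a sequence of independent chunks, each holding at most 64 KiB of uncompressed data, so a compressor only ever needs to buffer one chunk. A rough sketch of that outer loop (illustrative only: `compress_block` is a stand-in for a real raw-Snappy block compressor, and real frames also carry a stream identifier and per-chunk checksums, omitted here):

```python
import io

CHUNK = 65536  # the frame format caps each chunk at 64 KiB of uncompressed data

def compress_stream(src, dst, compress_block=lambda b: b):
    """Read, compress, and emit one chunk at a time, so memory use is
    bounded by CHUNK regardless of how large the input is.
    compress_block is a placeholder; identity by default."""
    while True:
        block = src.read(CHUNK)
        if not block:
            break
        dst.write(compress_block(block))

# Usage with in-memory streams; file objects work the same way:
src = io.BytesIO(b"x" * 200_000)
dst = io.BytesIO()
compress_stream(src, dst)
```

The point is that nothing in the loop ever holds more than one 64 KiB block, which is why the frame format has no analogue of the raw format's up-front length header or its 2^32 limit.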
Thanks for clarifying. The reason I was using raw compression was that the code which was reading the data (which I didn't write) was using raw decompression.
I tested out the `python-snappy` library for compression too, which uses the C++ implementation. It didn't complain about size when I passed in more than 2^32 - 1 bytes. However, it turned out that it actually corrupts the data, which I only discovered when I tried using the data. Clearly, throwing an exception up front as you do is much better!
Is there a reason for the `MAX_INPUT_SIZE` limit?
I was testing out the `cramjam` Python package (which uses this Rust implementation) for compressing some large arrays and ran into this limitation. The program reading the data relies on the C++ snappy implementation, but there I don't see the same limitation.