BurntSushi / rust-snappy

Snappy compression implemented in Rust (including the Snappy frame format).

Is there a reason for the MAX_INPUT_SIZE limit? #33

Closed: Skielex closed this issue 4 years ago

Skielex commented 4 years ago

Is there a reason for the MAX_INPUT_SIZE limit?

```rust
/// We don't permit compressing a block bigger than what can fit in a u32.
const MAX_INPUT_SIZE: u64 = std::u32::MAX as u64;
```

I was testing out the cramjam Python package (which uses this Rust implementation) for compressing some large arrays and ran into this limitation. The program reading the data relies on the C++ snappy implementation, but there I don't see the same limitation.

BurntSushi commented 4 years ago

Yes. The reason is that the limit is established as part of the Snappy spec:

The stream starts with the uncompressed length (up to a maximum of 2^32 - 1), stored as a little-endian varint.
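That length preamble is an ordinary base-128 varint. As a minimal sketch of what writing it looks like (a hypothetical helper for illustration, not this crate's internal code):

```rust
/// Write `len` as a little-endian base-128 varint: 7 bits per byte, with
/// the high bit marking continuation. Hypothetical helper, not this
/// crate's internal code.
fn write_varint_len(mut len: u64, out: &mut Vec<u8>) {
    assert!(len <= std::u32::MAX as u64, "raw Snappy caps the length at 2^32 - 1");
    loop {
        let byte = (len & 0x7F) as u8;
        len >>= 7;
        if len == 0 {
            out.push(byte);
            return;
        }
        out.push(byte | 0x80);
    }
}

fn main() {
    let mut buf = Vec::new();
    write_varint_len(2097150, &mut buf);
    // The example from the Snappy format description: 2097150 -> 0xFE 0xFF 0x7F.
    assert_eq!(buf, [0xFE, 0xFF, 0x7F]);
}
```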

As for the C++ implementation, it looks to me like it has the same limit: https://github.com/google/snappy/blob/f16eda3466633b88d0a55199deb00aa5429c6219/snappy.cc#L1045

It's explicitly only encoding a 32-bit integer, although it does look like N in that context could be bigger than that, so they might have a bug there. Not sure. There might be an invariant elsewhere that guarantees that reader->Available() never returns anything that can't fit into a uint32_t.

And on its decompression side, it clearly only supports compressed lengths up to 32 bits: https://github.com/google/snappy/blob/f16eda3466633b88d0a55199deb00aa5429c6219/snappy.cc#L1741-L1747

So I'm not really sure why you're not seeing the same limitation, but I see it in the code. Where are you seeing the absence of this limitation? Do you have an example program that demonstrates that the reference C++ Snappy implementation can produce raw compressed blocks from more than 2^32 bytes of data?
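That decode-side check amounts to reading the length varint and refusing any value that doesn't fit in 32 bits. Roughly, as a sketch (not the actual C++ code, nor this crate's):

```rust
/// Sketch of decoding the raw-format length header: returns the length and
/// the number of header bytes consumed, or None if the header is malformed
/// or encodes a length that doesn't fit in a u32. Hypothetical helper.
fn read_varint_len(input: &[u8]) -> Option<(u64, usize)> {
    let mut len = 0u64;
    let mut shift = 0;
    for (i, &b) in input.iter().enumerate() {
        len |= u64::from(b & 0x7F) << shift;
        shift += 7;
        if b & 0x80 == 0 {
            return if len <= std::u32::MAX as u64 { Some((len, i + 1)) } else { None };
        }
        if shift > 32 {
            // More than five header bytes can't encode a valid 32-bit length.
            return None;
        }
    }
    None // ran out of input mid-header
}
```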

for compressing some large arrays and ran into this limitation

It's not clear to me why you aren't using the Snappy frame format. The Snappy "raw" format should generally be avoided unless you have small inputs: you have to load everything into memory, and you're subject to the 2^32 size limit. The Snappy frame format can compress arbitrarily large data with constant memory. The cramjam library you're using even exposes both snappy_compress and snappy_compress_raw. Its API does require you to load your entire data into memory, but that's a failing of cramjam, not of this crate, which works on streams.
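For example, streaming arbitrarily large data through the frame format looks roughly like this (a minimal sketch assuming snap 1.x's snap::write::FrameEncoder):

```rust
use std::io::{self, Write};

fn main() -> io::Result<()> {
    // Frame-format compression of an arbitrarily large stream, in constant
    // memory: copy stdin into a FrameEncoder wrapping stdout.
    let stdin = io::stdin();
    let stdout = io::stdout();
    let mut wtr = snap::write::FrameEncoder::new(stdout.lock());
    io::copy(&mut stdin.lock(), &mut wtr)?;
    wtr.flush()
}
```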

Skielex commented 4 years ago

Thanks for clarifying. The reason I was using raw compression is that the code reading the data (which I didn't write) uses raw decompression.

I tested out the python-snappy library for compression too, which uses the C++ implementation. It didn't complain about size when I passed in more than 2^32 - 1 bytes. However, it turned out that it actually corrupts the data, which I only discovered when I tried to use it. Clearly, throwing an exception up front as you do is much better!
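For reference, the cap (and the up-front failure) can be observed without allocating 4 GiB of input; a sketch assuming snap 1.x's raw module on a 64-bit target:

```rust
fn main() {
    // snap::raw::max_compress_len reports the raw-format cap up front: it
    // returns 0 for inputs exceeding 2^32 - 1 bytes, and the raw Encoder
    // errors out on such inputs instead of silently producing corrupt
    // output. (Sketch; assumes snap 1.x on a 64-bit target.)
    assert!(snap::raw::max_compress_len((1 << 32) - 1) > 0);
    assert_eq!(snap::raw::max_compress_len(1 << 32), 0);
}
```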