BurntSushi / rust-snappy

Snappy compression implemented in Rust (including the Snappy frame format).
BSD 3-Clause "New" or "Revised" License

Seems to be incompatible with python and java versions of snappy #23

Closed: rodya-mirov closed this issue 4 years ago

rodya-mirov commented 5 years ago

A simple working example demonstrating that szip and Python's python-snappy wheel are incompatible:

* Assume a file "data.json" exists (and data.json.sz does not)
* In python3, write
```
import snappy

data = open('data.json', 'r').read()
c = snappy.compress(data.encode('UTF-8'))
open('data.json.sz', 'wb').write(c)
```
* This results in a binary file on disk. Running `szip -d data.json.sz` yields:
```
$ szip -d data.json.sz
data.json.sz: snappy: corrupt input (expected stream header but got unexpected chunk type byte 148)
```
* However, Python's snappy thinks the file is fine; the following gives the original contents right back:
```
import snappy

snappy.decompress(open('data.json.sz', 'rb').read())
```

So szip cannot interpret python's compressed output. The reverse is also true:

* Assume a file "data.json" exists (and data.json.sz does not)
* Run `szip data.json`
* In python3, write
```
import snappy

data = open('data.json.sz', 'rb').read()
c = snappy.decompress(data)
```
to get this error:
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.7/site-packages/snappy/snappy.py", line 92, in uncompress
    return _uncompress(data)
snappy.UncompressError: Error while decompressing: invalid input
```

I should add that this isn't just a quirk of Python; the Java library I'm using ( https://github.com/xerial/snappy-java ) works fine and is interoperable with Python's snappy. So this library seems to be the odd one out.

I don't know much about the internals of this algorithm, but is it possible they're implementing different specifications of the Snappy format?

BurntSushi commented 5 years ago

szip uses the snappy frame format by default. It's not clear to me from this bug report whether you've accounted for that or not. Sounds like you should be using the stream compression functions in python-snappy, or the raw format in szip.

rodya-mirov commented 5 years ago

That's a good point. When I started with this I didn't realize snappy had two formats (framed and non-framed). For whatever reason this project defaults to framed and the python and java projects default to non-framed, and it's not obvious (until pointed out) how to deal with that.

I've converted everything to framed and it seems to work quite well, so I'll close the issue. Thanks.

BurntSushi commented 5 years ago

For whatever reason this project defaults to framed

You're comparing apples and oranges. You're comparing the output of a command line tool with that of a library function. If the command line tool defaulted to the raw snappy format, then it would mysteriously fail on any file bigger than what can fit inside a single snappy frame. That's not good. So for a command line tool, it defaults to the frame format.

In terms of library design, they are very similar. Both provide "raw" (de)compression along with stream (de)compression: https://docs.rs/snap/0.2.5/snap/
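
For illustration, here's a minimal sketch of the two paths (using the raw::Encoder / write::FrameEncoder names; the exact module layout depends on the crate version):

```
use std::io::Write;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let input: &[u8] = b"some json bytes to compress";

    // Raw format: one-shot, the whole buffer lives in memory. This is what
    // python-snappy's compress() and snappy-java's Snappy.compress() emit
    // by default.
    let raw = snap::raw::Encoder::new().compress_vec(input)?;
    assert_eq!(snap::raw::Decoder::new().decompress_vec(&raw)?, input);

    // Frame format: streaming, and what szip writes by default.
    let mut framed = Vec::new();
    {
        let mut enc = snap::write::FrameEncoder::new(&mut framed);
        enc.write_all(input)?;
        enc.flush()?;
    }

    // Same data, two different byte streams.
    assert_ne!(raw, framed);
    Ok(())
}
```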

Anyway, glad to hear it's working now.

rodya-mirov commented 5 years ago

Yes, I understand the problem now. And to be clear, I do not mean to imply that any of the libraries are designed incorrectly! But at the time I was trying to make this work, I thought Snappy was a "single" compression algorithm / format, but (I have since learned) there are actually two ("raw" and "streaming", to match your terminology) which are not cross-compatible.

(putting more details so this could help someone else later on)

The confusing bit was that the "most obvious code" for each project gives different results, and can be written without realizing there are multiple options.

Point being, it's quite easy not to realize there's a difference and just get incompatible objects. (That's what happened to me: I did Snappy.compress on my Java webserver and tried to decompress the result in a Rust process, which failed; I then tried to decompress those same bytes in Python, which worked, so I filed a bug here.) All of these libraries are well documented and it's straightforward to get whichever format you want, but only after you understand the problem.

Anyway thanks for the great library, we've already gotten huge performance gains by replacing gzip with snappy in production and we're all quite thrilled.

BurntSushi commented 5 years ago

Thanks for elaborating! But yeah, I've always been confused about the focus on the raw format in other snappy libraries. It's almost more like an implementation detail to me. It's truthfully only exposed in this library at all because every other snappy library exposes it.

rodya-mirov commented 4 years ago

Hi @BurntSushi I just read your blog post and it really spoke to me, so I would like to say a little bit about how we used this crate at TriNetX and how much it helped us.

We originally had a python microservice that did some CPU- and memory-intensive data munging. Somewhat suddenly, the size of data it needed to process multiplied by a factor of 20, and it simply couldn't keep up. Not wanting to scale up our instance size for no reason, we rewrote it in Rust. The memory benefits were immense, but the total processing time didn't go down enough. Profiling showed that 60% of our time was spent gzipping and gunzipping data, which is simply not acceptable.

A few Google searches later, we discovered Snappy, but we didn't realize that Snappy was actually two incompatible formats (framed and un-framed), since the python and java libraries both had compatible defaults. When I got around to trying a Rust snappy library, the bytes simply weren't compatible, and there was no obvious reason why, so I made this issue.

In a very short span of time, you helped me to diagnose the issue, understand the underlying issues, and fix everything so that it's compatible. Your library now works beautifully in production, and upon profiling, we spend approximately 5% of our time de/compressing data, since your library is so fast. We were able to significantly improve the user experience and increase the amount of data they can request at a time, allowing them to do better and more legitimate medical data analyses.

I don't know if this issue was one of the "low effort negative issues" that you get faced with, or simply some frustrating homework dropped on you by a stranger on the internet, but your crate -- and therefore you as a maintainer -- had a huge positive effect on me, and on my team. Thank you.

BurntSushi commented 4 years ago

@rodya-mirov Thank you so much for the kind words! I really appreciate it. I always love hearing about use cases like that. It's very cool to hear what folks are doing.

I don't know if this issue was one of the "low effort negative issues" that you get faced with, or simply some frustrating homework dropped on you by a stranger on the internet, but your crate -- and therefore you as a maintainer -- had a huge positive effect on me, and on my team. Thank you.

No worries! I really didn't mean for it to come across as strongly as I think it ultimately did. It's certainly on me to work through as well, and get better at responding well to issues. I also didn't mean to imply that all "low effort" issues are bad. Sometimes you just don't know what you don't know, which I think is probably where this issue came from. I'd much rather people file the issue than worry too much about whether they put too much effort into it. :-) So thank you!

martindurant commented 3 years ago

I'd just like to point out that Parquet uses snappy-raw, because it has internal "framing" (known as pages). Parquet might just be the primary user of snappy compression! So no, I don't think it's an implementation detail, and it is well worth making it clear in docs.

To clarify the situation: snappy was not originally written as a file compressor, but for packets on the wire or blocks in files (such as parquet). This is the reason for the naming in the other packages. The CLI utility came later, and changed the naming landscape.

(I am a maintainer of python-snappy, and author of fastparquet)

BurntSushi commented 3 years ago

Could you say what is unclear from the docs? Here are the relevant paragraphs from the main page:

This crate provides two ways to use Snappy. The first way is through the read::FrameDecoder and write::FrameEncoder types, which implement the std::io::Read and std::io::Write traits with the Snappy frame format. Unless you have a specific reason to the contrary, you should only use the Snappy frame format. Specifically, the Snappy frame format permits streaming compression or decompression.

The second way is through the raw::Decoder and raw::Encoder types. These types provide lower level control to the raw Snappy format, and don't support a streaming interface directly. You should only use these types if you know you specifically need the Snappy raw format.
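
As a sketch of the streaming path (roughly what szip does when decompressing a .sz file):

```
use std::fs::File;
use std::io;

// Decompress a frame-format file (e.g. one produced by `szip data.json`)
// by streaming through read::FrameDecoder; neither the compressed nor the
// decompressed data needs to fit in memory all at once.
fn decompress_file(src: &str, dst: &str) -> io::Result<u64> {
    let mut rdr = snap::read::FrameDecoder::new(File::open(src)?);
    let mut wtr = File::create(dst)?;
    io::copy(&mut rdr, &mut wtr)
}
```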

milesgranger commented 3 years ago

Is it reasonable to open an issue to have the De/Encoders within raw implement Read/Write for this particular use case?

martindurant commented 3 years ago

Since users will likely have come across snappy first elsewhere, I would explicitly reference the alternate names (e.g., you need to dig into https://github.com/google/snappy to find what framed means at all) and make these paragraphs two bullets following the "two ways".

I would replace the "specific reason" with language like "this option is appropriate for...". There are plenty of reasons to want the simpler function; primarily: it is faster!

BurntSushi commented 3 years ago

@milesgranger what use case?

@martindurant

I would replace the "specific reason" with language like "this option is appropriate for...". There are plenty of reasons to want the simpler function; primarily: it is faster!

Sure. That might be a specific reason to use the raw format? The language seems fine?

I would explicitly reference the alternate names

Sorry, which alternate names are you referring to? Is the frame format called something other than what its specification calls it?

milesgranger commented 3 years ago

Sorry, should have been more clear about the use case.

Currently, we're exploring using this lib as the backend for fastparquet, which, as described here, concentrates on using snappy raw. For framed De/Encoders, we can do a single allocation into Python and then de/compress bytes directly into that buffer. Whereas with raw, it requires an allocation on the Rust side and then another into Python.
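
As a rough sketch of what I mean (not our actual code, and assuming the destination buffer was sized generously enough up front): since write::FrameEncoder accepts any std::io::Write, a Python-owned allocation exposed to Rust as a &mut [u8] can be the destination directly.

```
use std::io::Write;

// Sketch: compress `input` directly into a caller-provided buffer (for us,
// memory Python already allocated), returning how many bytes were written.
// `&mut [u8]` implements Write by advancing through the buffer, so no extra
// Rust-side allocation is needed.
fn compress_into(input: &[u8], out: &mut [u8]) -> std::io::Result<usize> {
    let total = out.len();
    let mut remaining: &mut [u8] = out;
    {
        let mut enc = snap::write::FrameEncoder::new(&mut remaining);
        enc.write_all(input)?;
        enc.flush()?;
    }
    // Whatever is left in `remaining` was never written to.
    Ok(total - remaining.len())
}
```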

BurntSushi commented 3 years ago

@milesgranger Sounds like a good use case to support, but by my understanding, the current API already supports it. Both of the raw encoders and decoders have zero allocation APIs. Adding Read or Write impls wouldn't add any new functionality from where I'm standing. Moreover, since the raw APIs specifically don't support streams, adding Read/Write impls seems incorrect to me.

I think you might need to be more specific on your problem?

milesgranger commented 3 years ago

Great!

Okay, let's see if I can get this right.

My understanding is that the raw De/Encoders have de/compress methods which only accept &mut [u8]; moreover, there are hard stipulations on what size that buffer should be. On our side, we often know what size the data will be in de/compressed form; thus we can do an allocation into Python and pass that slice into these methods. That would be okay with the current API, except that Python's bytes type doesn't allow resizing, so we would need to compress into an over-allocated Rust buffer, resize it to the actual length, and then make a new allocation into Python as bytes. But sometimes a user may not know the size, and of course we can use the length calculation functions; then it's the whole round trip of allocating in Rust, truncating, and then copying into Python again.

So here, it's a matter of ergonomics really, plus some performance hit when working with Python bytes; not that the current API wouldn't suffice. And anyway, if the framed encoders implement Read/Write it seems natural that raw would; but that's maybe a lesser argument I suppose.

The main incompatibility is when working with Python bytearrays; via the C API we can resize these and can get the best performance. Therefore we have these scenarios:

  1. User knows the output length; current API insists we might need to over-allocate anyway and compress into that.
  2. User doesn't know the output length (the main incompatibility without Read/Write); we start from zero, and by implementing Write on our own types, we de/compress data into the bytearray, resizing as needed. Granted, raw doesn't support streams, so we basically get one buffer in, but we can then make the allocation into Python within fn write(&mut self, buf: &[u8]) with exactly the right size.

In this way, having at least Read implemented on the raw En/Decoders would cover the challenges I've described.

BurntSushi commented 3 years ago

OK, this whole thing is pretty confusing. @milesgranger Your issue seems completely unrelated to this specific issue. Maybe that's why you asked to open a new issue? I really wish GitHub had a way to split threads, because this conversation is just totally unrelated to the OP here.

Currently, we're exploring using this lib as the backend for fastparquet;

Are you using Google's C++ library currently? If so, do you mind if I ask why you'd want to switch to this library?

My understanding is raw De/Encoders have de/compress which only accepts &mut [u8]; moreover, there are hard stipulations on what size that buffer should be.

Your understanding of the Rust API looks correct to me, but I can't connect the dots for why it's problematic. Before going in, I'd like to be more explicit about my confusion. In my mind, I've seen words in your comments that suggest there are two different problems that seem mutually exclusive to me:

  1. There is something artificial in the API that makes it impossible to avoid an allocation when doing FFI with Python. My understanding is that you see some path toward removing this extra allocation.
  2. The current API is technically sufficient to do whatever you need with the lowest cost possible, but doing so is not ergonomic.

Getting clarity on exactly which one of these is the problem you're facing is critical, because they are very different. However, you also say:

So here, it's a matter of ergonomics really, plus some performance hit when working with Python bytes

Which suggests both problems are present. The other problem here is that you're suggesting Read/Write impls as a solution to your problem, but: 1) Read/Write impls can exist in both directions (for example, this crate provides Read impls for both frame encoders and decoders) and 2) I believe, as a matter of at least theoretical exercise, such impls could be implemented outside of the crate. In my mind, that means that while (2) above may be a problem, (1) cannot be if my assumptions are correct. (I say this not necessarily to say, "hah, you're wrong!" But rather, to elucidate my understanding of your words and make it easier for you to fix my understanding. Hopefully. Hah.)

Maybe the problem is that I'm not terribly familiar with the specifics of Python's FFI facilities. I've used them before, but it's been a long time.

Okay, with that out of the way, I'm going to do my best to respond to some of your details. But now at least you know where my brain is at.

On our side, we often know what size the data will be in de/compressed form; thus we can do an allocation into Python and pass that slice into these methods. That would be okay with the current API, except that Python's bytes type doesn't allow resizing, so we would need to compress into an over-allocated Rust buffer, resize it to the actual length, and then make a new allocation into Python as bytes. But sometimes a user may not know the size, and of course we can use the length calculation functions; then it's the whole round trip of allocating in Rust, truncating, and then copying into Python again.

OK, so bytes represents an allocation that cannot be changed. My understanding is that you should be able to create a bytes allocation and then pass a pointer to that allocation to the Rust APIs in this crate as a &mut [u8]. Assuming that's true, then I believe decompression is solved because you can tell the exact decompressed size from valid compressed snappy bytes. Thus, you can allocate a bytes of exactly the right size.
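
Concretely, a sketch using decompress_len to size the destination before decompressing:

```
// Sketch: decompress raw-format `compressed` into a caller-provided buffer
// (e.g. a Python `bytes` allocation) that is exactly the decompressed size.
fn raw_decompress_exact(compressed: &[u8], out: &mut [u8]) -> Result<usize, snap::Error> {
    // The raw format records the uncompressed length up front, so the exact
    // output size is known before doing any work.
    let need = snap::raw::decompress_len(compressed)?;
    assert_eq!(out.len(), need);
    let mut dec = snap::raw::Decoder::new();
    dec.decompress(compressed, out)
}
```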

Compression is more problematic though, and in order to return a bytes of exactly the right size, it sounds to me like you must have some kind of resizeable scratch buffer to write the compressed bytes into initially, and then copy the compressed data into a bytes once compression is complete. Like, I don't think there is any other way, right? There is no API this crate could provide that would avoid that extra copying given your constraints because you don't know the exact size of the compressed data before compressing it.
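
So for compression, a sketch of the best you can do looks roughly like this, with the returned slice being what gets copied into the freshly allocated bytes:

```
// Sketch: compress into a reusable scratch buffer sized to the worst case
// (max_compress_len), then return exactly the compressed bytes; for Python
// `bytes`, that slice is what you'd copy into the final allocation.
fn raw_compress<'a>(input: &[u8], scratch: &'a mut Vec<u8>) -> Result<&'a [u8], snap::Error> {
    scratch.resize(snap::raw::max_compress_len(input.len()), 0);
    let mut enc = snap::raw::Encoder::new();
    let n = enc.compress(input, scratch)?;
    Ok(&scratch[..n])
}
```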

Are you somehow able to avoid this extra copying/allocation in the current implementation using the C++ library? If so, maybe you could point me to that code. Perhaps there is some kind of miscommunication here that the code will clarify.

if the framed encoders implement Read/Write it seems natural that raw would; but that's maybe a lesser argument I suppose

That was kind of my point above: it isn't natural for the raw encoders to implement Read/Write because the raw encoders don't represent streams. In particular, the key thing that the current API makes abundantly clear is that when you use the raw snappy format, all of the bytes (whether compressed or not) need to be in memory in a contiguous fashion. There is no alternative to this based on how the raw format works. So implementing streaming traits for this sort of thing would be quite weird.

User knows the output length; current API insists we might need to over-allocate anyway and compress into that.

That's interesting. Does the C++ library not have this limitation? And moreover, how does one know the output length before compressing the data? The output length isn't a stable part of any Snappy implementation as far as I know, even if it remains pretty stable in practice. That is, compression can in theory improve (or regress), which would therefore change the output length.

  2. User doesn't know the output length (the main incompatibility without Read/Write); we start from zero, and by implementing Write on our own types, we de/compress data into the bytearray, resizing as needed. Granted, raw doesn't support streams, so we basically get one buffer in, but we can then make the allocation into Python within fn write(&mut self, buf: &[u8]) with exactly the right size.

I don't think I quite understand this. If you don't know the output size, which is what I would expect when compressing in basically all cases, then you allocate a bytearray corresponding to the max compressed size and then compress into that. It sounds like bytearray can be resized, so then you can truncate it at this point? I'm just not quite understanding what implementing Write gives you here.

milesgranger commented 3 years ago

Agreed, I came here advocating a similar API for raw as there is for framed; it was at least on topic for this thread, but it has anyway descended into a narrow winding road. Therefore, I'll concede to your expertise and take some time to reflect on your informative response and probably life choices in general...

If I boil down a feature request for a specific use case, I'll open a new issue like a normal person. :wink:

Thanks for your patience.

martindurant commented 3 years ago

Thanks @BurntSushi for taking the time to answer here, and of course for making the library in the first place.

BurntSushi commented 3 years ago

@milesgranger @martindurant No problem. Please feel free to open new issues. I might be missing something myself. I'm on parental leave and I'm quite sleep deprived.

Also, to be clear, the original topic of this issue was that someone was using the framing format when they should have been using the raw format (or vice versa). Asking for Read/Write impls for the raw APIs is totally different from that. :-)