BurntSushi / rust-snappy

Snappy compression implemented in Rust (including the Snappy frame format).
BSD 3-Clause "New" or "Revised" License
444 stars 43 forks source link

Add analogue to `flate2::read::GzEncoder` that compresses in `Read` #17

Closed emk closed 4 years ago

emk commented 6 years ago

Thank you for a writing a pure Rust port of snappy! We use it heavily at Faraday for fast compression and decompression of huge files in many places.

We recently discovered a need for a "snappy framed" equivalent of the flate2::read::GzEncoder type. That is, a snappy encoder which implements Read, consuming uncompressed data and outputting compressed data.

I'm volunteering to write this and send you a PR with tests and docs, assuming you're interested. :-) Thank you as always for all your great Rust tools.

BurntSushi commented 6 years ago

@emk I think that sounds reasonable! One concern I have here is figuring out which types do what. I had to carefully read the GzEncoder and GzDecoder docs a few times very carefully, and I'm still having trouble thinking about when I would use them. Perhaps you could say some words about your use case that might help me understand? We might also consider a naming overhaul (and releasing a 0.3 version) if the existing Reader/Writer names become confusing, although I do like their simplicity.

emk commented 6 years ago

OK, here's a sample use case:

  1. I have an input stream of uncompressed data that implements Read. This might be a File, or it might be some multi-gigabyte PostgreSQL query generating CSV data with COPY. I don't want to write this to disk, I just want to pass it through.
  2. I have a streaming S3 uploader that takes a Read instance, breaks the data into 5MB chunks as it reads, and then uploads the chunks to S3 in parallel using upload_object_part and a worker pool. All queues are bounded and blocking, thanks to your chan. (We're going to open source the worker pools and S3 client soon.)
  3. In between (1) and (2), I want to stick something like snap::read::SzipEncoder, so that I compress the data as I read it before handing it off to the "chunker" for uploading to S3. This would be very easy and transparent.

(In theory, we could do all this using Tokio, but company policy is that we're not using Tokio before async and await land.)

Basically, "compressing Read" and "decompressing Write" are rarer use cases. But when you have one API that gives you a Read object, and another one that consumes a Read object, and you want to compress in between, then it's really useful to have.

Most of the "big" Rust compression libraries offer all 4 versions of (compressing, decompressing) x (read, write). I have an OK to implement at least "compressing Read" on work time. And I've implemented snappy framing in Rust before by wrapper the C version, but I've deprecated my crate in favor of yours. :-)

I'm agnostic about names. We could imitate flate2 for ecosystem-wide consistency, or we could just add the two missing versions using longer names, because they're used less often.

emk commented 6 years ago

Good morning!

I'm starting work on a ReadCompressing implementation this morning, which I hope to submit as a PR shortly. Obviously, you should feel free to pick a better name.

BurntSushi commented 6 years ago

@emk Thanks so much for the great write up! Definitely fully on board with adding this stuff. I basically have two concerns:

And I've implemented snappy framing in Rust before by wrapper the C version, but I've deprecated my crate in favor of yours. :-)

:-) Are you perhaps hinting here that the Snappy frame format has to be implemented twice to support this use case? If so, I begrudgingly accept, but man, that would be a bummer.

I'm agnostic about names. We could imitate flate2 for ecosystem-wide consistency, or we could just add the two missing versions using longer names, because they're used less often.

This is really bothering me. I don't know what to do. I value consistency with other crates in the ecosystem very highly. On that alone I'm inclined to do what flate2 does, including emulate the module hierarchy. On the other hand, I also wonder whether we can avoid a module hierarchy at all and perhaps do a better job with naming. But I honestly don't have any good ideas.

@alexcrichton Do you have any thoughts here? I'll summarize briefly: @emk wants to add the equivalent of flate2::read::GzEncoder to the snappy crate (that is, a Read impl for uncompressed->compressed), and we're trying to figure out naming. Are you happy with the naming and module structure that flate2 uses? If not, what changes would you make?

alexcrichton commented 6 years ago

I've personally been pretty happy with how flate2 turned out, although I wouldn't call it a silver bullet or without room for improvement by any means :). In any case though I don't have any better names off the top of my head I'd rather have named the types at this moment

BurntSushi commented 6 years ago

@emk All right, so how do you feel about re-creating flate2's module structure? I think that means we leave Decoder/Encoder as-is, then re-create read/write modules and have read::FrameEncoder, read::FrameDecoder and write::FrameEncoder (and we could add write::FrameDecoder later if a use case comes up).

emk commented 6 years ago

@BurntSushi I don't think I'm hinting at anything; hinting would require more brain cells than I have right now. :-)

I once had a snappy crate with a framing implementation ("snappy framed", IIRC, on crates.io) that I deprecated a while back because it used the C FFI, and you had a pure Rust version. I don't remember whether mine implemented all four modes or not. It's a moot point; I'm not going to use mine, and I would rather contribute to your crate. Your crate is simply better, so mine is gone. :-)

(I generally find it annoying whenever there are 5 subtly different but incomplete versions of every library, something which is extremely common with some langauges, but rarer with Rust. And so I regularly deprecate my stuff and contribute to other people's crates when other people's crates have a fundamentally better design. Your crate is written in pure, performant Rust, something I never managed to accomplish with mine. Plus, it reduces the number of crates I need to actively maintain!)

I'm happy to go with the flate2 structure and the names you propose. Thank you for the spec!

BurntSushi commented 6 years ago

@emk No no, you misunderstand. :) I wasn't referring to your other work, but whether adding this new implementation would require two distinct internal implementations of the frame format. I haven't touched this code in a long time, so it is perhaps a naive question!

emk commented 6 years ago

OK, I've opened a working branch at https://github.com/BurntSushi/rust-snappy/pull/18, with the initial renaming of the existing Reader and Writer (including in the docs and the szip) source code.

I've also set up a cargo workspace that includes both snap and szip to simplify running tests, and attempted to make sure that CI also includes szip.

BurntSushi commented 6 years ago

@emk Excellent, thank you! Please ping me when you'd like me to review.