bitemyapp / brotli2-rs

Brotli encoders/decoers for Rust
Apache License 2.0
28 stars 13 forks source link

Use new compressor API #11

Closed aidanhs closed 7 years ago

aidanhs commented 7 years ago

Followup to #10 once it gets merged. This is a WIP.

The new design of brotli seems a little odd, but I've tried to represent it as reasonably as possible. Some notes/thoughts so far:


After struggling for some time with it, I'd like to talk about the Write trait. As it stands, it's seems extremely difficult to implement it as described in the documentation. There are two quotes of interest for the write method:

  1. "A call to write represents at most one attempt to write to any wrapped object."
  2. "If an error is returned then no bytes in the buffer were written to this writer."

There are two possible interpretations of 1:

a. 'You can only call write once on each wrapped object' b. 'If a call to write fails, you're not allowed to retry'

The implications of these are as follows

1a. Not allowed call write multiple times, so write_all is clearly forbidden. As we need to ship data to a writer which may be slower than data is coming in, you either need an unbounded buffer or the ability to use Ok(0) to indicate you're flushing some data. 1b. Not allowed to use write_all as it masks Interrupted. But you can replace it with your own write_all that bubbles up any error, so this isn't too problematic.

Now let's completely ignore 1, and just consider 2. The implication here is that writes are 'retry-safe', i.e. if you get an error you can try again with the buffer you just tried with and if it succeeds then all is well. This forbids write_all as you can't keep track of where you were up to if an error occurs and you may have already written some of the buffer, making your code retry-unsafe.

So, write_all is basically forbidden by anyone implementing Write who wants to follow these rules. It turns out that the current implementation in this library uses write_all, but tries to be retry-safe - I am reasonably confident I can come up with a test that demonstrates this.

Looking up the original IO RFC, the intention seems to have been that both 1a and 1b are the desired interpretations. I contend that adhering to these three rules is not possible in a moderately complex writer. Let's take a look a zlib. It mentions the issue I raise about Ok(0), and of the three options for dealing with 1a (returning Ok(0), using an unbounded buffer, ignoring the requirement) has opted for ignoring the requirement, happily attempting to write to a contained object more than once.

It's also worth talking about the interaction between 1b and 2 here. A naive implementer following these two rules might take input, process it, and attempt to write all the bytes out in a loop, keeping track of progress. If one of the writes fails, the implementer must return the number of bytes written and so the error needs to be stashed away until the next attempted operation when it can be bubbled up. This is horrible, but can be alleviated by the One True Way displayed in zlib - an inversion of the naive logic I describe above, with fallible IO coming first, followed by consumption of the input buffer that will not fail on IO, effectively queuing the IO for the next call.

I'm just noting down my experiences (this is a terser version of a comment I lost) as I plan to:

alexcrichton commented 7 years ago

I feel like we should be pushing people towards the methods that correspond to the C API

Definitely agreed! Looks like you're already taking care of all that though :)

There are two quotes of interest for the write method:

Heh quite a lot of interesting discussion here! It's true yeah that writing a 100% correct Write implementaiton is actually pretty tricky. I tink you've hit most of the high points, though, and PRs to shore up the docs here in libstd would be most welcome!

alexcrichton commented 7 years ago

Ok with #10 merged want to rebase this and I'll help take a look at the apis?

aidanhs commented 7 years ago

Rebased and pushed. More thoughts about the current API (in addition to the original comment):

  1. I like that compress/decompress expose an non-blocking interface you can use if you need.
  2. I'm not such a fan of compress/decompress being so hard to use correctly, the whole "ref to a ref" thing feels quite unnatural in Rust.
  3. I could imagine implementing process, flush and finish as methos on Compress if there seems to be a need, but the Read/Write traits feel good enough for now.
  4. Perhaps it's worth moving compress/decompress into a lowlevel module and CompressParams/CompressMode up to the top level of the library (so stream vanishes entirely). I know that when I found this crate, I visited the stream module first and got confused at how manual it all seemed to be to use.
aidanhs commented 7 years ago

Ok, addressed review comments, added a commit to run the example in CI and I've also moved the 'low level' stuff into a raw module since I reckon the word stream is too tantalising for people looking at the library.

alexcrichton commented 7 years ago

Looks great to me, thanks so much again @aidanhs!

Out of curiosity, would you like to be made a collaborator for this repo?

aidanhs commented 7 years ago

No worries, I initially started on this yak shave because I was getting different compressed outputs via different methods of compressing and wondered if upgrading would help. Turns out that brotli handles an empty input specially when compressing a buffer rather than stream (https://github.com/google/brotli/blob/2c001010aa49b06de83bf28ec43be4fed8be3fbd/c/enc/encode.c#L1384-L1389) -_-

Anyway, sure! Seems pretty low activity though so it may well just be me making PRs that you review still.

alexcrichton commented 7 years ago

Heh ok, no worries :)