Clarification on sync behaviour

uazu commented 3 years ago

I'm using miniz_oxide for streaming compression/decompression. I need reliable sync points, which mark the ends of records. I have it working now (and I've fuzzed my wrapper too, to be sure), but it took a few attempts. My observations are:

If I use TDEFLFlush::Sync then it does a flush and a sync, but only if it will all fit in the output buffer, otherwise the sync sequence will be missing (as expected)
If on the following call, I don't use TDEFLFlush::Sync, and there is still data to write, then the flush will continue, but the sync sequence won't be written, so the other end will not have complete data for the end of the record. So it appears that the flush is sticky, but the sync isn't.
However if on the following call, I do use TDEFLFlush::Sync, then if I've already written a sync sequence on the first call, it will write another one on this call. (But I don't want two sync sequences.)

So my solution is to give it a big buffer to write to so that both the flush and the sync can be written in one go. I hope this takes care of things (it did in testing), but in the unlikely case that it didn't, if it filled that buffer then I write another one with TDEFLFlush::Sync. I'd get two syncs in that case, but it's better than my protocol locking up.

Since this appears to work reliably in my fuzz-testing, I'm going with this. However it would be good to document this behaviour just in case someone else needs to do the same.

oyvindln commented 3 years ago

Hm, that doesn't seem entirely optimal, there ought to be a cleaner way to ensure this. I'm not entirely sure whether that specific behavior was intentional and the idea was that one would re-call with the same parameter if everything was not written, or if that's an oversight and it should store that it has been called with flush.

Ideally we would want a cleaner/better api in general though.

uazu commented 3 years ago

I guessed it might have been inherited behaviour from the original implementation, so in that case best thing to do is just to document it. But certainly if the API (or a new API) can be made cleaner that would be ideal. Also, if considering a new API, it needs to stay low-level, i.e. to have the ability to just move data between two buffers as data comes in. Other crates only provide Write as an interface, which is really inconvenient.

oyvindln commented 2 years ago

Totally forgot to reply again.. Yeah it's supposed to stay low level, leaving writer/reader and stuff to flate2. I'm not sure what the best way to design a proper API is though, personally I've been busy with other projects so haven't looked into it much.

oyvindln commented 2 years ago

Ok, after some digging it looks the current behaviour may be what zlib does for the corresponding deflate call (I presume C miniz would behave the same as it's supposed to be compatible with the zlib API, though it doesn't have much documentation and the code is not particularly easy to read.) That is, the caller needs to check that the output buffer is large enough to contain the flush. If deflate returns with avail_out == 0, this function must be called again with the same value of the flush parameter and more output space (updated avail_out), until the flush is complete (deflate returns with non-zero avail_out). In the case of a Z_FULL_FLUSH or Z_SYNC_FLUSH, make sure that avail_out is greater than six to avoid repeated flush markers due to avail_out == 0 on return.

The flush modes are not very widely documented in general though so it's all a bit confusing.

I do think we could add some functionality that stores whether a flush was output or not though as an addition though, maybe using a separate flush parameter or something, ideally we would want the API designed to prevent any erroneous ways of calling the functions that don't produce the expected output.

uazu commented 2 years ago

I think the easiest solution is to work around it, i.e. always have a big-enough buffer. If the docs said "always use a large buffer when flushing sync points", then I would have avoided the trouble. (Also document how big exactly -- I'm guessing slightly larger than 32K, for the case of incompressible data.) Since I was streaming over a low-bandwidth medium, I was using smaller buffers than that. So easiest solution would be to document it.

But if you want to improve the API, then yes I could have worked around the problem if there was some way to detect whether the sync sequence had been output or not, i.e. whether the flush+sync had completed. Then I could keep calling it with TDEFLFlush::Sync until I got to that "fully flushed" state. That would resolve things when using small output buffers, as far as I can see.

That does however push the responsibility onto the coder to remember to keep calling with TDEFLFlush::Sync until the sync is complete. However if that was clearly documented, then that will probably be fine. The alternative might be to make the sync sticky, i.e. keep an internal flag that remembers that a sync was requested, and keep attempting a sync until the sync sequence is finally output. I don't know whether that will break anyone else's assumptions though.

Frommi / miniz_oxide

Clarification on sync behaviour #105