chmduquesne / rollinghash

Implementation of some rolling hashes in go
MIT License
64 stars 11 forks source link

Allow access to Adler-32's internal window #10

Closed greatroar closed 4 years ago

greatroar commented 4 years ago

This PR adds a WriteTo method to Adler32 that allows clients to obtain a copy of its rolling window. This is useful for Syncthing and other applications that want to inspect/use the parts of a file that match a given hash.

This could probably be done for more hashes, but Adler-32 is the one used by Syncthing.

chmduquesne commented 4 years ago

Hi!

Why do you want to copy the content of the rolling window? Its only purpose is strictly to know what is the oldest byte.

I mean, if it is absolutely necessary, it could be done, but currently I am not sure whether you are using this lib in the right way then. I will first try to have a look at your client code and find out what you are doing...

Cheers!

chmduquesne commented 4 years ago

Hi again,

Let's assume that opening access to the internal rolling window is a good idea a second.

I find it extremely confusing to do this via the WriterTo interface. When you use h.Write(), you are writing something into the hash, and you are altering its state. If you want to take a peak at the rolling window, it would make much more sense to me to implement this via the Reader interface.

Cheers, Christophe-Marie

chmduquesne commented 4 years ago

Hi there,

I implemented a way to extract the rolling window, but from a io.Reader interface since I believe it is much cleaner. It is added for every hash, not just rollinghash.adler32. I added a code snippet in README.md to show how to use it.

This is not an official version of the library yet, it just sits on master for now, but I plan to make a tagged release after further testing (it only has one test for now).

For the folks in syncthing/syncthing#6845, hopefully this can be leveraged in order to solve your problem.

Cheers, Christophe-Marie

greatroar commented 4 years ago

Your new Read method doesn't adhere to Reader conventions. Normally, if you Read m bytes, then Read n bytes, you get the first m+n bytes of a stream. Here, you get the first min(m,n) bytes twice. That's why I went for WriteTo, which is for objects that can write themselves to a external Writer in however many Write calls are necessary. It's closer to Read than to Write in terms of semantics.

chmduquesne commented 4 years ago

You are absolutely right, thank you for pointing that out.

chmduquesne commented 4 years ago

Then indeed writeTo must be the interface needed. I see that io.Copy is even using it when Src implements it.

Sorry for rewriting your PR, I thought you had made a poor choice but after giving it a closer look, I am the one who was mistaken.

chmduquesne commented 4 years ago

Your implementation does not account for when the writer cannot accommodate for the full slice. I thought of using retry loops:

// WriteTo writes the content of the rolling window into w, until it is
// entirely written or an error is met.
func (d *Adler32) WriteTo(w io.Writer) (n int64, err error) {
    var nw int
    newer, older := d.window[d.oldest:], d.window[:d.oldest]
    for n < int64(len(newer)) && err != nil {
        nw, err = w.Write(newer[n:])
        n += int64(nw)
    }
    for n < int64(len(d.window)) && err != nil {
        nw, err = w.Write(older[n-int64(len(newer)):])
        n += int64(nw)
    }
    return n, err
}

However, after re-reading the documentation of io.WriterTo, I am again unsure whether this is using the right semantics.

WriteTo writes data to w until there's no more data to write or when an error occurs. The return value n is the number of bytes written. Any error encountered during the write is also returned.

I am putting emphasis on until there's no more data to write because it is clear that after a call to WriteTo, the rolling window is untouched, so there is still data to write.

I am starting to think that it would make more sense to call this Window without argument, make it return a byte slice and an error, and have it rearrange the window before returning it directly.

greatroar commented 4 years ago

I interpreted "until there's no more data" as "the entire window has been written out" and a next call without doing a Roll would restart at the beginning. If you look in the stdlib, most implementations of WriterTo consume some underlying Reader. The exception is go/types.Scope.WriteTo, where it is used for serialization and calling it twice delivers the same result.

Of course, not using a standard interface in the first place makes the whole discussion moot.

The main reason I wanted to use something that takes a Writer is that it works both when writing to memory (var w bytes.Buffer; h.WriteTo(&w)) or to a file directly, since Syncthing ultimately writes what it finds to a file (and maybe Kopia too, which I believe also uses this lib). I didn't use it in the end, but considered that it might make for an optimization later. Syncthing uses multi-MB windows and runs in the background on desktop and mobile.

greatroar commented 4 years ago

Regarding retry loops, I left those out because the io.Writer doc says

Write must return a non-nil error if it returns n < len(p).

So a short write is really an error condition as far as client code is concerned. In fact, I think retrying could mask some bug in the underlying Writer, making it harder to find.

chmduquesne commented 4 years ago

The exception is go/types.Scope.WriteTo, where it is used for serialization and calling it twice delivers the same result.

Ok, close enough. If the standard lib does this, I don't see any problem following the same pattern with the rolling window.

Write must return a non-nil error if it returns n < len(p).

Ok, your implementation works, then. I wrote mine in a way that if there is an error, the loop stops and it returns immediately, just in case Write writes less bytes than required but it is not an error. However what you found explicitly rules this situation as impossible, so I am introducing complexity for no good reason. I will use your implementation.

greatroar commented 4 years ago

Hold on. I only now notice that that WriteTo doesn't conform to the interface. They reuse only the name.

What if we rename the method WriteWindow? Then we can also change the return value to int instead of int64. It will never write more than the window size, which cannot be larger than maxint because it has to fit in memory.

greatroar commented 4 years ago

I've force-pushed a version that uses WriteWindow and implements it for all hashes.

chmduquesne commented 4 years ago

If we do not use a standard interface, I would prefer to have the method to have the following signature

Window() (w byte[], err error)

I would have it rearrange the rolling window (like I do in Write()) and return the window Slice directly. This can be used to create a buffer via to bytes.NewBuffer(window) which satisfies the WriterTo interface and can do whatever you need after that.

Sorry, there is a fair amount of duplicated code between the rolling hashes. This is not because I can't be bothered with refactoring, this is because a refactoring has a performance costs that I do not want the users of this lib to absorb. The point of this library is speed after all, so the compiler must be able to inline.

chmduquesne commented 4 years ago

Elaborating on what I wrote above: the reason why I prefer the aforementioned method signature is because it avoids allocating memory to receive the data, and it mimics bytes.Buffer.Bytes(), except that I want it to also return an error in case something goes wrong during the re-arrangement of the window. I also think it is better than going through an io.Writer because you get directly a slice that you can feed at will to an io.Writer if your desire is to abstract, but that you can also directly interact with if you want to do something more low level.

I am gladly taking your feedback on this suggestion and if you want to write this code, I would rather have you do it and claim the contribution than do it myself, since I believe that an increased amount of external contributions is creating confidence in the library and alleviates the bus factor. I am also open to move this library out of my personal github account to a dedicated organization, if you plan to keep being a maintainer.

greatroar commented 4 years ago

Sorry, I missed that you want to provide a view onto an internal data structure. I didn't consider that option as it's kind of dangerous if the caller fails to properly copy the result. E.g.,

if matches(h.Sum32()) {
    my.result, _ = h.Window()
}
// on the next h.Roll call, my.result will be invalidated.

Regarding memory allocation, that's not even necessary for the Syncthing use case: it first wants to check the window against a SHA-256, viz.

var sha hash.Hash
h.WriteWindow(sha)
// check sha.Sum(nil) against a list of blocks

then write the result out to a file. This can all be done in constant memory. Of course, not everything you want to do with the window is implemented as a Writer, but many things can be. You do get the overhead of two function calls through the Writer interface, of course.

As a side remark concerning performance: I'm personally not terribly concerned with it for this specific operation. For big enough blocks (say, the 128kiB-8MB blocks that Syncthing uses), it's memory-bound in any case. I noticed when reimplementing Syncthing stuff that the bottleneck is the consumer (hash table lookups), followed by the actual computations for Adler-32 (three divisions per Roll). I didn't check Buzhash, though, and I'm not sure what other consumers do.