chmduquesne / rollinghash

Implementation of some rolling hashes in go
MIT License
64 stars 11 forks source link

buzhash32 hashes changed #7

Closed jkowalski closed 7 years ago

jkowalski commented 7 years ago

HI,

I'm a developer on https://github.com/kopia/kopia. I've been using your buzhash32 for dynamic file splitting based on properties of a rolling hash. To ensure stability of results I've previously submitted a baseline test that ensures hashes never change and are only based on content. Unfortunately it started failing yesterday on buzhash32

From my initial investigation it appears that buzhash32 values are different than they were before and my object splitter started producing completely different results.

I confirmed that rolling your repo back to 043b8fdecc9816f0011a056f6d92f9a091ab63dd (with no other changes) fixed the issue, so it must have been caused by f15bd480cb55422598202b78affc10ee7c5a8191.

The test in question is TestSplitterStability in:

https://github.com/kopia/kopia/blob/master/object/object_splitter_test.go

chmduquesne commented 7 years ago

Hey Jarek,

Thanks for your bug report! I will investigate this ASAP.

Sorry for breaking your test suite :\

Cheers Christophe-Marie

On Mon, Nov 13, 2017 at 3:48 PM, Jarek Kowalski notifications@github.com wrote:

HI,

I'm a developer on https://github.com/kopia/kopia. I've been using your buzhash32 for dynamic file splitting based on properties of a rolling hash. To ensure stability of results I've previously submitted a baseline test that ensures hashes never change and are only based on content. Unfortunately it started failing yesterday on buzhash32

From my initial investigation it appears that buzhash32 values are different than they were before and my object splitter started producing completely different results.

I confirmed that rolling your repo back to 043b8fd https://github.com/chmduquesne/rollinghash/commit/043b8fdecc9816f0011a056f6d92f9a091ab63dd (with no other changes) fixed the issue, so it must have been caused by f15bd48 https://github.com/chmduquesne/rollinghash/commit/f15bd480cb55422598202b78affc10ee7c5a8191 .

The test in question is TestSplitterStability in:

https://github.com/kopia/kopia/blob/master/object/object_splitter_test.go

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/chmduquesne/rollinghash/issues/7, or mute the thread https://github.com/notifications/unsubscribe-auth/AAOUMOJQ7-i6SU6WGMw5N-1wu76CpYW_ks5s2FbNgaJpZM4Qb4I3 .

chmduquesne commented 7 years ago

Weird.

Unless you are rolling a window with size zero, the change I made should have zero impact.

I do not quite understand your test suite and I am unable to identify where the bug comes from so far. Could you please provide a window where rolling does not produce the intended result?

Cheers, Christophe-Marie

On Mon, Nov 13, 2017 at 3:52 PM, Christophe-Marie Duquesne chmd@chmd.fr wrote:

Hey Jarek,

Thanks for your bug report! I will investigate this ASAP.

Sorry for breaking your test suite :\

Cheers Christophe-Marie

On Mon, Nov 13, 2017 at 3:48 PM, Jarek Kowalski notifications@github.com wrote:

HI,

I'm a developer on https://github.com/kopia/kopia. I've been using your buzhash32 for dynamic file splitting based on properties of a rolling hash. To ensure stability of results I've previously submitted a baseline test that ensures hashes never change and are only based on content. Unfortunately it started failing yesterday on buzhash32

From my initial investigation it appears that buzhash32 values are different than they were before and my object splitter started producing completely different results.

I confirmed that rolling your repo back to 043b8fd https://github.com/chmduquesne/rollinghash/commit/043b8fdecc9816f0011a056f6d92f9a091ab63dd (with no other changes) fixed the issue, so it must have been caused by f15bd48 https://github.com/chmduquesne/rollinghash/commit/f15bd480cb55422598202b78affc10ee7c5a8191 .

The test in question is TestSplitterStability in:

https://github.com/kopia/kopia/blob/master/object/object_splitter_test.go

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/chmduquesne/rollinghash/issues/7, or mute the thread https://github.com/notifications/unsubscribe-auth/AAOUMOJQ7-i6SU6WGMw5N-1wu76CpYW_ks5s2FbNgaJpZM4Qb4I3 .

jkowalski commented 7 years ago

I’ll take a look tonight.

Sent from my iPhone

On Nov 13, 2017, at 07:29, Christophe-Marie Duquesne notifications@github.com wrote:

Weird.

Unless you are rolling a window with size zero, the change I made should have zero impact.

I do not quite understand your test suite and I am unable to identify where the bug comes from so far. Could you please provide a window where rolling does not produce the intended result?

Cheers, Christophe-Marie

On Mon, Nov 13, 2017 at 3:52 PM, Christophe-Marie Duquesne chmd@chmd.fr wrote:

Hey Jarek,

Thanks for your bug report! I will investigate this ASAP.

Sorry for breaking your test suite :\

Cheers Christophe-Marie

On Mon, Nov 13, 2017 at 3:48 PM, Jarek Kowalski notifications@github.com wrote:

HI,

I'm a developer on https://github.com/kopia/kopia. I've been using your buzhash32 for dynamic file splitting based on properties of a rolling hash. To ensure stability of results I've previously submitted a baseline test that ensures hashes never change and are only based on content. Unfortunately it started failing yesterday on buzhash32

From my initial investigation it appears that buzhash32 values are different than they were before and my object splitter started producing completely different results.

I confirmed that rolling your repo back to 043b8fd https://github.com/chmduquesne/rollinghash/commit/043b8fdecc9816f0011a056f6d92f9a091ab63dd (with no other changes) fixed the issue, so it must have been caused by f15bd48 https://github.com/chmduquesne/rollinghash/commit/f15bd480cb55422598202b78affc10ee7c5a8191 .

The test in question is TestSplitterStability in:

https://github.com/kopia/kopia/blob/master/object/object_splitter_test.go

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/chmduquesne/rollinghash/issues/7, or mute the thread https://github.com/notifications/unsubscribe-auth/AAOUMOJQ7-i6SU6WGMw5N-1wu76CpYW_ks5s2FbNgaJpZM4Qb4I3 .

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or mute the thread.

jkowalski commented 7 years ago

Very short repro:

package main

import (
    "log"
    "math/rand"

    "github.com/chmduquesne/rollinghash/buzhash32"
)

func main() {
    r := rand.New(rand.NewSource(5))
    rnd := make([]byte, 30)
    if n, err := r.Read(rnd); n != len(rnd) || err != nil {
        log.Fatalf("can't initialize random data: %v", err)
    }

    h := buzhash32.New()
    for i, b := range rnd {
        h.Roll(b)
        log.Printf("index: %v s: %v", i, h.Sum32())
    }
}

Produces at HEAD:

2017/11/13 18:54:49 index: 0 s: 2932379532
2017/11/13 18:54:49 index: 1 s: 58131087
2017/11/13 18:54:49 index: 2 s: 656199471
2017/11/13 18:54:49 index: 3 s: 393438141
2017/11/13 18:54:49 index: 4 s: 2408814438
2017/11/13 18:54:49 index: 5 s: 1371362761
2017/11/13 18:54:49 index: 6 s: 2696985740
2017/11/13 18:54:49 index: 7 s: 1291251581
2017/11/13 18:54:49 index: 8 s: 3957633017
2017/11/13 18:54:49 index: 9 s: 2407370554
2017/11/13 18:54:49 index: 10 s: 866135285
2017/11/13 18:54:49 index: 11 s: 592537940
2017/11/13 18:54:49 index: 12 s: 2549851453
2017/11/13 18:54:49 index: 13 s: 2185312953
2017/11/13 18:54:49 index: 14 s: 3402646417
2017/11/13 18:54:49 index: 15 s: 1289975679
2017/11/13 18:54:49 index: 16 s: 470101980
2017/11/13 18:54:49 index: 17 s: 3115988171
2017/11/13 18:54:49 index: 18 s: 1124179524
2017/11/13 18:54:49 index: 19 s: 2986207157
2017/11/13 18:54:49 index: 20 s: 98298901
2017/11/13 18:54:49 index: 21 s: 408173852
2017/11/13 18:54:49 index: 22 s: 58650337
2017/11/13 18:54:49 index: 23 s: 1590186338
2017/11/13 18:54:49 index: 24 s: 1469355562
2017/11/13 18:54:49 index: 25 s: 3890660866
2017/11/13 18:54:49 index: 26 s: 4257260820
2017/11/13 18:54:49 index: 27 s: 1453969012
2017/11/13 18:54:49 index: 28 s: 340876147
2017/11/13 18:54:49 index: 29 s: 3735913607

at 043b8fdecc9816f0011a056f6d92f9a091ab63dd

2017/11/13 18:56:27 index: 0 s: 0
2017/11/13 18:56:27 index: 1 s: 1592139158
2017/11/13 18:56:27 index: 2 s: 2621370653
2017/11/13 18:56:27 index: 3 s: 1631002584
2017/11/13 18:56:27 index: 4 s: 1662742444
2017/11/13 18:56:27 index: 5 s: 2293015644
2017/11/13 18:56:27 index: 6 s: 316891047
2017/11/13 18:56:27 index: 7 s: 682699050
2017/11/13 18:56:27 index: 8 s: 594223959
2017/11/13 18:56:27 index: 9 s: 509777511
2017/11/13 18:56:27 index: 10 s: 295575118
2017/11/13 18:56:27 index: 11 s: 1731005474
2017/11/13 18:56:27 index: 12 s: 520318929
2017/11/13 18:56:27 index: 13 s: 2477829984
2017/11/13 18:56:27 index: 14 s: 3912460323
2017/11/13 18:56:27 index: 15 s: 186982427
2017/11/13 18:56:27 index: 16 s: 2475262228
2017/11/13 18:56:27 index: 17 s: 2795727194
2017/11/13 18:56:27 index: 18 s: 2100503910
2017/11/13 18:56:27 index: 19 s: 3449328113
2017/11/13 18:56:27 index: 20 s: 4245749917
2017/11/13 18:56:27 index: 21 s: 3921798157
2017/11/13 18:56:27 index: 22 s: 3763690690
2017/11/13 18:56:27 index: 23 s: 2560570661
2017/11/13 18:56:27 index: 24 s: 3678035621
2017/11/13 18:56:27 index: 25 s: 4273690397
2017/11/13 18:56:27 index: 26 s: 3480979242
2017/11/13 18:56:27 index: 27 s: 870287880
2017/11/13 18:56:27 index: 28 s: 3736982411
2017/11/13 18:56:27 index: 29 s: 1265931638
jkowalski commented 7 years ago

(note that those sequences are stable since I'm initializing random with fixed seed)

chmduquesne commented 7 years ago

Thank you for that. I will try to fix this and then incorporate this kind of tests in the repo.

On Nov 14, 2017 03:57, "Jarek Kowalski" notifications@github.com wrote:

(note that those sequences are stable since I'm initializing random with fixed seed)

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/chmduquesne/rollinghash/issues/7#issuecomment-344130811, or mute the thread https://github.com/notifications/unsubscribe-auth/AAOUMAUAiTWnPnjFRO698nbJ3O_kle2lks5s2QGNgaJpZM4Qb4I3 .

chmduquesne commented 7 years ago

From what I see, your test is misusing this library: you need to initialize a window before starting to roll it, otherwise the library has no idea what window size you want, and what this initial window looks like. So let's assume a window of 64 bytes, here is what correct code would look like:

package main

import (
    "log"
    "math/rand"

    "github.com/chmduquesne/rollinghash/buzhash32"
)

func main() {
    r := rand.New(rand.NewSource(5))
    rnd := make([]byte, 94) // 30 + 64
    if n, err := r.Read(rnd); n != len(rnd) || err != nil {
        log.Fatalf("can't initialize random data: %v", err)
    }

    h := buzhash32.New()

    // Initialize the rolling window
    // - Any window size except 0 rolls correctly.
    // - If one uses Write() on a window of size 0 anyway, the correct
    //   hash is returned when Sum32() is called, but will return
    //   incorrect results after Roll() is called.
    h.Write(rnd[:64])

    for i, b := range rnd[64:] {
        h.Roll(b)
        log.Printf("index: %v s: %v", i, h.Sum32())
    }
}

It returns the same results on HEAD and 043b8fd.

[Edit: I made a mistake pasting the corrected code twice]

chmduquesne commented 7 years ago

I am closing the issue, because as far as I can see, this behavior is due to incorrect use of this library (rolling without prior initialization using Write()). I will however write tests based on what you submitted, in order to detect, for given inputs, that the code yields stable results.

The code to check whether a window was initialized prior to rolling was removed for performance reasons (every operation matters when you repeat it on multi terabyte files), so I don't intend to write any code to prevent this kind of mistakes. I will however try to emphasize the importance of initializing in the Roll() documentation.

jkowalski commented 7 years ago

That's fair. Couldn't you make window initialization explicit (part of construction call)? I'm sure this would avoid surprises like mine.

chmduquesne commented 7 years ago

I am reluctant to do that, because Write() is exactly supposed to do that job: Initializing/Updating the digest. There is no good reason to not expect it to create the rolling window. What I can (and will) do is to issue a big warning in the documentation of Roll().