dgryski / go-farm

go-farm: a pure-Go farmhash implementation
MIT License
238 stars 22 forks source link

Undeterministic fingerprint on the same string #12

Closed manishrjain closed 6 years ago

manishrjain commented 6 years ago

While running a bank test on Badger, I narrowed down a failure on farm.Fingerprint method being used for conflict detection among keys. Produced this PR: https://github.com/dgraph-io/badger/pull/593

A long-running test (over 4h at this point) is starting to look like this is indeed the fix.

Way to test: badger bank test --dir=<some-dir> --dur=4h (with and without this PR)

The txn conflict detection code is here: https://github.com/dgraph-io/badger/blob/master/transaction.go#L147-L171

Notes from observation:

Comparing @ts=437427 with @ts=437428
Index: 92881. Account [{Id:92881 Bal:120}] -> [{Id:92881 Bal:115}]
Index: 94432. Account [{Id:94432 Bal:115}] -> [{Id:94432 Bal:120}]

2018/09/26 09:49:12 Balance did NOT match up. Expected: 10000000. Received: 9999995
Index: 57525. Account [{Id:57525 Bal:115}] -> [{Id:57525 Bal:120}]
Index: 94432. Account [{Id:94432 Bal:120}] -> [{Id:94432 Bal:110}]

So, both read $115. @ts=437428, 94332 acct went from 115 -> 120.
@ts=437443, 94332 acct went from the read 115 -> 110 (but the actual was 120).
This should have been a txn conflict. It seems like farm.Fingerprint might have
generated a non-deterministic hash.

If helpful, I could produce a main program which directly load tests farm.Fingerprint method over a million keys (similar to what Badger bank test is doing).

dgryski commented 6 years ago

Having a smaller main would be great. Also good to know if the cgo farmhash bindings have the same issue. You could link to those (they should have the same API) and complain if there's a mismatch.

manishrjain commented 6 years ago

I have a main program, but I'm not able to reproduce the indeterminism with it so far; so this could be a false alarm.

package main

import (
    "encoding/binary"
    "log"
    "math/rand"
    "sync"
    "sync/atomic"
    "time"

    farm "github.com/dgryski/go-farm"
)

var keyPrefix = "account:"

func key(account int) []byte {
    var b [4]byte
    binary.BigEndian.PutUint32(b[:], uint32(account))
    return append([]byte(keyPrefix), b[:]...)
}

func main() {
    m := make(map[string]uint64)
    N := 1000000
    for i := 0; i < N; i++ {
        k := key(i)
        m[string(k)] = farm.Fingerprint64(k)
    }

    var wg sync.WaitGroup
    var total uint64
    start := time.Now()
    for i := 0; i < 16; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for {
                a := rand.Intn(N)
                k := key(a)
                fp := farm.Fingerprint64(k)
                expected, has := m[string(k)]
                if !has {
                    log.Fatalf("Unable to find account: %d", a)
                }
                if fp != expected {
                    log.Fatalf("key=%q. expected=%d. found=%d\n", k, expected, fp)
                }
                if runs := atomic.AddUint64(&total, 1); runs%100000 == 0 {
                    log.Printf("[%6s] Runs: %d\n",
                        time.Since(start).Round(time.Second).String(), runs)
                }
            }
        }()
    }
    wg.Wait()
}
dgryski commented 6 years ago

I wonder if the full badger code there's a data race or something else where the input is getting accidentally altered before being passed to Fingerprint? Do these tests run with -race ?

manishrjain commented 6 years ago

Yeah, our CI and test.sh both run with the race flag. Also, the change that I did (linked above) "seems" to have fixed it (running for over 5 hrs now), and it gets the same input as what we pass to fingerprint.

dgryski commented 6 years ago

Might be worth updating your "use strings" PR to also log the actual inputs that are generated and see if we can get a test case for Fingerprint out of it. Note also that using strings will end up copying the data whereas a []byte is still mutable and might be altered by action-at-a-distance. Maybe try copying the byte slice before passing it to Fingerprint? If it still fails we can know it's Fingerprint, otherwise if it passes we know Fingerprint is deterministic and it's the badger code that's altering the slice somehow.

manishrjain commented 6 years ago

Hmm... Ok. I'll try that. A teammate and I are running the above main and the badger bank test on multiple machines. Once they run for 10+ hrs and don't fail, I'll try the strategy you suggested.

manishrjain commented 6 years ago

The main program I wrote above has been running OK for 9.5 hrs. I have no reason to believe that the fingerprint method is indeterministic. Closing the issue.