lmdb: strange numbers of allocations reported by benchmarks

bmatsuo commented 8 years ago

@lmb pointed out some anomalous looking numbers reported by go test -bench=. -benchmem in #61, Specifically regarding the BenchmarkTxn_Put function.

$ go test -test.run=None -test.bench='BenchmarkTxn_Put$' -benchmem
PASS
BenchmarkTxn_Put-4       1000000              1759 ns/op              96 B/op          4 allocs/op
--- BENCH: BenchmarkTxn_Put-4
        bench_test.go:444: initializing random source data
ok      github.com/bmatsuo/lmdb-go/lmdb 2.616s

The 4 allocations reported are difficult to track down. ~~One should be happening in the benchmark function itself~~ (edit: that is false, allocations in the benchmark function get amortized and do not have a significant effect). But the others are not easy to account for. In a normal (non-error) case Txn.Put does not appear to make allocations inside the Go runtime.

@lmb ran the pprof tool using -alloc_objects and the following information about allocations in the benchmark function was comptued:

(pprof) list lmdb.BenchmarkTxn_Put.func1
Total: 506370
ROUTINE ======================== github.com/bmatsuo/lmdb-go/lmdb.BenchmarkTxn_Put.func1 in /Users/lorenz/go/src/github.com/bmatsuo/lmdb-go/lmdb/bench_test.go
         0     440425 (flat, cum) 86.98% of Total
         .          .     77:   err = env.Update(func(txn *Txn) (err error) {
         .          .     78:       b.ResetTimer()
         .          .     79:       defer b.StopTimer()
         .          .     80:       for i := 0; i < b.N; i++ {
         .          .     81:           k := ps[rand.Intn(len(ps)/2)*2]
         .         21     82:           v := makeBenchDBVal(&rc)
         .     440404     83:           err := txn.Put(dbi, k, v, 0)
         .          .     84:           if err != nil {
         .          .     85:               return err
         .          .     86:           }
         .          .     87:       }
         .          .     88:       return nil

Nothing seems to be adding up correctly here.

lmb commented 8 years ago

So, here is how I interpret this:

The for loop runs N times
For N runs, makeBenchDBVal makes 21 allocations
For N runs, Txn.Put makes 440404 allocations

Now, considering that N is kind of large (a few 100k probably), allocations from makeBenchDBVal do not factor in. The 4 allocations per op are exclusively from Txn.Put. Not very exciting so far.

But where are those allocations happening in Txn.Put (this is a different run from the one above, so absolute numbers won't match up)?

Total: 906361
ROUTINE ======================== github.com/bmatsuo/lmdb-go/lmdb.(*Txn).Put in /Users/lorenz/go/src/github.com/bmatsuo/lmdb-go/lmdb/txn.go
    905940     905940 (flat, cum)   100% of Total
         .          .    314:   vn := len(val)
         .          .    315:   if vn == 0 {
         .          .    316:       val = []byte{0}
         .          .    317:   }
         .          .    318:
    905940     905940    319:   ret := C.lmdbgo_mdb_put2(
         .          .    320:       txn._txn, C.MDB_dbi(dbi),
         .          .    321:       unsafe.Pointer(&key[0]), C.size_t(kn),
         .          .    322:       unsafe.Pointer(&val[0]), C.size_t(vn),
         .          .    323:       C.uint(flags),
         .          .    324:   )

L320: No allocations here? Copied to the stack
L321: 1 allocation?
L322: 1 allocation?
L323: No allocations again

The only way I can make the numbers match up is if unsafe.Pointer(&key[0]) does two allocations. Which I don't understand.

bmatsuo commented 8 years ago

Ah. I was partially misreading the code. Indeed makeBenchDBVal will only allocate a fraction of the time. So it will not have an effect on the benchmark's computed allocs/op.

Making those numbers add up is a stretch. Honestly given some of the weird swings in benchmark numbers I have seen before I have wondered if cgo could somehow be allocating or copying things which I don't expect. I never had real evidence for that before this.

edit: this was one of my wacky theories about slowdowns experienced trying to fix #56

bmatsuo commented 8 years ago

Well this is pretty fascinating. Check out the following example: http://play.golang.org/p/hC-b5VSG58

Now look at this benchmark: http://play.golang.org/p/e37ObBxwJP

And the results

$ go test -bench=. -benchmem
testing: warning: no tests to run
PASS
BenchmarkFoo-4           5000000               297 ns/op              48 B/op          2 allocs/op
BenchmarkFoo_nil-4      10000000               148 ns/op               0 B/op          0 allocs/op
ok      github.com/bmatsuo/lmdb-go/tmp/cgo-allocs       3.433s

Shocking

bmatsuo commented 8 years ago

Wow. These benchmarks (for the earlier example) tell another interesting story: http://play.golang.org/p/vLQoKQRxPP

$ go test -bench=. -benchmem                                                                                                                  

testing: warning: no tests to run
PASS
BenchmarkFoo_1-4         5000000               271 ns/op              48 B/op          2 allocs/op
BenchmarkFoo_100-4       2000000               640 ns/op              48 B/op          2 allocs/op
BenchmarkFoo_1000-4       500000              3933 ns/op              48 B/op          2 allocs/op
BenchmarkFoo_10000-4       50000             36555 ns/op              48 B/op          2 allocs/op
BenchmarkFoo_100000-4       5000            362154 ns/op              69 B/op          2 allocs/op
BenchmarkFoo_1000000-4       500           3610440 ns/op            2063 B/op          2 allocs/op
BenchmarkFoo_10000000-4       30          36416743 ns/op          333465 B/op          2 allocs/op
ok      github.com/bmatsuo/lmdb-go/tmp/cgo-allocs       12.931s

I guess the cgo pointer checks are doing a linear scan for pointers despite the fact that it is given a []byte.

edit: Holy hell! I didn't even notice that alloced space increases at the end! What is going on?

lmb commented 8 years ago

That is indeed fascinating! Does DEBUG=cgocheck=0 change that?

bmatsuo commented 8 years ago

Execellent question. The answer... ~~NOT REALLY!! WTF?!~~ (edit: yup, when the correct variable, GODEBUG is used).

GODEBUG=cgocheck=0 go test -bench=. -benchmem
testing: warning: no tests to run
PASS
BenchmarkFoo_1-4         5000000               250 ns/op              48 B/op          2 allocs/op
BenchmarkFoo_100-4       5000000               259 ns/op              48 B/op          2 allocs/op
BenchmarkFoo_1000-4      5000000               256 ns/op              48 B/op          2 allocs/op
BenchmarkFoo_10000-4     5000000               256 ns/op              48 B/op          2 allocs/op
BenchmarkFoo_100000-4    5000000               258 ns/op              48 B/op          2 allocs/op
BenchmarkFoo_1000000-4   5000000               259 ns/op              48 B/op          2 allocs/op
BenchmarkFoo_10000000-4  5000000               254 ns/op              50 B/op          2 allocs/op
ok      github.com/bmatsuo/lmdb-go/tmp/cgo-allocs       10.829s

bmatsuo commented 8 years ago

I don't know why the allocated space went down.. It doesn't actually seem correlated with the variable setting.

bmatsuo commented 8 years ago

In fact, the name was wrong, the environment variable is GODEBUG, not DEBUG. Oops.

That makes the linear runtime go away, but the allocations are still there. I'll update my old comment.

lmb commented 8 years ago

See https://github.com/golang/go/blob/master/src/runtime/cgocall.go#L327 ?

// Complicating matters, taking the address of a slice or array
// element permits the C program to access all elements of the slice
// or array. In that case we will see a pointer to a single element,
// but we need to check the entire data structure.

bmatsuo commented 8 years ago

I didn't completely realize this was a dumb runtime check which did not benefit at all from type information about the argument (presumably because it has already been converted to unsafe.Pointer).

I've also noted that this behavior seems to be ~~restricted to (void*) arguments in C~~ (edit: dependent on C argument types). If I change the C function signature to int foo(char *x), and convert the unsafe.Pointer to type *C.char then all allocations disappear and the runtime is ~constant (with or without GODEBUG).

$ go test -bench=. -benchmem
testing: warning: no tests to run
PASS
BenchmarkFoo_1-4        10000000               149 ns/op               0 B/op          0 allocs/op
BenchmarkFoo_100-4      10000000               149 ns/op               0 B/op          0 allocs/op
BenchmarkFoo_1000-4     10000000               149 ns/op               0 B/op          0 allocs/op
BenchmarkFoo_10000-4    10000000               149 ns/op               0 B/op          0 allocs/op
BenchmarkFoo_100000-4   10000000               149 ns/op               0 B/op          0 allocs/op
BenchmarkFoo_1000000-4  10000000               150 ns/op               0 B/op          0 allocs/op
BenchmarkFoo_10000000-4 10000000               149 ns/op               1 B/op          0 allocs/op
ok      github.com/bmatsuo/lmdb-go/tmp/cgo-allocs       11.555s

Weirdly enough the very last test says 1 byte is allocated per loop iteration. That is strange indeed. But probably not that important.

lmb commented 8 years ago

Yes, I think you are right. The check tries to be smart and aborts if it knows that the type can not contain pointers, but that plan is foiled by conversion to unsafe.Pointer. See https://github.com/golang/go/blob/master/src/runtime/cgocall.go#L466

Trying to dig up where the type coercion to unsafe.Pointer is happening, no luck :(

lmb commented 8 years ago

Tadaaa https://github.com/golang/go/issues/14387

bmatsuo commented 8 years ago

Fantastic detective work, @lmb. It makes me quite happy that this should be fixed before go1.7.

After noticing that the allocations were tied to void* arguments, I had a go and replacing all of them with char* (which wasn't hard because I already had to proxy all the txn ops in lmdbgo.{c,h} due to cgo pointer checks and go1.6). The commit is ~~7d11641~~ (edit: force pushed to 863bf7d because I mistyped the issue number in the commit message).

The results look good. Though I'm going to look into that Go issue further, and probably try things out on tip/master before doing anything with that commit.

Here are the tentative numbers though. I always love seeing "-100%" :smile:

benchmark                              old ns/op     new ns/op     delta
BenchmarkEnv_ReaderList-4              51303         56360         +9.86%
BenchmarkTxn_Put-4                     1773          1496          -15.62%
BenchmarkTxn_PutReserve-4              1701          1556          -8.52%
BenchmarkTxn_PutReserve_writemap-4     1420          1294          -8.87%
BenchmarkTxn_Put_writemap-4            1494          1234          -17.40%
BenchmarkTxn_Get_ro-4                  1480          1331          -10.07%
BenchmarkTxn_Get_raw_ro-4              854           729           -14.64%
BenchmarkScan_ro-4                     5595657       5582831       -0.23%
BenchmarkScan_raw_ro-4                 1195792       1204928       +0.76%
BenchmarkCursor-4                      692           700           +1.16%
BenchmarkCursor_Renew-4                177           178           +0.56%
BenchmarkTxn_Sub_commit-4              188845        188734        -0.06%
BenchmarkTxn_Sub_abort-4               193656        194200        +0.28%
BenchmarkTxn_abort-4                   14458         14443         -0.10%
BenchmarkTxn_commit-4                  14307         14229         -0.55%
BenchmarkTxn_ro-4                      144698        148477        +2.61%
BenchmarkTxn_unmanaged_abort-4         14347         14632         +1.99%
BenchmarkTxn_unmanaged_commit-4        14625         14477         -1.01%
BenchmarkTxn_unmanaged_ro-4            145135        154876        +6.71%
BenchmarkTxn_renew-4                   531           531           +0.00%
BenchmarkTxn_Put_append-4              682           396           -41.94%
BenchmarkTxn_Put_append_noflag-4       839           575           -31.47%

benchmark                              old allocs     new allocs     delta
BenchmarkEnv_ReaderList-4              113            113            +0.00%
BenchmarkTxn_Put-4                     4              0              -100.00%
BenchmarkTxn_PutReserve-4              2              0              -100.00%
BenchmarkTxn_PutReserve_writemap-4     2              0              -100.00%
BenchmarkTxn_Put_writemap-4            4              0              -100.00%
BenchmarkTxn_Get_ro-4                  3              1              -66.67%
BenchmarkTxn_Get_raw_ro-4              3              1              -66.67%
BenchmarkScan_ro-4                     8151           8163           +0.15%
BenchmarkScan_raw_ro-4                 5              5              +0.00%
BenchmarkCursor-4                      3              3              +0.00%
BenchmarkCursor_Renew-4                0              0              +0.00%
BenchmarkTxn_Sub_commit-4              3              3              +0.00%
BenchmarkTxn_Sub_abort-4               3              3              +0.00%
BenchmarkTxn_abort-4                   5              5              +0.00%
BenchmarkTxn_commit-4                  5              5              +0.00%
BenchmarkTxn_ro-4                      5              5              +0.00%
BenchmarkTxn_unmanaged_abort-4         5              5              +0.00%
BenchmarkTxn_unmanaged_commit-4        5              5              +0.00%
BenchmarkTxn_unmanaged_ro-4            5              5              +0.00%
BenchmarkTxn_renew-4                   0              0              +0.00%
BenchmarkTxn_Put_append-4              5              1              -80.00%
BenchmarkTxn_Put_append_noflag-4       5              1              -80.00%

benchmark                              old bytes     new bytes     delta
BenchmarkEnv_ReaderList-4              7408          7408          +0.00%
BenchmarkTxn_Put-4                     96            0             -100.00%
BenchmarkTxn_PutReserve-4              48            0             -100.00%
BenchmarkTxn_PutReserve_writemap-4     48            0             -100.00%
BenchmarkTxn_Put_writemap-4            96            0             -100.00%
BenchmarkTxn_Get_ro-4                  1148          1099          -4.27%
BenchmarkTxn_Get_raw_ro-4              66            18            -72.73%
BenchmarkScan_ro-4                     9127785       9130894       +0.03%
BenchmarkScan_raw_ro-4                 72            72            +0.00%
BenchmarkCursor-4                      33            33            +0.00%
BenchmarkCursor_Renew-4                0             0             +0.00%
BenchmarkTxn_Sub_commit-4              65            65            +0.00%
BenchmarkTxn_Sub_abort-4               65            65            +0.00%
BenchmarkTxn_abort-4                   97            97            +0.00%
BenchmarkTxn_commit-4                  97            97            +0.00%
BenchmarkTxn_ro-4                      97            97            +0.00%
BenchmarkTxn_unmanaged_abort-4         97            97            +0.00%
BenchmarkTxn_unmanaged_commit-4        97            97            +0.00%
BenchmarkTxn_unmanaged_ro-4            97            97            +0.00%
BenchmarkTxn_renew-4                   0             0             +0.00%
BenchmarkTxn_Put_append-4              104           8             -92.31%
BenchmarkTxn_Put_append_noflag-4       104           8             -92.31%

bmatsuo commented 8 years ago

After testing on tip it seems that the allocations remain present with void* arguments (though, indeed runtime is constant on that example benchmark).

But the allocs mean I may end up merging that char* change. I'll keep investigating a bit longer. I might bring the allocations up on one of the mailing lists.

lmb commented 8 years ago

Nice :+1: Maybe this will make the benchmarks less volatile as well.

bmatsuo commented 8 years ago

I wouldn't get my hopes too high about that :smile:

Even in those numbers I pasted there is some weirdness (e.g. ReaderList +10%?)

The interaction with the system is so high. My suspicion is that you need a real clean-room style environment to run the benchmarks if you want truly consistent times. Something like busybox with minimal external processes, running everything in memory, and where any kind of virtualization has been disabled. IDK though.

bmatsuo commented 8 years ago

I'm tempted to make this part of the next release as well. But it will wait until I can get some more insight about whether those allocations would feasibly be eliminated somehow in the near future.

bmatsuo commented 8 years ago

Yea. I think those allocations are going to be around for a while. I made an issue about it but it doesn't seem to be a high priority.

https://github.com/golang/go/issues/15048

I'm preparing a pull request.

bmatsuo / lmdb-go

lmdb: strange numbers of allocations reported by benchmarks #63