Use a Sample pool to avoid allocations.

beorn7 commented 10 years ago

With the current state, inserting is pretty heavy on the GC. On average, 1 alloc happens per insert. In situations where you want to handle millions of inserts in a short time, that's more or less prohibitive.

The Sample pool implemented here can be replaced by Go 1.3 pools once dependency on 1.3 is acceptable.

Benchmarks:

$ go test -bench=. -benchmem
PASS
BenchmarkInsertTargeted 10000000               159 ns/op              32 B/op          1 allocs/op
BenchmarkInsertTargetedSmallEpsilon      5000000               374 ns/op              32 B/op          1 allocs/op
BenchmarkInsertBiased   10000000               165 ns/op              32 B/op          1 allocs/op
BenchmarkInsertBiasedSmallEpsilon        5000000               877 ns/op              32 B/op          1 allocs/op
BenchmarkQuery  20000000                87.1 ns/op             0 B/op          0 allocs/op
BenchmarkQuerySmallEpsilon        500000              6046 ns/op               0 B/op          0 allocs/op
ok      github.com/bmizerany/perks/quantile     19.193s

$ go test -bench=. -benchmem
PASS
BenchmarkInsertTargeted 10000000               158 ns/op               0 B/op          0 allocs/op
BenchmarkInsertTargetedSmallEpsilon      5000000               363 ns/op               0 B/op          0 allocs/op
BenchmarkInsertBiased   10000000               158 ns/op               0 B/op          0 allocs/op
BenchmarkInsertBiasedSmallEpsilon        5000000               798 ns/op               0 B/op          0 allocs/op
BenchmarkQuery  20000000                82.5 ns/op             0 B/op          0 allocs/op
BenchmarkQuerySmallEpsilon        500000              4759 ns/op               0 B/op          0 allocs/op
ok      github.com/bmizerany/perks/quantile     17.783s

beorn7 commented 10 years ago

@bmizerany I hope you find that interesting... Cheers.

bmizerany commented 10 years ago

I'm interested. Can you show the deltas in performance in this ticket?

http://golang.org/misc/benchcmp

bmizerany commented 10 years ago

Can you create a pool type and a pool_1_2.go and pool_1_3.go that implement it? The later would use sync.Pool in 1.3?

beorn7 commented 10 years ago

benchcmp output:

benchmark old ns/op new ns/op delta BenchmarkInsertTargeted 152 157 +3.29% BenchmarkInsertTargetedSmallEpsilon 368 361 -1.90% BenchmarkInsertBiased 163 158 -3.07% BenchmarkInsertBiasedSmallEpsilon 870 787 -9.54% BenchmarkQuery 85 82 -2.82% BenchmarkQuerySmallEpsilon 5967 4647 -22.12%

benchmark old allocs new allocs delta BenchmarkInsertTargeted 1 0 -100.00% BenchmarkInsertTargetedSmallEpsilon 1 0 -100.00% BenchmarkInsertBiased 1 0 -100.00% BenchmarkInsertBiasedSmallEpsilon 1 0 -100.00% BenchmarkQuery 0 0 n/a% BenchmarkQuerySmallEpsilon 0 0 n/a%

benchmark old bytes new bytes delta BenchmarkInsertTargeted 32 0 -100.00% BenchmarkInsertTargetedSmallEpsilon 32 0 -100.00% BenchmarkInsertBiased 32 0 -100.00% BenchmarkInsertBiasedSmallEpsilon 32 0 -100.00% BenchmarkQuery 0 0 n/a% BenchmarkQuerySmallEpsilon 0 0 n/a%

beorn7 commented 10 years ago

Will look into 1.3 specific build as soon as I find time...

bmizerany commented 10 years ago

Awesome. This is great!

beorn7 commented 10 years ago

Interestingly, the go1.3 pool is slightly slower than the channel based manually implemented pool:

benchcmp new-1.2-pool.txt new-1.3-pool.txt benchmark old ns/op new ns/op delta BenchmarkInsertTargeted 158 181 +14.56% BenchmarkInsertTargetedSmallEpsilon 366 388 +6.01% BenchmarkInsertBiased 158 183 +15.82% BenchmarkInsertBiasedSmallEpsilon 808 825 +2.10% BenchmarkQuery 82 85 +3.14% BenchmarkQuerySmallEpsilon 4541 4889 +7.66%

benchmark old allocs new allocs delta BenchmarkInsertTargeted 0 0 n/a% BenchmarkInsertTargetedSmallEpsilon 0 0 n/a% BenchmarkInsertBiased 0 0 n/a% BenchmarkInsertBiasedSmallEpsilon 0 0 n/a% BenchmarkQuery 0 0 n/a% BenchmarkQuerySmallEpsilon 0 0 n/a%

benchmark old bytes new bytes delta BenchmarkInsertTargeted 0 0 n/a% BenchmarkInsertTargetedSmallEpsilon 0 0 n/a% BenchmarkInsertBiased 0 0 n/a% BenchmarkInsertBiasedSmallEpsilon 0 0 n/a% BenchmarkQuery 0 0 n/a% BenchmarkQuerySmallEpsilon 0 0 n/a%

However, with many streams used at the same time, the go1.3 sync.Pool will save resources, as we only need one global instance (and not one per stream).

bmizerany commented 10 years ago

Hrm. I'm interested to know why things slow down.

beorn7 commented 10 years ago

Yeah, good question... The sync.Pool implementation looks pretty sophisticated. Perhaps the necessary type assertion causes that overhead?

Anyway, if you prefer speed over resource usage, we can just stick with the channel based pool (which exists once per stream instance and therefore consumes more memory).

bmizerany commented 10 years ago

The problem, as as pointed out by Dimtry (of golang) is that your copying the sample instead of using a pointer, which causes allocations and is slowing things down.

On Tuesday, June 24, 2014, Björn Rabenstein notifications@github.com wrote:

Yeah, good question... The sync.Pool implementation looks pretty sophisticated. Perhaps the necessary type assertion causes that overhead?

Anyway, if you prefer speed over resource usage, we can just stick with the channel based pool (which exists once per stream instance and therefore consumes more memory).

— Reply to this email directly or view it on GitHub https://github.com/bmizerany/perks/pull/12#issuecomment-46949861.

beorn7 commented 10 years ago

Could you help me and point out where my change introduces copying of a sample?

Unless -benchmem is mistaken, there are zero allocations happening after my change upon sample insertion. Where do you see an allocation?

bmizerany commented 10 years ago

Huh. He said he saw something there, but I'm not seeing it either.

bmizerany commented 10 years ago

Okay. Here is the status:

I learned that using anything larger than a WORD will cause a copy into and out of sync.Pool. So pointers are fine, but not structs with three fields.

That is good to know, but doesn't help us here. We can remove the allocs by changing s.l from []*Sample to []Sample. My first few attempts broke the tests. If you are curious and want to try yourself, please do. Even though the tests are broke, the benchmarks sped-up considerably and allocs when to 0.

[UPDATE]

This also removes the synchronization necessary for sync.Pool.

bmizerany commented 10 years ago

I pushed the changes here (again, they are broken): https://github.com/bmizerany/perks/tree/broken-but-fast

You can see here it removes the allocs:

$ go test -bench=BenchmarkInsertTargeted$ -run=none
PASS
BenchmarkInsertTargeted 20000000           118 ns/op           0 B/op          0 allocs/op
ok      github.com/bmizerany/perks/quantile 2.502s
$

beorn7 commented 10 years ago

So, your change addresses the alloc problem in a different way, which is even faster in general. Nice.

In #13 , I suggested a fix for your branch. If that works, we can drop this PR.

bmizerany commented 10 years ago

Fixed in #13! Thank you!

bmizerany / perks

Use a Sample pool to avoid allocations. #12