OneOfOne / lfchan

A scalable lock-free channel.
Apache License 2.0

Fix poor scalability to many (true-SMP) cores #3

Open jonhoo opened 8 years ago

jonhoo commented 8 years ago

Following up from golang/go#8899.

OneOfOne commented 8 years ago

Thank you very much for taking the time to do this.

jonhoo commented 8 years ago

Happy to help.

80core-intel $ go test -bench=. -benchmem -cpu 1,2,4,8,16,32,48,64,72,80 -benchtime 10s
PASS
BenchmarkLFChan     100000000          262 ns/op           8 B/op          1 allocs/op
BenchmarkLFChan-2   100000000          266 ns/op           8 B/op          1 allocs/op
BenchmarkLFChan-4   10000000          1745 ns/op           8 B/op          1 allocs/op
BenchmarkLFChan-8    5000000          2908 ns/op           8 B/op          1 allocs/op
BenchmarkLFChan-16   5000000          3179 ns/op           8 B/op          1 allocs/op
BenchmarkLFChan-32   5000000          3230 ns/op           8 B/op          1 allocs/op
BenchmarkLFChan-48   5000000          3241 ns/op           8 B/op          1 allocs/op
BenchmarkLFChan-64   5000000          3234 ns/op           8 B/op          1 allocs/op
BenchmarkLFChan-72   5000000          3219 ns/op           8 B/op          1 allocs/op
BenchmarkLFChan-80   5000000          3196 ns/op           8 B/op          1 allocs/op
BenchmarkChan       100000000          205 ns/op           8 B/op          1 allocs/op
BenchmarkChan-2     20000000           676 ns/op           8 B/op          1 allocs/op
BenchmarkChan-4     10000000          1390 ns/op           8 B/op          1 allocs/op
BenchmarkChan-8     20000000           865 ns/op           8 B/op          1 allocs/op
BenchmarkChan-16    20000000           690 ns/op           8 B/op          1 allocs/op
BenchmarkChan-32    20000000           721 ns/op           8 B/op          1 allocs/op
BenchmarkChan-48    20000000           707 ns/op           8 B/op          1 allocs/op
BenchmarkChan-64    20000000           697 ns/op           8 B/op          1 allocs/op
BenchmarkChan-72    20000000           711 ns/op           8 B/op          1 allocs/op
BenchmarkChan-80    20000000           676 ns/op           8 B/op          1 allocs/op
ok      github.com/OneOfOne/lfchan  476.787s
48core-amd $ go test -bench=. -benchmem -cpu 1,2,4,8,16,32,48 -benchtime 10s
PASS
BenchmarkLFChan     50000000           358 ns/op           8 B/op          1 allocs/op
BenchmarkLFChan-2   30000000           529 ns/op           8 B/op          1 allocs/op
BenchmarkLFChan-4   20000000           833 ns/op           8 B/op          1 allocs/op
BenchmarkLFChan-8   20000000           882 ns/op           8 B/op          1 allocs/op
BenchmarkLFChan-16  20000000           934 ns/op           8 B/op          1 allocs/op
BenchmarkLFChan-32  20000000           911 ns/op           8 B/op          1 allocs/op
BenchmarkLFChan-48  20000000           922 ns/op           8 B/op          1 allocs/op
BenchmarkChan       100000000          266 ns/op           8 B/op          1 allocs/op
BenchmarkChan-2     20000000          1200 ns/op           8 B/op          1 allocs/op
BenchmarkChan-4     10000000          1202 ns/op           8 B/op          1 allocs/op
BenchmarkChan-8     10000000          1361 ns/op           8 B/op          1 allocs/op
BenchmarkChan-16    20000000          1129 ns/op           8 B/op          1 allocs/op
BenchmarkChan-32    20000000           923 ns/op           8 B/op          1 allocs/op
BenchmarkChan-48    20000000           892 ns/op           8 B/op          1 allocs/op
ok      github.com/OneOfOne/lfchan  356.257s
$ go test -c -o lfchan.test
$ ./lfchan.test -test.bench=LFChan -test.cpu 80 -test.benchtime 10s -test.cpuprofile lf.prof
$ go tool pprof lfchan.test lf.prof
Entering interactive mode (type "help" for commands)
(pprof) top20
42.38s of 43.78s total (96.80%)
Dropped 116 nodes (cum <= 0.22s)
Showing top 20 nodes out of 32 (cum >= 0.41s)
      flat  flat%   sum%        cum   cum%
    21.09s 48.17% 48.17%     21.09s 48.17%  sync/atomic.AddUint32
    10.28s 23.48% 71.65%     10.28s 23.48%  sync/atomic.LoadUint32
     4.62s 10.55% 82.21%     21.27s 48.58%  github.com/OneOfOne/lfchan.Chan.Send
     4.13s  9.43% 91.64%     19.13s 43.70%  github.com/OneOfOne/lfchan.Chan.Recv
     0.73s  1.67% 93.31%      0.73s  1.67%  runtime.futex
     0.71s  1.62% 94.93%      0.71s  1.62%  runtime/internal/atomic.Xadd
     0.30s  0.69% 95.61%      0.47s  1.07%  runtime.mallocgc
     0.25s  0.57% 96.19%      0.73s  1.67%  runtime.convT2E
     0.06s  0.14% 96.32%      0.80s  1.83%  runtime.schedule
     0.05s  0.11% 96.44%      0.48s  1.10%  runtime.lock
     0.04s 0.091% 96.53%      0.87s  1.99%  runtime.gcDrain
     0.03s 0.069% 96.60%      0.39s  0.89%  runtime.newstack
     0.02s 0.046% 96.64%     41.35s 94.45%  github.com/OneOfOne/lfchan.BenchmarkLFChan.func1
     0.02s 0.046% 96.69%      9.81s 22.41%  github.com/OneOfOne/lfchan.Chan.Closed
     0.02s 0.046% 96.73%      0.51s  1.16%  github.com/OneOfOne/lfchan.Chan.Len
     0.02s 0.046% 96.78%      1.28s  2.92%  runtime.goschedImpl
     0.01s 0.023% 96.80%      0.40s  0.91%  runtime.morestack
         0     0% 96.80%      0.45s  1.03%  runtime.findrunnable
         0     0% 96.80%      0.32s  0.73%  runtime.futexsleep
         0     0% 96.80%      0.41s  0.94%  runtime.futexwakeup

(benchmark graphs: lf, std)
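The profile above shows sync/atomic.AddUint32 and LoadUint32 together eating over 70% of cycles, which is the classic signature of many cores fighting over the same cache line. A minimal, self-contained sketch of that contention pattern (this is illustrative, not lfchan's code): every goroutine hammers one shared word, so throughput is bounded by cache-line transfer latency, not core count.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

// hammer has n goroutines each perform iters atomic increments on a
// single shared counter and returns the final value. Adding goroutines
// does not speed this up: every AddUint32 must acquire the cache line
// holding ctr in exclusive mode, serializing all cores.
func hammer(n, iters int) uint32 {
	var ctr uint32
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < iters; j++ {
				atomic.AddUint32(&ctr, 1)
			}
		}()
	}
	wg.Wait()
	return ctr
}

func main() {
	n := runtime.NumCPU()
	fmt.Println(hammer(n, 100000) == uint32(n*100000)) // prints true; count is exact
}
```

On a true SMP box the cache-line ping-pong crosses sockets, which is far more expensive than on a single multicore die; that would be consistent with the Intel numbers degrading harder than the AMD ones.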

OneOfOne commented 8 years ago

I completely removed the spinlock; it sadly adds allocation overhead, but in theory it should work better on your Intel box.

If the changes don't produce a noticeable improvement, can you please test with sed -e s/int32/int64/g -i lfchan.go, just to test the theory that for some reason the 32-bit atomics are slower than the 64-bit ones on older processors.

OneOfOne commented 8 years ago

@davecheney

jonhoo commented 8 years ago

@OneOfOne 6ce8c5b66fcc4a1d8760a60110281ea120070c5c performs identically to the previous version on both Intel and AMD. sed -e s/int32/int64/g -i lfchan.go causes a bunch of the tests to fail, so I didn't try that.

OneOfOne commented 8 years ago

@jonhoo I'm completely out of ideas :( My only theory is that running it on a true SMP system is different from running it on a single multicore processor, but I have zero experience with a "true" SMP system.

I'm open to suggestions.

OneOfOne commented 8 years ago

@jonhoo were you testing with go tip or go1.6?

jonhoo commented 8 years ago

@OneOfOne go1.6. I don't know what the right direction to move forward is. Lock-free implementations aren't necessarily scalable, so this could be a more fundamental problem. Hard to say.

OneOfOne commented 8 years ago

@jonhoo I'm testing another theory, that sharding the queue might help; if you have the time, can you please test again?
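For readers following along, the idea behind sharding is to split one contended queue into several independent sub-queues so that concurrent senders and receivers mostly touch different atomic words. A rough sketch of the concept (hypothetical type, not lfchan's actual implementation; note that sharding trades away strict FIFO order across the whole queue):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// shardedQueue spreads operations across several sub-queues chosen by a
// round-robin counter, so contention is divided by the shard count.
type shardedQueue struct {
	next   uint32
	shards []chan interface{}
}

func newShardedQueue(nShards, perShard int) *shardedQueue {
	q := &shardedQueue{shards: make([]chan interface{}, nShards)}
	for i := range q.shards {
		q.shards[i] = make(chan interface{}, perShard)
	}
	return q
}

func (q *shardedQueue) Send(v interface{}) {
	// One atomic add per send, but blocking happens on the shard,
	// not on a single global queue.
	i := atomic.AddUint32(&q.next, 1)
	q.shards[int(i)%len(q.shards)] <- v
}

func (q *shardedQueue) Recv() interface{} {
	// Poll shards; a real implementation would park instead of spinning.
	for {
		for _, s := range q.shards {
			select {
			case v := <-s:
				return v
			default:
			}
		}
	}
}

func main() {
	q := newShardedQueue(4, 16)
	for i := 0; i < 8; i++ {
		q.Send(i)
	}
	sum := 0
	for i := 0; i < 8; i++ {
		sum += q.Recv().(int)
	}
	fmt.Println(sum) // prints 28 (0+1+...+7); per-shard order holds, global order does not
}
```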

jonhoo commented 8 years ago
80-core $
BenchmarkLFChan     30000000           426 ns/op          40 B/op          4 allocs/op
BenchmarkLFChan-2   10000000          1548 ns/op          40 B/op          4 allocs/op
BenchmarkLFChan-4    1000000         19015 ns/op          49 B/op          5 allocs/op
BenchmarkLFChan-8    5000000          5030 ns/op          42 B/op          4 allocs/op
BenchmarkLFChan-16  10000000          1999 ns/op          41 B/op          4 allocs/op
BenchmarkLFChan-32  10000000          1338 ns/op          40 B/op          4 allocs/op
BenchmarkLFChan-48  10000000          1346 ns/op          40 B/op          4 allocs/op
BenchmarkLFChan-64  10000000          1384 ns/op          40 B/op          4 allocs/op
BenchmarkLFChan-72  10000000          1411 ns/op          40 B/op          4 allocs/op
BenchmarkLFChan-80  10000000          1426 ns/op          40 B/op          4 allocs/op
BenchmarkChan       100000000          183 ns/op           8 B/op          1 allocs/op
BenchmarkChan-2     50000000           696 ns/op           8 B/op          1 allocs/op
BenchmarkChan-4     10000000          1323 ns/op           8 B/op          1 allocs/op
BenchmarkChan-8     30000000           918 ns/op           8 B/op          1 allocs/op
BenchmarkChan-16    20000000           757 ns/op           8 B/op          1 allocs/op
BenchmarkChan-32    20000000           679 ns/op           8 B/op          1 allocs/op
BenchmarkChan-48    20000000           713 ns/op           8 B/op          1 allocs/op
BenchmarkChan-64    20000000           745 ns/op           8 B/op          1 allocs/op
BenchmarkChan-72    20000000           717 ns/op           8 B/op          1 allocs/op
BenchmarkChan-80    20000000           715 ns/op           8 B/op          1 allocs/op
ok      github.com/OneOfOne/lfchan  363.410s
48-core $
BenchmarkLFChan     20000000           628 ns/op          40 B/op          4 allocs/op
BenchmarkLFChan-2   30000000           609 ns/op          40 B/op          4 allocs/op
BenchmarkLFChan-4   20000000           960 ns/op          40 B/op          4 allocs/op
BenchmarkLFChan-8   10000000          1189 ns/op          40 B/op          4 allocs/op
BenchmarkLFChan-16  20000000           814 ns/op          40 B/op          4 allocs/op
BenchmarkLFChan-32  20000000           801 ns/op          40 B/op          4 allocs/op
BenchmarkLFChan-48  20000000           846 ns/op          40 B/op          4 allocs/op
BenchmarkChan       50000000           254 ns/op           8 B/op          1 allocs/op
BenchmarkChan-2     20000000          1092 ns/op           8 B/op          1 allocs/op
BenchmarkChan-4     10000000          1163 ns/op           8 B/op          1 allocs/op
BenchmarkChan-8     20000000          1084 ns/op           8 B/op          1 allocs/op
BenchmarkChan-16    20000000          1026 ns/op           8 B/op          1 allocs/op
BenchmarkChan-32    20000000           945 ns/op           8 B/op          1 allocs/op
BenchmarkChan-48    20000000           896 ns/op           8 B/op          1 allocs/op
ok      github.com/OneOfOne/lfchan  249.123s

(benchmark graph)

OneOfOne commented 8 years ago

@jonhoo would you please give it another try.

Also off topic, how do you generate those graphs?

OneOfOne commented 8 years ago

Run it with -run NONE; I broke the FIFO test, but that's not a big deal for now.

jonhoo commented 8 years ago

80-core machine is busy atm, but below are the results for the AMD machine. I'll run the 80-core benchmark tomorrow.

BenchmarkLFChan     50000000           355 ns/op           8 B/op          1 allocs/op
BenchmarkLFChan-2   50000000           532 ns/op           8 B/op          1 allocs/op
BenchmarkLFChan-4   20000000           808 ns/op           8 B/op          1 allocs/op
BenchmarkLFChan-8   20000000           807 ns/op           8 B/op          1 allocs/op
BenchmarkLFChan-16  20000000           896 ns/op           8 B/op          1 allocs/op
BenchmarkLFChan-32  20000000          1011 ns/op           8 B/op          1 allocs/op
BenchmarkLFChan-48  10000000          1217 ns/op           8 B/op          1 allocs/op
BenchmarkChan       50000000           247 ns/op           8 B/op          1 allocs/op
BenchmarkChan-2     20000000           917 ns/op           8 B/op          1 allocs/op
BenchmarkChan-4     10000000          1188 ns/op           8 B/op          1 allocs/op
BenchmarkChan-8     20000000          1103 ns/op           8 B/op          1 allocs/op
BenchmarkChan-16    20000000           905 ns/op           8 B/op          1 allocs/op
BenchmarkChan-32    20000000           921 ns/op           8 B/op          1 allocs/op
BenchmarkChan-48    20000000           878 ns/op           8 B/op          1 allocs/op
ok      github.com/OneOfOne/lfchan  257.589s

I generate the graphs with the following R script:

# bench.plot.r
t <- read.table(file("bench.dat"))
t[,2] = sub("([^-0-9])$", "\\1-1", t[,2])
timpls = regexpr("^[^-]+", t[,2], perl=TRUE)
tcores = regexpr("(?<=-)[0-9]*$", t[,2], perl=TRUE)
t <- data.frame(model=t[,1], impl=regmatches(t[,2], timpls), cores=as.numeric(regmatches(t[,2], tcores)), opss=(1000000000/t[,4])/1000)
t$grp = paste(t$model, t$impl)
t

library(ggplot2)
p <- ggplot(data=t, aes(x=cores, y=opss, color=model, linetype=impl, group=grp))
p <- p + xlim(1,81)
p <- p + geom_line() + geom_point()
p <- p + coord_trans(x = "log2")
p <- p + xlab("# cores") + ylab("kops/s") + ggtitle("Benchmarking lock-free Go channels")
ggsave('bench.png',plot=p,width=6,height=4)
$ cat bench.dat 
AMD   BenchmarkLFChan       20000000           628 ns/op          40 B/op          4 allocs/op
...
Intel BenchmarkLFChan       30000000           426 ns/op          40 B/op          4 allocs/op
...
$ R --no-save < bench.plot.r
$ open bench.png
jonhoo commented 8 years ago
80-core $
BenchmarkLFChan     50000000           251 ns/op           8 B/op          1 allocs/op
BenchmarkLFChan-2   50000000           328 ns/op           8 B/op          1 allocs/op
BenchmarkLFChan-4   30000000           564 ns/op           8 B/op          1 allocs/op
BenchmarkLFChan-8   20000000          1092 ns/op           8 B/op          1 allocs/op
BenchmarkLFChan-16  10000000          1241 ns/op           8 B/op          1 allocs/op
BenchmarkLFChan-32  20000000          1223 ns/op           8 B/op          1 allocs/op
BenchmarkLFChan-48  10000000          1220 ns/op           8 B/op          1 allocs/op
BenchmarkLFChan-64  10000000          1193 ns/op           8 B/op          1 allocs/op
BenchmarkLFChan-72  10000000          1235 ns/op           8 B/op          1 allocs/op
BenchmarkLFChan-80  10000000          1206 ns/op           8 B/op          1 allocs/op
BenchmarkChan       100000000          182 ns/op           8 B/op          1 allocs/op
BenchmarkChan-2     50000000           648 ns/op           8 B/op          1 allocs/op
BenchmarkChan-4     20000000          1385 ns/op           8 B/op          1 allocs/op
BenchmarkChan-8     20000000           993 ns/op           8 B/op          1 allocs/op
BenchmarkChan-16    20000000           809 ns/op           8 B/op          1 allocs/op
BenchmarkChan-32    20000000           709 ns/op           8 B/op          1 allocs/op
BenchmarkChan-48    20000000           736 ns/op           8 B/op          1 allocs/op
BenchmarkChan-64    20000000           734 ns/op           8 B/op          1 allocs/op
BenchmarkChan-72    20000000           732 ns/op           8 B/op          1 allocs/op
BenchmarkChan-80    20000000           725 ns/op           8 B/op          1 allocs/op
ok      github.com/OneOfOne/lfchan  356.899s
48-core $
BenchmarkLFChan     50000000           334 ns/op           8 B/op          1 allocs/op
BenchmarkLFChan-2   50000000           332 ns/op           8 B/op          1 allocs/op
BenchmarkLFChan-4   30000000           410 ns/op           8 B/op          1 allocs/op
BenchmarkLFChan-8   20000000           821 ns/op           8 B/op          1 allocs/op
BenchmarkLFChan-16  20000000           800 ns/op           8 B/op          1 allocs/op
BenchmarkLFChan-32  20000000           803 ns/op           8 B/op          1 allocs/op
BenchmarkLFChan-48  20000000           814 ns/op           8 B/op          1 allocs/op
BenchmarkChan       50000000           251 ns/op           8 B/op          1 allocs/op
BenchmarkChan-2     20000000          1153 ns/op           8 B/op          1 allocs/op
BenchmarkChan-4     20000000          1235 ns/op           8 B/op          1 allocs/op
BenchmarkChan-8     20000000          1172 ns/op           8 B/op          1 allocs/op
BenchmarkChan-16    20000000           999 ns/op           8 B/op          1 allocs/op
BenchmarkChan-32    20000000           921 ns/op           8 B/op          1 allocs/op
BenchmarkChan-48    20000000           895 ns/op           8 B/op          1 allocs/op
ok      github.com/OneOfOne/lfchan  261.502s

(benchmark graph)

OneOfOne commented 8 years ago

The results on the AMD box look promising with that last change.

jonhoo commented 8 years ago

Yup. It's somewhat disconcerting that the performance goes down, not up, as you add more cores, though. Theoretically, the total amount of work the system can do should increase with core count; you can see that trend for the standard Go channels.

OneOfOne commented 8 years ago

In all my local tests, lfchan's performance goes up with the number of cores and Go channels' goes down.

I wonder if this is somewhat related to https://github.com/golang/go/issues/14406 or https://github.com/golang/go/issues/13805.

I have an idea, I'll make the tests use specialized types instead of interface{}.

Qubitium commented 8 years ago

@OneOfOne Even though the results on true SMP are not as desired, I still find your code and the general discussion fascinating. =)

At least on Go tip, any atomic.Load/Store(&intVar) causes a heap allocation of the &intVar pointer, per the escape analysis output. You can avoid this by changing the fields from values to pointers:

// before
len int32
die uint32

// after
len *int32
die *uint32

// and in the constructor
len = new(int32)
die = new(uint32)

Since there are lots of tight loops in the code that use atomic.Load, this may help, or not. Worth a try.

aclements commented 8 years ago

@diegomontoya, surely those are already on the heap? There's no point in using atomic operations on variables unless they're shared, and if they're shared they must be in the heap.

(I'm filing a bug against sync/atomic anyway, since the fix is easy and we've already done this for the runtime-internal atomics.)

Qubitium commented 8 years ago

@aclements You are right that the variable being acted on ends up on the heap regardless. But it appears that using the *ptr method does make things faster.


package pool

import (
    "sync/atomic"
    "testing"
)

var a int64
var aPtr *int64 = new(int64)
var a32 int32
var a32Ptr *int32 = new(int32)

func BenchmarkA(b *testing.B) {
    for i := 0; i < b.N; i++ {
        atomic.LoadInt64(&a)
    }
}

func BenchmarkAPtr(b *testing.B) {
    for i := 0; i < b.N; i++ {
        atomic.LoadInt64(aPtr)
    }
}

func BenchmarkA32(b *testing.B) {
    for i := 0; i < b.N; i++ {
        atomic.LoadInt32(&a32)
    }
}

func BenchmarkA32Ptr(b *testing.B) {
    for i := 0; i < b.N; i++ {
        atomic.LoadInt32(a32Ptr)
    }
}
BenchmarkA 1000000000            1.99 ns/op        0 B/op          0 allocs/op
BenchmarkAPtr 2000000000             1.81 ns/op        0 B/op          0 allocs/op
BenchmarkA32 1000000000          2.18 ns/op        0 B/op          0 allocs/op
BenchmarkA32Ptr 1000000000           2.16 ns/op        0 B/op          0 allocs/op

On my Intel 6820HK laptop (4 cores / 8 threads), the *int64 version always comes out on top vs &int64.

@OneOfOne Strangely, int32 atomic operations are consistently slower than int64 on my Linux machine, so switching from int32 to int64 might actually improve things.

OneOfOne commented 8 years ago

I'll give it a try later on.