jonhoo opened this issue 8 years ago
Thank you very much for taking the time to do this.
Happy to help.
80core-intel $ go test -bench=. -benchmem -cpu 1,2,4,8,16,32,48,64,72,80 -benchtime 10s
PASS
BenchmarkLFChan 100000000 262 ns/op 8 B/op 1 allocs/op
BenchmarkLFChan-2 100000000 266 ns/op 8 B/op 1 allocs/op
BenchmarkLFChan-4 10000000 1745 ns/op 8 B/op 1 allocs/op
BenchmarkLFChan-8 5000000 2908 ns/op 8 B/op 1 allocs/op
BenchmarkLFChan-16 5000000 3179 ns/op 8 B/op 1 allocs/op
BenchmarkLFChan-32 5000000 3230 ns/op 8 B/op 1 allocs/op
BenchmarkLFChan-48 5000000 3241 ns/op 8 B/op 1 allocs/op
BenchmarkLFChan-64 5000000 3234 ns/op 8 B/op 1 allocs/op
BenchmarkLFChan-72 5000000 3219 ns/op 8 B/op 1 allocs/op
BenchmarkLFChan-80 5000000 3196 ns/op 8 B/op 1 allocs/op
BenchmarkChan 100000000 205 ns/op 8 B/op 1 allocs/op
BenchmarkChan-2 20000000 676 ns/op 8 B/op 1 allocs/op
BenchmarkChan-4 10000000 1390 ns/op 8 B/op 1 allocs/op
BenchmarkChan-8 20000000 865 ns/op 8 B/op 1 allocs/op
BenchmarkChan-16 20000000 690 ns/op 8 B/op 1 allocs/op
BenchmarkChan-32 20000000 721 ns/op 8 B/op 1 allocs/op
BenchmarkChan-48 20000000 707 ns/op 8 B/op 1 allocs/op
BenchmarkChan-64 20000000 697 ns/op 8 B/op 1 allocs/op
BenchmarkChan-72 20000000 711 ns/op 8 B/op 1 allocs/op
BenchmarkChan-80 20000000 676 ns/op 8 B/op 1 allocs/op
ok github.com/OneOfOne/lfchan 476.787s
48core-amd $ go test -bench=. -benchmem -cpu 1,2,4,8,16,32,48 -benchtime 10s
PASS
BenchmarkLFChan 50000000 358 ns/op 8 B/op 1 allocs/op
BenchmarkLFChan-2 30000000 529 ns/op 8 B/op 1 allocs/op
BenchmarkLFChan-4 20000000 833 ns/op 8 B/op 1 allocs/op
BenchmarkLFChan-8 20000000 882 ns/op 8 B/op 1 allocs/op
BenchmarkLFChan-16 20000000 934 ns/op 8 B/op 1 allocs/op
BenchmarkLFChan-32 20000000 911 ns/op 8 B/op 1 allocs/op
BenchmarkLFChan-48 20000000 922 ns/op 8 B/op 1 allocs/op
BenchmarkChan 100000000 266 ns/op 8 B/op 1 allocs/op
BenchmarkChan-2 20000000 1200 ns/op 8 B/op 1 allocs/op
BenchmarkChan-4 10000000 1202 ns/op 8 B/op 1 allocs/op
BenchmarkChan-8 10000000 1361 ns/op 8 B/op 1 allocs/op
BenchmarkChan-16 20000000 1129 ns/op 8 B/op 1 allocs/op
BenchmarkChan-32 20000000 923 ns/op 8 B/op 1 allocs/op
BenchmarkChan-48 20000000 892 ns/op 8 B/op 1 allocs/op
ok github.com/OneOfOne/lfchan 356.257s
$ go test -c -o lfchan.test
$ ./lfchan.test -test.bench=LFChan -test.cpu 80 -test.benchtime 10s -test.cpuprofile lf.prof
$ go tool pprof lfchan.test lf.prof
Entering interactive mode (type "help" for commands)
(pprof) top20
42.38s of 43.78s total (96.80%)
Dropped 116 nodes (cum <= 0.22s)
Showing top 20 nodes out of 32 (cum >= 0.41s)
flat flat% sum% cum cum%
21.09s 48.17% 48.17% 21.09s 48.17% sync/atomic.AddUint32
10.28s 23.48% 71.65% 10.28s 23.48% sync/atomic.LoadUint32
4.62s 10.55% 82.21% 21.27s 48.58% github.com/OneOfOne/lfchan.Chan.Send
4.13s 9.43% 91.64% 19.13s 43.70% github.com/OneOfOne/lfchan.Chan.Recv
0.73s 1.67% 93.31% 0.73s 1.67% runtime.futex
0.71s 1.62% 94.93% 0.71s 1.62% runtime/internal/atomic.Xadd
0.30s 0.69% 95.61% 0.47s 1.07% runtime.mallocgc
0.25s 0.57% 96.19% 0.73s 1.67% runtime.convT2E
0.06s 0.14% 96.32% 0.80s 1.83% runtime.schedule
0.05s 0.11% 96.44% 0.48s 1.10% runtime.lock
0.04s 0.091% 96.53% 0.87s 1.99% runtime.gcDrain
0.03s 0.069% 96.60% 0.39s 0.89% runtime.newstack
0.02s 0.046% 96.64% 41.35s 94.45% github.com/OneOfOne/lfchan.BenchmarkLFChan.func1
0.02s 0.046% 96.69% 9.81s 22.41% github.com/OneOfOne/lfchan.Chan.Closed
0.02s 0.046% 96.73% 0.51s 1.16% github.com/OneOfOne/lfchan.Chan.Len
0.02s 0.046% 96.78% 1.28s 2.92% runtime.goschedImpl
0.01s 0.023% 96.80% 0.40s 0.91% runtime.morestack
0 0% 96.80% 0.45s 1.03% runtime.findrunnable
0 0% 96.80% 0.32s 0.73% runtime.futexsleep
0 0% 96.80% 0.41s 0.94% runtime.futexwakeup
I completely removed the spinlock. Sadly it adds allocation overhead, but in theory it should work better on your Intel box.
If the changes don't give a noticeable improvement, can you please test with sed -e s/int32/int64/g -i lfchan.go,
just to test a theory that, for some reason, the 32-bit atomics are slower than the 64-bit ones on older processors.
@davecheney
@OneOfOne 6ce8c5b66fcc4a1d8760a60110281ea120070c5c performs identically to the previous version on both Intel and AMD. sed -e s/int32/int64/g -i lfchan.go causes a bunch of the tests to fail, so I didn't try that.
@jonhoo I'm completely out of ideas :( My only theory is that running it on a true SMP system behaves differently than on a single multicore processor, but I have zero experience with a "true" SMP system.
I'm open to suggestions.
@jonhoo were you testing with go tip or go1.6?
@OneOfOne go1.6. I don't know what the right way forward is. Lock-free implementations aren't necessarily scalable, so this could be a more fundamental problem. Hard to say.
@jonhoo I'm testing another theory, to see if sharding the queue would help. If you have the time, can you please test again?
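For readers following along: "sharding the queue" presumably means splitting the single queue into several sub-queues picked by an atomic counter, so concurrent senders mostly contend on different shards. A minimal illustrative sketch (mutex-based for brevity, with invented names — not lfchan's actual code):

```go
// shardedqueue.go — an illustrative sketch of queue sharding to reduce
// contention. Names and structure are hypothetical, not lfchan's.
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

const numShards = 16

type shard struct {
	mu    sync.Mutex
	items []interface{}
}

// ShardedQueue spreads operations across shards via an atomic counter,
// so concurrent senders mostly touch different locks and cache lines.
type ShardedQueue struct {
	shards [numShards]shard
	sendN  uint32
	recvN  uint32
}

func (q *ShardedQueue) Send(v interface{}) {
	s := &q.shards[atomic.AddUint32(&q.sendN, 1)%numShards]
	s.mu.Lock()
	s.items = append(s.items, v)
	s.mu.Unlock()
}

// Recv starts at the next receive shard and scans all shards once.
func (q *ShardedQueue) Recv() (interface{}, bool) {
	start := atomic.AddUint32(&q.recvN, 1)
	for i := uint32(0); i < numShards; i++ {
		s := &q.shards[(start+i)%numShards]
		s.mu.Lock()
		if len(s.items) > 0 {
			v := s.items[0]
			s.items = s.items[1:]
			s.mu.Unlock()
			return v, true
		}
		s.mu.Unlock()
	}
	return nil, false
}

func main() {
	var q ShardedQueue
	q.Send(42)
	v, ok := q.Recv()
	fmt.Println(v, ok) // 42 true
}
```

Note that striping like this gives up strict FIFO ordering across shards, which is consistent with the broken FIFO test mentioned later in the thread.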
80-core $
BenchmarkLFChan 30000000 426 ns/op 40 B/op 4 allocs/op
BenchmarkLFChan-2 10000000 1548 ns/op 40 B/op 4 allocs/op
BenchmarkLFChan-4 1000000 19015 ns/op 49 B/op 5 allocs/op
BenchmarkLFChan-8 5000000 5030 ns/op 42 B/op 4 allocs/op
BenchmarkLFChan-16 10000000 1999 ns/op 41 B/op 4 allocs/op
BenchmarkLFChan-32 10000000 1338 ns/op 40 B/op 4 allocs/op
BenchmarkLFChan-48 10000000 1346 ns/op 40 B/op 4 allocs/op
BenchmarkLFChan-64 10000000 1384 ns/op 40 B/op 4 allocs/op
BenchmarkLFChan-72 10000000 1411 ns/op 40 B/op 4 allocs/op
BenchmarkLFChan-80 10000000 1426 ns/op 40 B/op 4 allocs/op
BenchmarkChan 100000000 183 ns/op 8 B/op 1 allocs/op
BenchmarkChan-2 50000000 696 ns/op 8 B/op 1 allocs/op
BenchmarkChan-4 10000000 1323 ns/op 8 B/op 1 allocs/op
BenchmarkChan-8 30000000 918 ns/op 8 B/op 1 allocs/op
BenchmarkChan-16 20000000 757 ns/op 8 B/op 1 allocs/op
BenchmarkChan-32 20000000 679 ns/op 8 B/op 1 allocs/op
BenchmarkChan-48 20000000 713 ns/op 8 B/op 1 allocs/op
BenchmarkChan-64 20000000 745 ns/op 8 B/op 1 allocs/op
BenchmarkChan-72 20000000 717 ns/op 8 B/op 1 allocs/op
BenchmarkChan-80 20000000 715 ns/op 8 B/op 1 allocs/op
ok github.com/OneOfOne/lfchan 363.410s
48-core $
BenchmarkLFChan 20000000 628 ns/op 40 B/op 4 allocs/op
BenchmarkLFChan-2 30000000 609 ns/op 40 B/op 4 allocs/op
BenchmarkLFChan-4 20000000 960 ns/op 40 B/op 4 allocs/op
BenchmarkLFChan-8 10000000 1189 ns/op 40 B/op 4 allocs/op
BenchmarkLFChan-16 20000000 814 ns/op 40 B/op 4 allocs/op
BenchmarkLFChan-32 20000000 801 ns/op 40 B/op 4 allocs/op
BenchmarkLFChan-48 20000000 846 ns/op 40 B/op 4 allocs/op
BenchmarkChan 50000000 254 ns/op 8 B/op 1 allocs/op
BenchmarkChan-2 20000000 1092 ns/op 8 B/op 1 allocs/op
BenchmarkChan-4 10000000 1163 ns/op 8 B/op 1 allocs/op
BenchmarkChan-8 20000000 1084 ns/op 8 B/op 1 allocs/op
BenchmarkChan-16 20000000 1026 ns/op 8 B/op 1 allocs/op
BenchmarkChan-32 20000000 945 ns/op 8 B/op 1 allocs/op
BenchmarkChan-48 20000000 896 ns/op 8 B/op 1 allocs/op
ok github.com/OneOfOne/lfchan 249.123s
@jonhoo would you please give it another try?
Also, off topic: how do you generate those graphs?
Run it with -run NONE; I broke the FIFO test, not a big deal for now.
The 80-core machine is busy at the moment, but below are the results from the AMD machine. I'll run the 80-core benchmark tomorrow.
BenchmarkLFChan 50000000 355 ns/op 8 B/op 1 allocs/op
BenchmarkLFChan-2 50000000 532 ns/op 8 B/op 1 allocs/op
BenchmarkLFChan-4 20000000 808 ns/op 8 B/op 1 allocs/op
BenchmarkLFChan-8 20000000 807 ns/op 8 B/op 1 allocs/op
BenchmarkLFChan-16 20000000 896 ns/op 8 B/op 1 allocs/op
BenchmarkLFChan-32 20000000 1011 ns/op 8 B/op 1 allocs/op
BenchmarkLFChan-48 10000000 1217 ns/op 8 B/op 1 allocs/op
BenchmarkChan 50000000 247 ns/op 8 B/op 1 allocs/op
BenchmarkChan-2 20000000 917 ns/op 8 B/op 1 allocs/op
BenchmarkChan-4 10000000 1188 ns/op 8 B/op 1 allocs/op
BenchmarkChan-8 20000000 1103 ns/op 8 B/op 1 allocs/op
BenchmarkChan-16 20000000 905 ns/op 8 B/op 1 allocs/op
BenchmarkChan-32 20000000 921 ns/op 8 B/op 1 allocs/op
BenchmarkChan-48 20000000 878 ns/op 8 B/op 1 allocs/op
ok github.com/OneOfOne/lfchan 257.589s
I generate the graphs with the following R script:
# bench.plot.r
t <- read.table(file("bench.dat"))
t[,2] = sub("([^-0-9])$", "\\1-1", t[,2])
timpls = regexpr("^[^-]+", t[,2], perl=TRUE)
tcores = regexpr("(?<=-)[0-9]*$", t[,2], perl=TRUE)
t <- data.frame(model=t[,1], impl=regmatches(t[,2], timpls), cores=as.numeric(regmatches(t[,2], tcores)), opss=(1000000000/t[,4])/1000)
t$grp = paste(t$model, t$impl)
t
library(ggplot2)
p <- ggplot(data=t, aes(x=cores, y=opss, color=model, linetype=impl, group=grp))
p <- p + xlim(1,81)
p <- p + geom_line() + geom_point()
p <- p + coord_trans(x = "log2")
p <- p + xlab("# cores") + ylab("kops/s") + ggtitle("Benchmarking lock-free Go channels")
ggsave('bench.png',plot=p,width=6,height=4)
$ cat bench.dat
AMD BenchmarkLFChan 20000000 628 ns/op 40 B/op 4 allocs/op
...
Intel BenchmarkLFChan 30000000 426 ns/op 40 B/op 4 allocs/op
...
$ R --no-save < bench.plot.r
$ open bench.png
80-core $
BenchmarkLFChan 50000000 251 ns/op 8 B/op 1 allocs/op
BenchmarkLFChan-2 50000000 328 ns/op 8 B/op 1 allocs/op
BenchmarkLFChan-4 30000000 564 ns/op 8 B/op 1 allocs/op
BenchmarkLFChan-8 20000000 1092 ns/op 8 B/op 1 allocs/op
BenchmarkLFChan-16 10000000 1241 ns/op 8 B/op 1 allocs/op
BenchmarkLFChan-32 20000000 1223 ns/op 8 B/op 1 allocs/op
BenchmarkLFChan-48 10000000 1220 ns/op 8 B/op 1 allocs/op
BenchmarkLFChan-64 10000000 1193 ns/op 8 B/op 1 allocs/op
BenchmarkLFChan-72 10000000 1235 ns/op 8 B/op 1 allocs/op
BenchmarkLFChan-80 10000000 1206 ns/op 8 B/op 1 allocs/op
BenchmarkChan 100000000 182 ns/op 8 B/op 1 allocs/op
BenchmarkChan-2 50000000 648 ns/op 8 B/op 1 allocs/op
BenchmarkChan-4 20000000 1385 ns/op 8 B/op 1 allocs/op
BenchmarkChan-8 20000000 993 ns/op 8 B/op 1 allocs/op
BenchmarkChan-16 20000000 809 ns/op 8 B/op 1 allocs/op
BenchmarkChan-32 20000000 709 ns/op 8 B/op 1 allocs/op
BenchmarkChan-48 20000000 736 ns/op 8 B/op 1 allocs/op
BenchmarkChan-64 20000000 734 ns/op 8 B/op 1 allocs/op
BenchmarkChan-72 20000000 732 ns/op 8 B/op 1 allocs/op
BenchmarkChan-80 20000000 725 ns/op 8 B/op 1 allocs/op
ok github.com/OneOfOne/lfchan 356.899s
48-core $
BenchmarkLFChan 50000000 334 ns/op 8 B/op 1 allocs/op
BenchmarkLFChan-2 50000000 332 ns/op 8 B/op 1 allocs/op
BenchmarkLFChan-4 30000000 410 ns/op 8 B/op 1 allocs/op
BenchmarkLFChan-8 20000000 821 ns/op 8 B/op 1 allocs/op
BenchmarkLFChan-16 20000000 800 ns/op 8 B/op 1 allocs/op
BenchmarkLFChan-32 20000000 803 ns/op 8 B/op 1 allocs/op
BenchmarkLFChan-48 20000000 814 ns/op 8 B/op 1 allocs/op
BenchmarkChan 50000000 251 ns/op 8 B/op 1 allocs/op
BenchmarkChan-2 20000000 1153 ns/op 8 B/op 1 allocs/op
BenchmarkChan-4 20000000 1235 ns/op 8 B/op 1 allocs/op
BenchmarkChan-8 20000000 1172 ns/op 8 B/op 1 allocs/op
BenchmarkChan-16 20000000 999 ns/op 8 B/op 1 allocs/op
BenchmarkChan-32 20000000 921 ns/op 8 B/op 1 allocs/op
BenchmarkChan-48 20000000 895 ns/op 8 B/op 1 allocs/op
ok github.com/OneOfOne/lfchan 261.502s
The results on the AMD box look promising with that last change.
Yup. It's somewhat disconcerting that performance goes down, not up, as you add more cores, though. Theoretically, the total amount of work the system can do should increase. You can see that trend for the standard Go channels.
In all my local tests, lfchan performance goes up with the number of cores while Go channels' goes down.
I wonder if this is somewhat related to https://github.com/golang/go/issues/14406 or https://github.com/golang/go/issues/13805.
I have an idea: I'll make the tests use specialized types instead of interface{}.
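For context, a specialized element type sidesteps interface boxing: the profile earlier in the thread shows runtime.convT2E, which is the conversion that heap-allocates each value sent as interface{}. A minimal sketch of the difference (illustrative names, not lfchan's API):

```go
// Sketch: why specializing the element type helps. Sending an int as
// interface{} boxes it on the heap (runtime.convT2E, 1 alloc/op in the
// benchmarks above); a typed buffer stores it inline with no allocation.
package main

import "fmt"

// genericRing boxes every element into interface{}.
type genericRing struct{ buf []interface{} }

func (r *genericRing) put(v interface{}) { r.buf = append(r.buf, v) }

// intRing stores ints directly — no boxing, no per-send allocation.
type intRing struct{ buf []int }

func (r *intRing) put(v int) { r.buf = append(r.buf, v) }

func main() {
	g := &genericRing{}
	g.put(7) // 7 escapes to the heap via the interface conversion
	i := &intRing{}
	i.put(7) // stored inline in the slice backing array
	fmt.Println(g.buf[0], i.buf[0]) // 7 7
}
```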
@OneOfOne Even though the results on a true SMP system are not as desired, I still find your code and the general discussion fascinating. =)
At least on Go tip, any atomic.Load/Store(&intVar) causes a heap allocation of intVar, based on the escape analysis output. You can avoid this by changing the fields declared as
len int32
die uint32
to
len *int32
die *uint32
// and adding to the constructor:
len = new(int32)
die = new(uint32)
Since there are lots of tight loops in the code that use atomic.Load, this may help, or not. Worth a try.
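Applied to a channel-like struct, the suggested change might look like this sketch (field and method names are invented for illustration, not lfchan's real ones):

```go
// Sketch of the suggested pointer-field pattern: the atomically-accessed
// fields live behind pointers allocated once in the constructor, instead
// of taking the address of a struct field at every call site.
package main

import (
	"fmt"
	"sync/atomic"
)

type Chan struct {
	len *int32
	die *uint32
}

func New() *Chan {
	return &Chan{
		len: new(int32),
		die: new(uint32),
	}
}

func (c *Chan) Len() int32   { return atomic.LoadInt32(c.len) }
func (c *Chan) Closed() bool { return atomic.LoadUint32(c.die) == 1 }
func (c *Chan) Close()       { atomic.StoreUint32(c.die, 1) }

func main() {
	c := New()
	c.Close()
	fmt.Println(c.Len(), c.Closed()) // 0 true
}
```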
@diegomontoya, surely those are already on the heap? There's no point in using atomic operations on variables unless they're shared, and if they're shared they must be in the heap.
(I'm filing a bug against sync/atomic anyway, since the fix is easy and we've already done this for the runtime-internal atomics.)
@aclements You are right that the variable acted on will end up on the heap regardless. But it appears that using the *ptr method does make things faster.
package pool

import (
	"sync/atomic"
	"testing"
)

var a int64
var aPtr *int64 = new(int64)
var a32 int32
var a32Ptr *int32 = new(int32)

func BenchmarkA(b *testing.B) {
	for i := 0; i < b.N; i++ {
		atomic.LoadInt64(&a)
	}
}

func BenchmarkAPtr(b *testing.B) {
	for i := 0; i < b.N; i++ {
		atomic.LoadInt64(aPtr)
	}
}

func BenchmarkA32(b *testing.B) {
	for i := 0; i < b.N; i++ {
		atomic.LoadInt32(&a32)
	}
}

func BenchmarkA32Ptr(b *testing.B) {
	for i := 0; i < b.N; i++ {
		atomic.LoadInt32(a32Ptr)
	}
}
BenchmarkA 1000000000 1.99 ns/op 0 B/op 0 allocs/op
BenchmarkAPtr 2000000000 1.81 ns/op 0 B/op 0 allocs/op
BenchmarkA32 1000000000 2.18 ns/op 0 B/op 0 allocs/op
BenchmarkA32Ptr 1000000000 2.16 ns/op 0 B/op 0 allocs/op
On my Intel 6820HK (4-core / 8-thread) laptop, the *int64 version always comes out on top vs &int64.
@OneOfOne Strangely, int32 atomic operations are consistently slower than int64 on my Linux machine, so switching from int32 to int64 might actually improve things.
I'll give it a try later on
Following up from golang/go#8899.