miekg opened this issue 7 years ago
Some simple queryperfing to get baseline:
0.2: Queries per second: 58800.788936 qps
0.3: Queries per second: 74555.654706 qps
Using SO_REUSEADDR made no impact at all! SO_REUSEPORT even made performance worse; I don't know what I saw in #4, but I can't easily replicate it.
Profile output when profiling. All this is done against the n1-standard-4 instance: Queries per second: 51117.913849 qps
% go tool pprof pprof.coredns.localhost:6053.samples.cpu.001.pb.gz ~/pprof deb
Entering interactive mode (type "help" for commands)
(pprof) top
24650ms of 48350ms total (50.98%)
Dropped 209 nodes (cum <= 241.75ms)
Showing top 10 nodes out of 161 (cum >= 1770ms)
flat flat% sum% cum cum%
9740ms 20.14% 20.14% 10280ms 21.26% syscall.Syscall
4170ms 8.62% 28.77% 4170ms 8.62% runtime.futex
2400ms 4.96% 33.73% 5760ms 11.91% runtime.mallocgc
1980ms 4.10% 37.83% 2740ms 5.67% runtime.scanobject
1490ms 3.08% 40.91% 1490ms 3.08% runtime.epollwait
1170ms 2.42% 43.33% 3290ms 6.80% runtime.pcvalue
1040ms 2.15% 45.48% 1040ms 2.15% runtime.readvarint
1000ms 2.07% 47.55% 1000ms 2.07% runtime.usleep
860ms 1.78% 49.33% 5830ms 12.06% runtime.gentraceback
800ms 1.65% 50.98% 1770ms 3.66% github.com/coredns/coredns/vendor/github.com/miekg/dns.packDomainName
This looks pretty OK-ish.
(pprof) cum
(pprof) top10
10.57s of 48.35s total (21.86%)
Dropped 209 nodes (cum <= 0.24s)
Showing top 10 nodes out of 161 (cum >= 9.85s)
flat flat% sum% cum cum%
0 0% 0% 35.03s 72.45% runtime.goexit
0.23s 0.48% 0.48% 24.35s 50.36% github.com/coredns/coredns/vendor/github.com/miekg/dns.(*Server).serve
0.02s 0.041% 0.52% 20.41s 42.21% github.com/coredns/coredns/vendor/github.com/miekg/dns.HandlerFunc.ServeDNS
0.02s 0.041% 0.56% 20.39s 42.17% github.com/coredns/coredns/core/dnsserver.(*Server).ServePacket.func1
0.21s 0.43% 0.99% 20.21s 41.80% github.com/coredns/coredns/core/dnsserver.(*Server).ServeDNS
0.02s 0.041% 1.03% 19.21s 39.73% github.com/coredns/coredns/middleware/whoami.(*Whoami).ServeDNS
0.26s 0.54% 1.57% 19.19s 39.69% github.com/coredns/coredns/middleware/whoami.Whoami.ServeDNS
0.05s 0.1% 1.68% 13.20s 27.30% github.com/coredns/coredns/vendor/github.com/miekg/dns.(*response).WriteMsg
9.74s 20.14% 21.82% 10.28s 21.26% syscall.Syscall
0.02s 0.041% 21.86% 9.85s 20.37% github.com/coredns/coredns/vendor/github.com/miekg/dns.(*response).Write
Uploading the profile pprof.coredns.localhost:6053.samples.cpu.001.pb.gz
Looking at the web view, syscall.Syscall again stands out.
(pprof) text syscall.Syscall
10.21s of 48.35s total (21.12%)
Dropped 2 nodes (cum <= 0.24s)
Showing top 10 nodes out of 37 (cum >= 7.88s)
flat flat% sum% cum cum%
9.74s 20.14% 20.14% 10.28s 21.26% syscall.Syscall
0.20s 0.41% 20.56% 0.20s 0.41% runtime.casgstatus
0.12s 0.25% 20.81% 0.32s 0.66% runtime.reentersyscall
0.07s 0.14% 20.95% 0.21s 0.43% runtime.exitsyscall
0.07s 0.14% 21.10% 0.07s 0.14% runtime.futex
0.01s 0.021% 21.12% 0.33s 0.68% runtime.entersyscall
0 0% 21.12% 7.88s 16.30% github.com/coredns/coredns/core/dnsserver.(*Server).ServeDNS
0 0% 21.12% 2.40s 4.96% github.com/coredns/coredns/core/dnsserver.(*Server).ServePacket
0 0% 21.12% 7.88s 16.30% github.com/coredns/coredns/core/dnsserver.(*Server).ServePacket.func1
0 0% 21.12% 7.88s 16.30% github.com/coredns/coredns/middleware/whoami.(*Whoami).ServeDNS
Well, syscall.Syscall just does the syscall, and that takes time. (pprof) list syscall.Syscall shows the source, which lists the children above (runtime.casgstatus and friends). So far this is actually WAI, but it does feel like something needs to be optimized here.
There are 2 things:
Ok
(pprof) list syscall.sendmsg
Total: 48.35s
ROUTINE ======================== syscall.sendmsg in /home/miek/upstream/go/src/syscall/zsyscall_linux_amd64.go
10ms 7.89s (flat, cum) 16.32% of Total
. . 1745:}
. . 1746:
. . 1747:// THIS FILE IS GENERATED BY THE COMMAND AT THE TOP; DO NOT EDIT
. . 1748:
. . 1749:func sendmsg(s int, msg *Msghdr, flags int) (n int, err error) {
. 7.88s 1750: r0, _, e1 := Syscall(SYS_SENDMSG, uintptr(s), uintptr(unsafe.Pointer(msg)), uintptr(flags))
. . 1751: n = int(r0)
. . 1752: if e1 != 0 {
. . 1753: err = errnoErr(e1)
. . 1754: }
10ms 10ms 1755: return
. . 1756:}
. . 1757:
. . 1758:// THIS FILE IS GENERATED BY THE COMMAND AT THE TOP; DO NOT EDIT
. . 1759:
. . 1760:func mmap(addr uintptr, length uintptr, prot int, flags int, fd int, offset int64) (xaddr uintptr, err error) {
The syscall is sending the message. Makes sense, looks legit, i.e. answer 1) from above.
Ah, the slowness might be the use of defer() in my patch to Go DNS. I'll retest.
Not true, this is called once on the socket.
Not using any of the functions in udp_linux.go results in a similar profile, with a large chunk taken by Syscall:
26030ms of 52680ms total (49.41%)
Dropped 220 nodes (cum <= 263.40ms)
Showing top 10 nodes out of 156 (cum >= 2040ms)
flat flat% sum% cum cum%
9430ms 17.90% 17.90% 9890ms 18.77% syscall.Syscall
4550ms 8.64% 26.54% 4550ms 8.64% runtime.futex
2760ms 5.24% 31.78% 6890ms 13.08% runtime.mallocgc
2690ms 5.11% 36.88% 3600ms 6.83% runtime.scanobject
1600ms 3.04% 39.92% 4200ms 7.97% runtime.pcvalue
1250ms 2.37% 42.29% 1250ms 2.37% runtime.readvarint
1200ms 2.28% 44.57% 1200ms 2.28% runtime.epollwait
970ms 1.84% 46.41% 2400ms 4.56% github.com/coredns/coredns/vendor/github.com/miekg/dns.packDomainName
790ms 1.50% 47.91% 6850ms 13.00% runtime.gentraceback
790ms 1.50% 49.41% 2040ms 3.87% runtime.step
It is nice to see dns.packDomainName in there, but just writing responses takes 3x longer...
OK, completely removing OOB UDP sending does not impact performance; it hovers around 70K qps.
Going to the kernel to send these UDP packets is slow (and we do it for every packet). Now I'm starting to wonder how other UDP servers deal with this.
@johnbelamaric do you know how this is optimized in C programs?
This also explains the speedup that I saw locally (although not the 3x increase), but AFAICT Linux optimizes out the network when going through localhost, which means no kernel context switch...???
https://groups.google.com/forum/#!topic/golang-nuts/tBxagDKvrm8 mentions Go DNS (sigh) and the slowness of syscalls on Linux.
It might make sense to write a super simple UDP echo server from scratch in Go and see how that performs.
As for optimization in C, no, but let me ping some folks here and see if they do.
Thanks. The core question seems to be: how long should it take to write 200 B to the UDP socket, and is this slower than it should be in Go (or, as it stands now, Go DNS)?
This is a super simple Go UDP server: https://gist.github.com/miekg/d9bc045c89578f3cc66a214488e68227
Doing the same queryperf in GCE yields:
./bin/queryperf-linux_amd64 -d domain.lst.queryperf -l 5 -s 10.132.0.3 -p 1053
DNS Query Performance Testing Tool
Version: $Id: queryperf.c,v 1.12 2007/09/05 07:36:04 marka Exp $
[Status] Processing input data
[Status] Sending queries (beginning with 10.132.0.3)
[Status] Testing complete
Statistics:
Parse input file: multiple times
Run time limit: 5 seconds
Ran through file: 144313 times
Queries sent: 432942 queries
Queries completed: 432942 queries
Queries lost: 0 queries
Queries delayed(?): 0 queries
RTT max: 0.002241 sec
RTT min: 0.000026 sec
RTT average: 0.000200 sec
RTT std deviation: 0.000072 sec
RTT out of range: 0 queries
Percentage completed: 100.00%
Percentage lost: 0.00%
Started at: Wed May 10 09:31:13 2017
Finished at: Wed May 10 09:31:18 2017
Ran for: 5.000065 seconds
Queries per second: 86587.274365 qps
~85K qps, 20% faster than Go DNS. Nice, but not earth-shattering.
LOL with profiling this drops to "Queries per second: 72013.668522 qps"
Fetching profile from http://localhost:6060/debug/pprof/profile
Please wait... (30s)
Saved profile in /home/miek/pprof/pprof.udp.localhost:6060.samples.cpu.001.pb.gz
Entering interactive mode (type "help" for commands)
(pprof) top
28760ms of 38180ms total (75.33%)
Dropped 147 nodes (cum <= 190.90ms)
Showing top 10 nodes out of 90 (cum >= 340ms)
flat flat% sum% cum cum%
13440ms 35.20% 35.20% 14220ms 37.24% syscall.Syscall6
6070ms 15.90% 51.10% 6070ms 15.90% runtime.futex
4650ms 12.18% 63.28% 4650ms 12.18% runtime.epollwait
1630ms 4.27% 67.55% 1630ms 4.27% runtime.usleep
620ms 1.62% 69.17% 2200ms 5.76% runtime.mallocgc
590ms 1.55% 70.72% 590ms 1.55% runtime.casgstatus
560ms 1.47% 72.18% 10160ms 26.61% runtime.findrunnable
560ms 1.47% 73.65% 560ms 1.47% runtime.memclrNoHeapPointers
330ms 0.86% 74.52% 340ms 0.89% runtime.unlock
310ms 0.81% 75.33% 340ms 0.89% runtime.lock
(pprof) cum
(pprof) top10
14.22s of 38.18s total (37.24%)
Dropped 147 nodes (cum <= 0.19s)
Showing top 10 nodes out of 90 (cum >= 11.40s)
flat flat% sum% cum cum%
0.01s 0.026% 0.026% 24.36s 63.80% runtime.goexit
0.13s 0.34% 0.37% 16.05s 42.04% main.serve
0.10s 0.26% 0.63% 15.92s 41.70% net.(*UDPConn).WriteTo
0.02s 0.052% 0.68% 15.82s 41.44% net.(*UDPConn).writeTo
0.18s 0.47% 1.15% 14.96s 39.18% net.(*netFD).writeTo
13.44s 35.20% 36.35% 14.22s 37.24% syscall.Syscall6
0.05s 0.13% 36.49% 12.03s 31.51% runtime.mcall
0.20s 0.52% 37.01% 11.67s 30.57% runtime.schedule
0.05s 0.13% 37.14% 11.59s 30.36% syscall.Sendto
0.04s 0.1% 37.24% 11.40s 29.86% syscall.sendto
(pprof) disasm syscall.Syscall
Total: 38.18s
ROUTINE ======================== syscall.Syscall6
13.44s 14.22s (flat, cum) 37.24% of Total
. 520ms 477d00: CALL runtime.entersyscall(SB)
. . 477d05: MOVQ 0x10(SP), DI
. . 477d0a: MOVQ 0x18(SP), SI
. . 477d0f: MOVQ 0x20(SP), DX
. . 477d14: MOVQ 0x28(SP), R10
. . 477d19: MOVQ 0x30(SP), R8
. . 477d1e: MOVQ 0x38(SP), R9
. . 477d23: MOVQ 0x8(SP), AX
. . 477d28: SYSCALL
13.42s 13.42s 477d2a: CMPQ $-0xfff, AX
. . 477d30: JBE 0x477d52
. . 477d32: MOVQ $-0x1, 0x40(SP)
. . 477d3b: MOVQ $0x0, 0x48(SP)
. . 477d44: NEGQ AX
. . 477d47: MOVQ AX, 0x50(SP)
. . 477d4c: CALL runtime.exitsyscall(SB)
. . 477d51: RET
. . 477d52: MOVQ AX, 0x40(SP)
10ms 10ms 477d57: MOVQ DX, 0x48(SP)
. . 477d5c: MOVQ $0x0, 0x50(SP)
10ms 270ms 477d65: CALL runtime.exitsyscall(SB)
UDP echo server in C (does not set the QR bit, just echoes the bytes back): https://gist.github.com/miekg/a61d55a8ec6560ad6c4a2747b21e6128
Queryperf: 84343.155846 qps (same order of magnitude).
https://github.com/kavu/go_reuseport makes this somewhat faster in the Go case. It fluctuates in my localhost testing from 60K to 90K.
From my colleague Kevin who is a veritable performance guru:
The first question I'd ask is if they're seeing 100% CPU.
If not, then they need to see if their path to sendmsg (including any
internal mutexes) has opportunities to block.
In C, we make sure we never do I/O with a lock. Even with a single
thread though, a simple test does a hell of a lot more than that.
top - 14:09:17 up 1 day, 3:34, 2 users, load average: 1.26, 0.35, 0.12
Tasks: 132 total, 1 running, 131 sleeping, 0 stopped, 0 zombie
%Cpu0 : 17.8 us, 28.4 sy, 0.0 ni, 53.4 id, 0.0 wa, 0.0 hi, 0.3 si, 0.0 st
%Cpu1 : 18.2 us, 29.6 sy, 0.0 ni, 52.2 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 19.9 us, 25.2 sy, 0.0 ni, 37.4 id, 0.0 wa, 0.0 hi, 17.5 si, 0.0 st
%Cpu3 : 14.6 us, 31.9 sy, 0.0 ni, 53.5 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 15398136 total, 13453972 free, 216784 used, 1727380 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 14836896 avail Mem
Not CPU bound.
Following some of the Go code paths, they seem sane-ish, but all kinds of stuff is happening because of goroutine scheduling and so forth.
Thanks for the update
Does the simple echo server use goroutines? It would be interesting to see the raw performance of a single-threaded server with no goroutines.
That should be simple to do; yes, it is using multiple goroutines to write the answers. The reading is done from the main goroutine. SO_REUSEPORT could be helpful there, but initial/small-scale testing didn't show any improvements.
Geh, going completely sequential, i.e. go serve() -> serve(), yields better performance (still all in the 70-80 Kqps range, but more often higher, e.g. 90K).
This is totally on par with the C program now. On my Linode VPS both programs go up to ~90 Kqps with the queryperf as committed to this repo.
userspace <-> kernel space context switches are hard, let's go fishing?
I'm going to put NSD on these nodes tomorrow and check what it does, because I believe we should reverse the testing: clearly this 80-100 Kqps is the max this "hardware" can do. Let's optimize for that, i.e. cut the syscall.Syscall out of all the profiles and see what's left.
This depends a bit on my NSD testing; if that yields 2x the qps, something is very wrong somewhere.
NSD is even more nonsensical and drops replies, so I get only 140 qps. Setup: NSD 4 (apt-get install), nsd.conf:
server:
port: 1053
server-count: 1
ip4-only: yes
zonesdir: "/home/miek/zones"
pidfile: "/home/miek/nsd-pid"
database: "/home/miek/nsd.db"
username: miek
remote-control:
control-enable: no
zone:
name: example.org
zonefile: "/home/miek/zones/example.org"
nsd running with nsd -c nsd.conf -d
I think https://github.com/coredns/perf-tests/issues/7#issuecomment-300608263 is the most sensible way forward if we want to optimize performance.
So the basic idea is this: get a UDP baseline with udpserv.go and/or the C program that does the same, benchmark CoreDNS on the same hardware, see how far off we are --> optimize.
Hello @miekg
Did you ever find time to finish this up?
Edit: never mind, I missed the other open issues, seems like you moved on to using different setups, thanks for all the details!
nope :(
I've started playing around in GCE.
3 machines:
All installed with Ubuntu-1703. CoreDNS is 007 (release from github).