miekg opened this issue 7 years ago
Some simple queryperfing to get baseline:
0.2: Queries per second: 58800.788936 qps
0.3: Queries per second: 74555.654706 qps
Using SO_REUSEADDR made no impact at all! SO_REUSEPORT even made performance worse; I don't know what I saw in #4, but I can't easily replicate it.
Profile output when profiling. All this is done against the n1-standard-4 instance: Queries per second: 51117.913849 qps
% go tool pprof pprof.coredns.localhost:6053.samples.cpu.001.pb.gz ~/pprof deb
Entering interactive mode (type "help" for commands)
(pprof) top
24650ms of 48350ms total (50.98%)
Dropped 209 nodes (cum <= 241.75ms)
Showing top 10 nodes out of 161 (cum >= 1770ms)
flat flat% sum% cum cum%
9740ms 20.14% 20.14% 10280ms 21.26% syscall.Syscall
4170ms 8.62% 28.77% 4170ms 8.62% runtime.futex
2400ms 4.96% 33.73% 5760ms 11.91% runtime.mallocgc
1980ms 4.10% 37.83% 2740ms 5.67% runtime.scanobject
1490ms 3.08% 40.91% 1490ms 3.08% runtime.epollwait
1170ms 2.42% 43.33% 3290ms 6.80% runtime.pcvalue
1040ms 2.15% 45.48% 1040ms 2.15% runtime.readvarint
1000ms 2.07% 47.55% 1000ms 2.07% runtime.usleep
860ms 1.78% 49.33% 5830ms 12.06% runtime.gentraceback
800ms 1.65% 50.98% 1770ms 3.66% github.com/coredns/coredns/vendor/github.com/miekg/dns.packDomainName
This looks pretty OK-ish.
(pprof) cum
(pprof) top10
10.57s of 48.35s total (21.86%)
Dropped 209 nodes (cum <= 0.24s)
Showing top 10 nodes out of 161 (cum >= 9.85s)
flat flat% sum% cum cum%
0 0% 0% 35.03s 72.45% runtime.goexit
0.23s 0.48% 0.48% 24.35s 50.36% github.com/coredns/coredns/vendor/github.com/miekg/dns.(*Server).serve
0.02s 0.041% 0.52% 20.41s 42.21% github.com/coredns/coredns/vendor/github.com/miekg/dns.HandlerFunc.ServeDNS
0.02s 0.041% 0.56% 20.39s 42.17% github.com/coredns/coredns/core/dnsserver.(*Server).ServePacket.func1
0.21s 0.43% 0.99% 20.21s 41.80% github.com/coredns/coredns/core/dnsserver.(*Server).ServeDNS
0.02s 0.041% 1.03% 19.21s 39.73% github.com/coredns/coredns/middleware/whoami.(*Whoami).ServeDNS
0.26s 0.54% 1.57% 19.19s 39.69% github.com/coredns/coredns/middleware/whoami.Whoami.ServeDNS
0.05s 0.1% 1.68% 13.20s 27.30% github.com/coredns/coredns/vendor/github.com/miekg/dns.(*response).WriteMsg
9.74s 20.14% 21.82% 10.28s 21.26% syscall.Syscall
0.02s 0.041% 21.86% 9.85s 20.37% github.com/coredns/coredns/vendor/github.com/miekg/dns.(*response).Write
Uploading the profile pprof.coredns.localhost:6053.samples.cpu.001.pb.gz
Looking at the web view, syscall.Syscall again stands out.
(pprof) text syscall.Syscall
10.21s of 48.35s total (21.12%)
Dropped 2 nodes (cum <= 0.24s)
Showing top 10 nodes out of 37 (cum >= 7.88s)
flat flat% sum% cum cum%
9.74s 20.14% 20.14% 10.28s 21.26% syscall.Syscall
0.20s 0.41% 20.56% 0.20s 0.41% runtime.casgstatus
0.12s 0.25% 20.81% 0.32s 0.66% runtime.reentersyscall
0.07s 0.14% 20.95% 0.21s 0.43% runtime.exitsyscall
0.07s 0.14% 21.10% 0.07s 0.14% runtime.futex
0.01s 0.021% 21.12% 0.33s 0.68% runtime.entersyscall
0 0% 21.12% 7.88s 16.30% github.com/coredns/coredns/core/dnsserver.(*Server).ServeDNS
0 0% 21.12% 2.40s 4.96% github.com/coredns/coredns/core/dnsserver.(*Server).ServePacket
0 0% 21.12% 7.88s 16.30% github.com/coredns/coredns/core/dnsserver.(*Server).ServePacket.func1
0 0% 21.12% 7.88s 16.30% github.com/coredns/coredns/middleware/whoami.(*Whoami).ServeDNS
Well, syscall.Syscall just does the syscall, and that takes time. (pprof) list syscall.Syscall shows the source, which lists the children above (runtime.casgstatus and friends). So far this is actually WAI, but it does feel like something needs to be optimized here.
There are 2 things:
Ok
(pprof) list syscall.sendmsg
Total: 48.35s
ROUTINE ======================== syscall.sendmsg in /home/miek/upstream/go/src/syscall/zsyscall_linux_amd64.go
10ms 7.89s (flat, cum) 16.32% of Total
. . 1745:}
. . 1746:
. . 1747:// THIS FILE IS GENERATED BY THE COMMAND AT THE TOP; DO NOT EDIT
. . 1748:
. . 1749:func sendmsg(s int, msg *Msghdr, flags int) (n int, err error) {
. 7.88s 1750: r0, _, e1 := Syscall(SYS_SENDMSG, uintptr(s), uintptr(unsafe.Pointer(msg)), uintptr(flags))
. . 1751: n = int(r0)
. . 1752: if e1 != 0 {
. . 1753: err = errnoErr(e1)
. . 1754: }
10ms 10ms 1755: return
. . 1756:}
. . 1757:
. . 1758:// THIS FILE IS GENERATED BY THE COMMAND AT THE TOP; DO NOT EDIT
. . 1759:
. . 1760:func mmap(addr uintptr, length uintptr, prot int, flags int, fd int, offset int64) (xaddr uintptr, err error) {
The syscall is sending the message. Makes sense, looks legit, i.e. answer 1) from above.
Ah, the slowness might be the use of defer() in my patch to Go DNS. I'll retest.
Not true, this is called once on the socket.
Not using any of the functions in udp_linux.go results in a similar profile, with a large chunk taken by Syscall:
26030ms of 52680ms total (49.41%)
Dropped 220 nodes (cum <= 263.40ms)
Showing top 10 nodes out of 156 (cum >= 2040ms)
flat flat% sum% cum cum%
9430ms 17.90% 17.90% 9890ms 18.77% syscall.Syscall
4550ms 8.64% 26.54% 4550ms 8.64% runtime.futex
2760ms 5.24% 31.78% 6890ms 13.08% runtime.mallocgc
2690ms 5.11% 36.88% 3600ms 6.83% runtime.scanobject
1600ms 3.04% 39.92% 4200ms 7.97% runtime.pcvalue
1250ms 2.37% 42.29% 1250ms 2.37% runtime.readvarint
1200ms 2.28% 44.57% 1200ms 2.28% runtime.epollwait
970ms 1.84% 46.41% 2400ms 4.56% github.com/coredns/coredns/vendor/github.com/miekg/dns.packDomainName
790ms 1.50% 47.91% 6850ms 13.00% runtime.gentraceback
790ms 1.50% 49.41% 2040ms 3.87% runtime.step
It is nice to see dns.packDomainName in there, but just writing responses takes 3x longer...
OK, completely removing OOB UDP sending does not impact performance; it hovers around 70K qps.
Going to the kernel to send these UDP packets is slow (and we do it for every packet). Now I'm starting to wonder how other UDP servers deal with this.
@johnbelamaric do you know how this is optimized in C programs?
This also explains the speedup that I saw locally (although not the 3x increase), but AFAICT Linux optimizes out the network when going through localhost, which means no kernel context switch...???
https://groups.google.com/forum/#!topic/golang-nuts/tBxagDKvrm8 mentions Go DNS (sigh) and the slowness of syscalls on Linux.
It might make sense to write a super simple UDP echo server from scratch in Go and see how that performs.
As for optimization in C, no, but let me ping some folks here and see if they do.
Thanks. The core question seems to be: how long should it take to write 200 B to the UDP socket, and is this slower than it should be in Go (or, as it stands now, Go DNS)?
This is a super simple Go UDP server: https://gist.github.com/miekg/d9bc045c89578f3cc66a214488e68227
Doing the same queryperf in GCE yields:
./bin/queryperf-linux_amd64 -d domain.lst.queryperf -l 5 -s 10.132.0.3 -p 1053
DNS Query Performance Testing Tool
Version: $Id: queryperf.c,v 1.12 2007/09/05 07:36:04 marka Exp $
[Status] Processing input data
[Status] Sending queries (beginning with 10.132.0.3)
[Status] Testing complete
Statistics:
Parse input file: multiple times
Run time limit: 5 seconds
Ran through file: 144313 times
Queries sent: 432942 queries
Queries completed: 432942 queries
Queries lost: 0 queries
Queries delayed(?): 0 queries
RTT max: 0.002241 sec
RTT min: 0.000026 sec
RTT average: 0.000200 sec
RTT std deviation: 0.000072 sec
RTT out of range: 0 queries
Percentage completed: 100.00%
Percentage lost: 0.00%
Started at: Wed May 10 09:31:13 2017
Finished at: Wed May 10 09:31:18 2017
Ran for: 5.000065 seconds
Queries per second: 86587.274365 qps
~85K qps, 20% faster than Go DNS. Nice, but not earth-shattering.
LOL with profiling this drops to "Queries per second: 72013.668522 qps"
Fetching profile from http://localhost:6060/debug/pprof/profile
Please wait... (30s)
Saved profile in /home/miek/pprof/pprof.udp.localhost:6060.samples.cpu.001.pb.gz
Entering interactive mode (type "help" for commands)
(pprof) top
28760ms of 38180ms total (75.33%)
Dropped 147 nodes (cum <= 190.90ms)
Showing top 10 nodes out of 90 (cum >= 340ms)
flat flat% sum% cum cum%
13440ms 35.20% 35.20% 14220ms 37.24% syscall.Syscall6
6070ms 15.90% 51.10% 6070ms 15.90% runtime.futex
4650ms 12.18% 63.28% 4650ms 12.18% runtime.epollwait
1630ms 4.27% 67.55% 1630ms 4.27% runtime.usleep
620ms 1.62% 69.17% 2200ms 5.76% runtime.mallocgc
590ms 1.55% 70.72% 590ms 1.55% runtime.casgstatus
560ms 1.47% 72.18% 10160ms 26.61% runtime.findrunnable
560ms 1.47% 73.65% 560ms 1.47% runtime.memclrNoHeapPointers
330ms 0.86% 74.52% 340ms 0.89% runtime.unlock
310ms 0.81% 75.33% 340ms 0.89% runtime.lock
(pprof) cum
(pprof) top10
14.22s of 38.18s total (37.24%)
Dropped 147 nodes (cum <= 0.19s)
Showing top 10 nodes out of 90 (cum >= 11.40s)
flat flat% sum% cum cum%
0.01s 0.026% 0.026% 24.36s 63.80% runtime.goexit
0.13s 0.34% 0.37% 16.05s 42.04% main.serve
0.10s 0.26% 0.63% 15.92s 41.70% net.(*UDPConn).WriteTo
0.02s 0.052% 0.68% 15.82s 41.44% net.(*UDPConn).writeTo
0.18s 0.47% 1.15% 14.96s 39.18% net.(*netFD).writeTo
13.44s 35.20% 36.35% 14.22s 37.24% syscall.Syscall6
0.05s 0.13% 36.49% 12.03s 31.51% runtime.mcall
0.20s 0.52% 37.01% 11.67s 30.57% runtime.schedule
0.05s 0.13% 37.14% 11.59s 30.36% syscall.Sendto
0.04s 0.1% 37.24% 11.40s 29.86% syscall.sendto
(pprof) disasm syscall.Syscall
Total: 38.18s
ROUTINE ======================== syscall.Syscall6
13.44s 14.22s (flat, cum) 37.24% of Total
. 520ms 477d00: CALL runtime.entersyscall(SB)
. . 477d05: MOVQ 0x10(SP), DI
. . 477d0a: MOVQ 0x18(SP), SI
. . 477d0f: MOVQ 0x20(SP), DX
. . 477d14: MOVQ 0x28(SP), R10
. . 477d19: MOVQ 0x30(SP), R8
. . 477d1e: MOVQ 0x38(SP), R9
. . 477d23: MOVQ 0x8(SP), AX
. . 477d28: SYSCALL
13.42s 13.42s 477d2a: CMPQ $-0xfff, AX
. . 477d30: JBE 0x477d52
. . 477d32: MOVQ $-0x1, 0x40(SP)
. . 477d3b: MOVQ $0x0, 0x48(SP)
. . 477d44: NEGQ AX
. . 477d47: MOVQ AX, 0x50(SP)
. . 477d4c: CALL runtime.exitsyscall(SB)
. . 477d51: RET
. . 477d52: MOVQ AX, 0x40(SP)
10ms 10ms 477d57: MOVQ DX, 0x48(SP)
. . 477d5c: MOVQ $0x0, 0x50(SP)
10ms 270ms 477d65: CALL runtime.exitsyscall(SB)
UDP echo server in C (does not set the QR bit, just echoes the bytes back): https://gist.github.com/miekg/a61d55a8ec6560ad6c4a2747b21e6128
Queryperf: 84343.155846 qps (same order of magnitude).
https://github.com/kavu/go_reuseport makes this somewhat faster in the Go case. It fluctuates in my localhost testing from 60K to 90K.
From my colleague Kevin who is a veritable performance guru:
The first question I'd ask is if they're seeing 100% CPU.
If not, then they need to see if their path to sendmsg (including any
internal mutexes) has opportunities to block.
In C, we make sure we never do I/O with a lock. Even with a single
thread though, a simple test does a hell of a lot more than that.
top - 14:09:17 up 1 day, 3:34, 2 users, load average: 1.26, 0.35, 0.12
Tasks: 132 total, 1 running, 131 sleeping, 0 stopped, 0 zombie
%Cpu0 : 17.8 us, 28.4 sy, 0.0 ni, 53.4 id, 0.0 wa, 0.0 hi, 0.3 si, 0.0 st
%Cpu1 : 18.2 us, 29.6 sy, 0.0 ni, 52.2 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 19.9 us, 25.2 sy, 0.0 ni, 37.4 id, 0.0 wa, 0.0 hi, 17.5 si, 0.0 st
%Cpu3 : 14.6 us, 31.9 sy, 0.0 ni, 53.5 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 15398136 total, 13453972 free, 216784 used, 1727380 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 14836896 avail Mem
Not CPU bound.
Following some of the Go code paths, they seem sane-ish, but all kinds of stuff is happening because of goroutine scheduling and so forth.
Thanks for the update
Does the simple echo server use goroutines? It would be interesting to see the raw performance of a single-threaded server with no goroutines.
That should be simple to do; yes, it is using multiple goroutines to write the answers. The reading is done from the main goroutine. SO_REUSEPORT could be helpful there, but initial/small-scale testing didn't show any improvements.
Geh, going completely sequential, i.e. go serve() -> serve(), yields better performance (still all in the 70-80 Kqps range, but more often higher, e.g. 90K).
This is totally on par with the C program now. On my Linode VPS both programs go up to ~90 Kqps with the queryperf as committed to this repo.
userspace <-> kernel space context switches are hard, let's go fishing?
I'm going to put NSD on these nodes tomorrow and check what it does, because I believe we should reverse the testing: clearly this 80-100 Kqps is the max this "hardware" can do. Let's optimize for that, i.e. cut the syscall.Syscall out of all the profiles and see what's left.
This depends a bit on my NSD testing; if that yields 2x the qps, something is very wrong somewhere.
NSD is even more nonsensical and drops replies, so I get only 140 qps. Setup: NSD 4 (apt-get install), nsd.conf:
server:
port: 1053
server-count: 1
ip4-only: yes
zonesdir: "/home/miek/zones"
pidfile: "/home/miek/nsd-pid"
database: "/home/miek/nsd.db"
username: miek
remote-control:
control-enable: no
zone:
name: example.org
zonefile: "/home/miek/zones/example.org"
nsd running with nsd -c nsd.conf -d
I think https://github.com/coredns/perf-tests/issues/7#issuecomment-300608263 is the most sensible way forward if we want to optimize performance.
So the basic idea is this: get a UDP baseline with udpserv.go and/or the C program that does the same, benchmark CoreDNS on the same hardware, see how far off we are --> optimize.
Hello @miekg
Did you ever find time to finish this up?
Edit: never mind, I missed the other open issues, seems like you moved on to using different setups, thanks for all the details!
nope :(
I've started playing around in GCE.
3 machines:
All installed with Ubuntu-1703. CoreDNS is 007 (release from github).