Open gpl opened 2 years ago
Perhaps you are saturating the NIC throughput.
I guess with localhost that shouldn't be the case.
Hello, if you could collect profiling data (a CPU profile) exposed by the pprof plugin, it would greatly benefit the investigation. @gpl
Attaching profiles for GOMAXPROCS 1, 2, 4, 8, 16: coredns-gomaxprocs.zip
Based on a quick look at the profiles, it seems that most of the CPU time serving DNS requests was spent writing responses to the client.
Are you running the dnsperf tool on the same machine as the CoreDNS instance? If so, CoreDNS might be affected by dnsperf, since they share resources such as UDP sockets, and the OS might have trouble providing them to CoreDNS, so a lot of time is spent in syscalls, but this is only my wild guess.
Could be something like that.
Generally, if giving it more CPU doesn't fix it, you are hitting other bottlenecks. The question is whether those are in the CoreDNS code (for example, some mutex contention or something similar), or in the underlying OS or hardware. In this case it looks like writing to the UDP socket. Look into tuning UDP performance on your kernel; you may want to look at your UDP write buffer sizes, for example.
Hmm, I don't believe either of those is the issue here:
We had previously attempted to adjust a number of kernel parameters and haven't seen any significant deviation in performance; additionally, from our telemetry I don't believe we're seeing any issues on that front.
Notably, the following values were adjusted on all hosts involved in this test:
net.core.rmem_default=262144
net.core.wmem_default=262144
net.core.rmem_max=262144
net.core.wmem_max=262144
We've also adjusted net.ipv4.ip_local_port_range just in case, to 1024-65k.
The tests were also run from various combinations of hosts, and we observed the same results when the test client and server were on different hosts (identical hardware).
With GOMAXPROCS set to 8-64, we observe the same CPU usage and throughput.
Does the same CPU usage mean CoreDNS uses up all 8-64 cores? If so, have you checked whether that CPU usage was all from CoreDNS? For instance, other processes or system services can steal some of that CPU time.
Another idea is to measure CPU time in different categories (user, system, softirq, etc.). That can be helpful to find the bottleneck.
Sorry for the lack of clarity - CoreDNS doesn't consume more than 4-5 cores.
I rebuilt CoreDNS with symbols and ran perf instead of pprof.
I tried off-CPU analysis; the off-CPU flame graph looks similar to perf's. With more than 4 CPUs assigned to CoreDNS, time spent in serveUDP increased significantly. I haven't got any clue yet, though.
I did more digging after that; it seems the bottleneck is the network I/O pattern.
CoreDNS starts 1 listener goroutine for each server instance, and creates 1 new goroutine for each new request. So we have a single-producer (reads request packets), multi-consumer (handles requests and writes response packets) workflow.
With more CPUs assigned to the CoreDNS process, the consumers' processing speed can scale correspondingly, but the producer's cannot. And when the Corefile only uses light plugins, the consumers' work is relatively simple, so handling a request doesn't need much CPU time. Under high load we hit the producer's limit, because it has only 1 goroutine and cannot utilize more than 1 core.
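To make the pattern concrete, here is a minimal sketch of such a single-producer, multi-consumer UDP serve loop. This is not the actual CoreDNS or miekg/dns code; handle() is a hypothetical stand-in for parsing the query and running the plugin chain.

package main

import (
    "log"
    "net"
)

// handle is a hypothetical stand-in for parsing the query and running the plugin chain.
func handle(pkt []byte) []byte { return pkt }

func main() {
    pc, err := net.ListenPacket("udp", ":55")
    if err != nil {
        log.Fatal(err)
    }
    buf := make([]byte, 65535)
    for {
        // Single producer: only this goroutine reads from the socket,
        // so the read path cannot use more than roughly one core.
        n, addr, err := pc.ReadFrom(buf)
        if err != nil {
            continue
        }
        pkt := append([]byte(nil), buf[:n]...)
        // One consumer goroutine per request: handling and the response
        // write scale with GOMAXPROCS, but the read loop does not.
        go func(pkt []byte, addr net.Addr) {
            pc.WriteTo(handle(pkt), addr)
        }(pkt, addr)
    }
}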
I ran some tests with the following Corefile on my laptop:
.:55 {
file db.example.org example.org
cache 100
whoami
}
.:56 {
file db.example.org example.org
cache 100
whoami
}
tests:
- :55, total QPS ~110k.
- :55, total QPS ~110k.
- :55 and :56 respectively, total QPS ~190k (and CoreDNS's CPU usage increased significantly).
Interesting. Any proposal for improvement?
I could try to find a way. But I do agree with the idea of the Redis team: scaling horizontally is paramount, and CoreDNS can scale horizontally pretty well. So it's not a critical issue that it doesn't scale vertically.
PS: @Lobshunter86 is me, too.
I'm also interested in this issue; I haven't contributed, but this sounds fun to work on. Could I claim this?
Please go ahead and have fun😉. I have been occupied at work recently.
I threw the UDP message read within miekg/dns into a goroutine pool. The results are OK. Without my change:
Queries sent: 13988687
Queries completed: 13988288 (100.00%)
Queries lost: 300 (0.00%)
Queries interrupted: 99 (0.00%)
Response codes: NOERROR 13988288 (100.00%)
Average packet size: request 32, response 100
Run time (s): 105.433637
Queries per second: 132673.863845
Average Latency (s): 0.000627 (min 0.000023, max 0.022370)
Latency StdDev (s): 0.000151
CPU utilization ~420%
With my change:
Queries sent: 5735429
Queries completed: 5735336 (100.00%)
Queries lost: 0 (0.00%)
Queries interrupted: 93 (0.00%)
Response codes: NOERROR 5735336 (100.00%)
Average packet size: request 32, response 100
Run time (s): 29.624372
Queries per second: 193601.943697
Average Latency (s): 0.000483 (min 0.000014, max 0.009152)
Latency StdDev (s): 0.000336
CPU utilization ~560%
So notably the CPU utilization went up, but QPS went up ~50%, and average latency went down semi-significantly. I'll have to do some profiling later. I wonder why the latency stddev went up :thinking:
https://github.com/golang/go/issues/45886 should help if UDPConn's readmsg is the bottleneck.
In my understanding, https://github.com/golang/go/issues/45886 should improve the performance of long-lived UDP connections (i.e. reading a bunch of data from the same UDP socket, like QUIC). Would it help a DNS workload, since every DNS request-response belongs to a different socket?
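For illustration, here is a minimal sketch of the "goroutine pool of UDP readers" idea described above: a fixed number of goroutines all calling ReadFrom on the same PacketConn. This is not the actual miekg/dns change; handle() and the pool size are hypothetical.

package main

import (
    "log"
    "net"
)

// handle is a hypothetical stand-in for parsing the query and running the plugin chain.
func handle(pkt []byte) []byte { return pkt }

func main() {
    pc, err := net.ListenPacket("udp", ":55")
    if err != nil {
        log.Fatal(err)
    }
    const readers = 8 // pool size; net.PacketConn is safe for concurrent use
    for i := 0; i < readers; i++ {
        go func() {
            buf := make([]byte, 65535)
            for {
                n, addr, err := pc.ReadFrom(buf)
                if err != nil {
                    continue
                }
                pkt := append([]byte(nil), buf[:n]...)
                go func() { pc.WriteTo(handle(pkt), addr) }()
            }
        }()
    }
    select {} // block forever
}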
This Cloudflare blog post seems keenly relevant to this issue: Go, don't collect my garbage
The author describes a performance puzzle very similar to the one described in the first post: 1-4 cores work well, with quickly diminishing returns at higher concurrency. He achieved vastly improved performance by experimenting with Go garbage collection tuning via the GOGC environment variable (a.k.a. the runtime/debug.SetGCPercent function).
SetGCPercent sets the garbage collection target percentage: a collection is triggered when the ratio of freshly allocated data to live data remaining after the previous collection reaches this percentage. SetGCPercent returns the previous setting. The initial setting is the value of the GOGC environment variable at startup, or 100 if the variable is not set. This setting may be effectively reduced in order to maintain a memory limit. A negative percentage effectively disables garbage collection, unless the memory limit is reached. See SetMemoryLimit for more details.
Before GOGC tuning, # of goroutines vs. Ops/s: [chart]
After GOGC tuning, # of goroutines vs. Ops/s: [chart]
One caveat is that his performance benchmark only ran for 10 seconds, which may have skewed the results in unexpected ways.
The challenge with this is that it's highly hardware-dependent, so there's no one "right answer" for setting the GOGC value that would fit every user in every scenario.
@gpl perhaps performing some tuning of the GOGC environment variable in the same manner as the blog post above may yield positive results?
P.S. Excellent additional/background reading on Go Garbage Collection: A Guide to the Go Garbage Collector
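As a concrete shape for such an experiment, GOGC can be set in the environment before starting the process, or adjusted at runtime via runtime/debug.SetGCPercent. A minimal sketch follows; the value 200 is only an illustrative assumption, since the right number is workload- and hardware-dependent, as noted above.

package main

import (
    "log"
    "runtime/debug"
)

func main() {
    // Raise the GC target so a collection is triggered only after the heap has
    // grown to roughly 3x the live data from the previous cycle (GOGC=200),
    // trading memory for fewer GC cycles. Starting the process with GOGC=200
    // in the environment has the same effect.
    prev := debug.SetGCPercent(200)
    log.Printf("GC target percentage changed from %d to 200", prev)

    // ... start the DNS server / benchmark workload here ...
}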
A memo: I found an interesting approach that uses SO_REUSEPORT and multiple net.ListenUDP calls. According to the author's benchmark, it outperforms the solution of a single listen with multiple ReadFromUDP calls.
I shall give it a try when I get time.
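For reference, a minimal sketch of what the SO_REUSEPORT approach looks like in Go, using net.ListenConfig's Control hook and golang.org/x/sys/unix to set the option before bind. This is a standalone toy that only echoes packets, not CoreDNS code; as the next comment notes, CoreDNS already sets this option in its ListenPacket path.

package main

import (
    "context"
    "log"
    "net"
    "syscall"

    "golang.org/x/sys/unix"
)

// reusePortListen opens one UDP socket with SO_REUSEPORT set, so several such
// sockets can be bound to the same address and the kernel spreads incoming
// packets across them.
func reusePortListen(addr string) (net.PacketConn, error) {
    lc := net.ListenConfig{
        Control: func(network, address string, c syscall.RawConn) error {
            var serr error
            if err := c.Control(func(fd uintptr) {
                serr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_REUSEPORT, 1)
            }); err != nil {
                return err
            }
            return serr
        },
    }
    return lc.ListenPacket(context.Background(), "udp", addr)
}

func main() {
    // Four independent sockets on the same port, each with its own read loop.
    for i := 0; i < 4; i++ {
        pc, err := reusePortListen(":55")
        if err != nil {
            log.Fatal(err)
        }
        go func(pc net.PacketConn) {
            buf := make([]byte, 65535)
            for {
                n, addr, err := pc.ReadFrom(buf)
                if err != nil {
                    continue
                }
                pc.WriteTo(buf[:n], addr) // echo; a real server would run the plugin chain here
            }
        }(pc)
    }
    select {} // block forever
}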
Yes @lobshunter, that is correct. I think the LWN article explains the improvements and a few caveats (esp. with TCP) of using the SO_REUSEPORT option. Last week, I validated the improvements by simply starting multiple servers on the same port (we already set the above option at ListenPacket, as seen here) after making the following code changes:
diff --git a/core/dnsserver/register.go b/core/dnsserver/register.go
index 8de55906..ac581eca 100644
--- a/core/dnsserver/register.go
+++ b/core/dnsserver/register.go
@@ -3,6 +3,8 @@ package dnsserver
import (
"fmt"
"net"
+ "os"
+ "strconv"
"time"
"github.com/coredns/caddy"
@@ -157,36 +159,43 @@ func (h *dnsContext) MakeServers() ([]caddy.Server, error) {
}
// then we create a server for each group
var servers []caddy.Server
- for addr, group := range groups {
- // switch on addr
- switch tr, _ := parse.Transport(addr); tr {
- case transport.DNS:
- s, err := NewServer(addr, group)
- if err != nil {
- return nil, err
- }
- servers = append(servers, s)
- case transport.TLS:
- s, err := NewServerTLS(addr, group)
- if err != nil {
- return nil, err
- }
- servers = append(servers, s)
+ numSock, err := strconv.ParseInt(os.Getenv("NUM_SOCK"), 10, 64)
+ if err != nil {
+ numSock = 1
+ }
+ for i := 0; i < int(numSock); i++ {
+ for addr, group := range groups {
+ // switch on addr
+ switch tr, _ := parse.Transport(addr); tr {
+ case transport.DNS:
+ s, err := NewServer(addr, group)
+ if err != nil {
+ return nil, err
+ }
+ servers = append(servers, s)
- case transport.GRPC:
- s, err := NewServergRPC(addr, group)
- if err != nil {
- return nil, err
- }
- servers = append(servers, s)
+ case transport.TLS:
+ s, err := NewServerTLS(addr, group)
+ if err != nil {
+ return nil, err
+ }
+ servers = append(servers, s)
- case transport.HTTPS:
- s, err := NewServerHTTPS(addr, group)
- if err != nil {
- return nil, err
+ case transport.GRPC:
+ s, err := NewServergRPC(addr, group)
+ if err != nil {
+ return nil, err
+ }
+ servers = append(servers, s)
+
+ case transport.HTTPS:
+ s, err := NewServerHTTPS(addr, group)
+ if err != nil {
+ return nil, err
+ }
+ servers = append(servers, s)
}
- servers = append(servers, s)
}
}
Essentially, I've just exposed an env var NUM_SOCK representing the number of sockets (and thereby servers) one wants to use for serving requests. For validating the improvements, I've used a similar Corefile to the one in the issue description above:
.:55 {
file db.example.org example.org
cache 100
whoami
}
1. With a single listen socket, I'm able to achieve ~130K qps throughput from dnsperf on a private cloud instance.
$ NUM_SOCK=1 taskset -c 2-35 ./coredns-fix
.:55
CoreDNS-1.10.1
linux/amd64, go1.19.3
$ taskset -c 38-71 dnsperf -d test.txt -s 127.0.0.1 -p 55 -c 1000 -l 100000 -S .1 -T 16
Queries sent: 5919568
Queries completed: 5919470 (100.00%)
Queries lost: 0 (0.00%)
Queries interrupted: 98 (0.00%)
Response codes: NOERROR 5919470 (100.00%)
Average packet size: request 33, response 103
Run time (s): 45.693927
Queries per second: 129546.099200
Average Latency (s): 0.000756 (min 0.000016, max 0.006743)
Latency StdDev (s): 0.000400
CoreDNS CPU Utilization: 275%
DNS Perf CPU Utilization: 480%
2. With two listen sockets, I'm able to achieve ~235K qps throughput from dnsperf.
$ NUM_SOCK=2 taskset -c 2-35 ./coredns-fix
.:55
.:55
CoreDNS-1.10.1
linux/amd64, go1.19.3
$ ss -u -a | grep 55
UNCONN 0 0 *:55 *:*
UNCONN 0 0 *:55 *:*
$ taskset -c 38-71 dnsperf -d test.txt -s 127.0.0.1 -p 55 -c 1000 -l 100000 -S .1 -T 16
Queries sent: 17760093
Queries completed: 17759997 (100.00%)
Queries lost: 0 (0.00%)
Queries interrupted: 96 (0.00%)
Response codes: NOERROR 17759997 (100.00%)
Average packet size: request 33, response 103
Run time (s): 75.404526
Queries per second: 235529.588768
Average Latency (s): 0.000411 (min 0.000018, max 0.006754)
Latency StdDev (s): 0.000379
CoreDNS CPU Utilization: 570%
DNS Perf CPU Utilization: 780%
3. With 4 listen sockets, I'm able to achieve ~400K qps throughput from dnsperf.
$ NUM_SOCK=4 taskset -c 2-35 ./coredns-fix
.:55
.:55
.:55
.:55
CoreDNS-1.10.1
linux/amd64, go1.19.3
$ ss -u -a | grep 55
UNCONN 0 0 *:55 *:*
UNCONN 0 0 *:55 *:*
UNCONN 0 0 *:55 *:*
UNCONN 0 0 *:55 *:*
$ taskset -c 38-71 dnsperf -d test.txt -s 127.0.0.1 -p 55 -c 1000 -l 100000 -S .1 -T 16
Queries sent: 20535534
Queries completed: 20535443 (100.00%)
Queries lost: 0 (0.00%)
Queries interrupted: 91 (0.00%)
Response codes: NOERROR 20535443 (100.00%)
Average packet size: request 33, response 103
Run time (s): 51.342591
Queries per second: 399968.965337
Average Latency (s): 0.000235 (min 0.000020, max 0.003655)
Latency StdDev (s): 0.000197
CoreDNS CPU Utilization: 1371%
DNS Perf CPU Utilization: 1191%
So, I think the bottleneck was indeed the throughput limitation of a single socket, and we are able to scale throughput almost linearly as we increase the number of listen sockets. I'll create a pull request after validating TCP traffic (non-TLS based) when I get some more time. Thanks.
@iyashu Excellent productivity 👍.
@iyashu Really looking forward to this PR
@iyashu any update here? This is pretty awesome
I ran into the same problem and checked the solution suggested by @iyashu.
Yes, it works, but there are a few problems.
I used a similar Corefile, but with reload 3s.
.:55 {
reload 3s
file db.example.org example.org
cache 100
whoami
}
NUM_SOCK = 5 before reloading:
Queries sent: 3055127
Queries completed: 3055127 (100.00%)
Queries lost: 0 (0.00%)
Response codes: NOERROR 3055127 (100.00%)
Average packet size: request 33, response 103
Run time (s): 30.000282
Queries per second: 101836.609403
Average Latency (s): 0.000923 (min 0.000025, max 0.006609)
Latency StdDev (s): 0.000226
NUM_SOCK = 5 after reloading:
Queries sent: 2002002
Queries completed: 2002002 (100.00%)
Queries lost: 0 (0.00%)
Response codes: NOERROR 2002002 (100.00%)
Average packet size: request 33, response 103
Run time (s): 30.001015
Queries per second: 66731.142263
Average Latency (s): 0.001337 (min 0.000019, max 0.056923)
Latency StdDev (s): 0.001847
I changed reload 3s to reload 4s, so the reload itself shouldn't have affected the result. But it does.
I compared full goroutine stack dumps before and after reloading and I didn't find any differences. Yes, the goroutines have been restarted, but they are the same and their number is the same. There are still 5 goroutines running for UDP and 5 for TCP: goroutines-before-reload.txt goroutines-after-reload.txt
At the same time, with NUM_SOCK = 1, there is no such problem. The issue is reproducible only if we have multiple servers running on the same port. I also noticed that only the first reload affects performance; subsequent reloads do not affect the result.
I also did tests with different NUM_SOCK values on an environment with 96 CPUs and got the following results:

| NUM_SOCK | QPS | CoreDNS %CPU | DNSPerf %CPU |
|---|---|---|---|
| 1 | 29033 | 320 | 323 |
| 2 | 50374 | 631 | 545 |
| 3 | 70691 | 942 | 733 |
| 4 | 95227 | 1128 | 880 |
| 5 | 113599 | 1505 | 951 |
| 6 | 192552 | 1226 | 999 |
| 7 | 219938 | 1954 | 1023 |
| 8 | 232307 | 1703 | 1009 |
| 9 | 222798 | 2394 | 930 |
| 10 | 211981 | 2573 | 886 |
| 11 | 205937 | 2433 | 855 |
| 12 | 192795 | 2369 | 823 |
The best result was achieved with NUM_SOCK = 8; after that, the QPS began to decrease. Yes, 232307 QPS is much better than the initial 29033, but it takes only 17 cores out of 96, and increasing NUM_SOCK further leads to a decrease in performance. This means that there is a bottleneck somewhere else.
I looked at the profiles with different NUM_SOCK values, but did not see any obvious reason why the performance decreases.
And the bottleneck is definitely not in my environment: I ran 2 replicas of CoreDNS with NUM_SOCK = 8 on the same environment and ran 2 dnsperf tests in parallel. I got 280k+280k QPS, despite the fact that one replica with NUM_SOCK = 8 gave 220k QPS. Which is actually surprising. Where did another 50k (280k-230k) come from? :)
Rather than scaling NUM_SOCK within a single instance I would expect better performance with core pinning and a separate CoreDNS instance per core.
Sure. That's more management overhead though. And this PR wouldn't preclude doing that if you want to...
Rather than scaling NUM_SOCK within a single instance I would expect better performance with core pinning and a separate CoreDNS instance per core.
In our individual/specific case: per 32c/64t server, we'd need to run 32x the number of instances, which would open 32x the watches, occupy roughly 32x the memory, create 32x the number of Kubernetes pods, 32x the LB entries, etc. Across 100 machines, this would be 3200x the load on every component.
(It's not fully clear to me if the reuseport plugin proposed above would also have this amplification issue, though.)
Hello :)
I described here https://github.com/coredns/coredns/issues/5595#issuecomment-2329195223 that we have an issue with performance after reloading. This problem is fixed by https://github.com/coredns/caddy/pull/6.
ss -ulpn | grep 54 helped us understand what the problem is. We found an issue with the reloading logic that misbehaves when multiple servers run on the same port.
Single server: 29033 QPS. ss -ulpn returned 1 socket:
UNCONN 0 0 *:55 *:* users:(("main",pid=2381276,fd=3))
5 numsockets: 101836 QPS. ss -ulpn returned 5 sockets:
UNCONN 0 0 *:55 *:* users:(("main",pid=2393767,fd=3))
UNCONN 0 0 *:55 *:* users:(("main",pid=2393767,fd=6))
UNCONN 0 0 *:55 *:* users:(("main",pid=2393767,fd=7))
UNCONN 0 0 *:55 *:* users:(("main",pid=2393767,fd=8))
UNCONN 0 0 *:55 *:* users:(("main",pid=2393767,fd=9))
5 numsockets after reloading: 66731 QPS. ss -ulpn returned 1 socket with multiple fds for the same process:
UNCONN 0 0 *:55 *:* users:(("main",pid=2387921,fd=11),("main",pid=2387921,fd=10),("main",pid=2387921,fd=9),("main",pid=2387921,fd=8),("main",pid=2387921,fd=7))
You can see here that "5 numsockets after reloading" works better than "Single server", but still uses a single socket. So I decided to check how we can improve performance without SO_REUSEPORT.
I did it on a "clean" DNS server without CoreDNS. I prepared a main.go and performed the following tests.
1. Several servers reading the same connection in parallel:
| 1 server | 2 servers | 3 servers |
|---|---|---|
| 49327 QPS | 47855 QPS | 47066 QPS |
So, if we read the same connection in parallel, it doesn't perform better.
2. Multiple file descriptors for the same socket:
| 1 fd | 2 fd | 3 fd | 4 fd | 5 fd | 6 fd | 7 fd |
|---|---|---|---|---|---|---|
| 49327 QPS | 86485 QPS | 99720 QPS | 109206 QPS | 110077 QPS | 113787 QPS | 114913 QPS |
After 3-4 fds it does not give a significant increase, but roughly 2x QPS is possible with this approach. From a code perspective it looks strange: in Go we cannot create the socket file descriptor directly (that method is private), so we need to create a PacketConn, take its File, and create another PacketConn using the same File.
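For illustration, a sketch of that "one socket, several file descriptors" trick, assuming a hypothetical serve() read loop: (*net.UDPConn).File dup(2)s the descriptor, and net.FilePacketConn wraps a further duplicate in a new PacketConn.

package main

import (
    "log"
    "net"
)

// serve is a hypothetical read/handle/write loop for one PacketConn.
func serve(pc net.PacketConn) {
    buf := make([]byte, 65535)
    for {
        n, addr, err := pc.ReadFrom(buf)
        if err != nil {
            continue
        }
        pc.WriteTo(buf[:n], addr) // echo; stands in for real handling
    }
}

func main() {
    // One listening UDP socket, read through several duplicated file descriptors.
    base, err := net.ListenPacket("udp", ":55")
    if err != nil {
        log.Fatal(err)
    }
    conns := []net.PacketConn{base}
    for i := 0; i < 3; i++ {
        f, err := base.(*net.UDPConn).File() // dup(2)s the underlying fd
        if err != nil {
            log.Fatal(err)
        }
        dup, err := net.FilePacketConn(f) // new PacketConn over the same socket
        if err != nil {
            log.Fatal(err)
        }
        f.Close() // FilePacketConn holds its own duplicate
        conns = append(conns, dup)
    }
    for _, pc := range conns {
        go serve(pc)
    }
    select {} // block forever
}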
3. Multiple sockets (SO_REUSEPORT):
| 1 socket | 2 sockets | 3 sockets | 4 sockets | 5 sockets | 6 sockets | 7 sockets | 8 sockets |
|---|---|---|---|---|---|---|---|
| 49585 QPS | 83933 QPS | 114579 QPS | 139543 QPS | 167460 QPS | 195156 QPS | 201934 QPS | 235415 QPS |
4. A combination of both: multiple SO_REUSEPORT sockets, each read through multiple file descriptors (here 3 sockets with 3 fds each):
UNCONN 0 0 *:55 *:* users:(("main",pid=2463974,fd=14),("main",pid=2463974,fd=12),("main",pid=2463974,fd=3))
UNCONN 0 0 *:55 *:* users:(("main",pid=2463974,fd=13),("main",pid=2463974,fd=9),("main",pid=2463974,fd=6))
UNCONN 0 0 *:55 *:* users:(("main",pid=2463974,fd=11),("main",pid=2463974,fd=10),("main",pid=2463974,fd=7))
I thought I'd get more, but...
| | 2 sockets | 3 sockets | 4 sockets | 5 sockets | 6 sockets |
|---|---|---|---|---|---|
| 2 fd | 150239 QPS | 223003 QPS | 214157 QPS | 219193 QPS | 213491 QPS |
| 3 fd | 199485 QPS | 199422 QPS | 194780 QPS | - | - |
| 4 fd | 187982 QPS | 188690 QPS | - | - | - |
So, based on 1 and 2, it looks like we have some problems with reading from the file. Parallel reading does not give an increase, but working with different file descriptors for the same file does. However, I do not know how concurrency works in that case. Do we have some locks while reading from the file, and where?
Method 2 can be used if it is not possible to enable SO_REUSEPORT. In other cases, the solution with SO_REUSEPORT looks better from a code perspective and performs better. The combination of these methods does not provide any improvement at all.
I am posting the results here for history. Maybe someone will have some ideas about what the problem might be and can figure out how to improve performance without running multiple servers on the same port and using SO_REUSEPORT.
However, I do not know how concurrency works in that case. Do we have some locks while reading from file in that case and where?
Perhaps this stacktrace would be helpful @Shmillerov: https://github.com/coredns/coredns/issues/6573#issuecomment-2358900049, if I'm correctly understanding which locks you're referring to (or lack thereof? sometimes? I'd have to investigate more).
Based on your stacktrace, we have a lock for the FD: poll.(*FD).writeLock. You have a single socket file and a single FD, and the lock works fine for your case.
But in my test I have a single socket file and multiple FDs, and it looks like we don't have any locks between the FDs in that case. It's not clear how this approach behaves from a concurrency perspective.
By the way, we decided to use the numsockets approach. Based on this test https://github.com/coredns/coredns/pull/6882#discussion_r1818522596 we decided that it works pretty well. With numsockets and SO_REUSEPORT we have multiple sockets and a single FD for each socket; the kernel distributes the load between these sockets, and we don't have any concurrency issues in that case.
We are running CoreDNS 1.9.3 (retrieved from the official releases on GitHub), and have been having difficulty increasing the performance of a single instance of CoreDNS.
With GOMAXPROCS set to 1, we observe ~60k qps and full utilization of one core.
With GOMAXPROCS set to 2, we seem to hit a performance limit of ~90-100k qps, but it consumes almost entirely two cores.
With GOMAXPROCS set to 4, we observe that coredns will use all 4 cores - but throughput does not increase, and latency seems to be the same.
With GOMAXPROCS set to 8-64, we observe the same CPU usage and throughput.
We have the following Corefile:
.:55 {
file db.example.org example.org
cache 100
whoami
}
db.example.org
$ORIGIN example.org.
@ 3600 IN SOA sns.dns.icann.org. noc.dns.icann.org. 2017042745 7200 3600 1209600 3600
  3600 IN NS a.iana-servers.net.
  3600 IN NS b.iana-servers.net.
www IN A 127.0.0.1
    IN AAAA ::1
We are using dnsperf: https://github.com/DNS-OARC/dnsperf
And the following command:
dnsperf -d test.txt -s 127.0.0.1 -p 55 -Q 10000000 -c 1 -l 10000000 -S .1 -t 8
test.txt:
www.example.com AAAA
Is there anything we could be missing?
Thanks!