coredns / coredns

CoreDNS is a DNS server that chains plugins
https://coredns.io
Apache License 2.0

coredns doesn't perform better despite having more cores #5595

Open gpl opened 2 years ago

gpl commented 2 years ago

We are running CoreDNS 1.9.3 (retrieved from the official releases on GitHub) and have been having difficulty increasing the performance of a single CoreDNS instance.

With GOMAXPROCS set to 1, we observe ~60k qps and full utilization of one core.

With GOMAXPROCS set to 2, we seem to hit a performance limit of ~90-100k qps, while almost fully utilizing two cores.

With GOMAXPROCS set to 4, we observe that coredns will use all 4 cores - but throughput does not increase, and latency seems to be the same.

With GOMAXPROCS set to 8-64, we observe the same CPU usage and throughput.

We have the following corefile:

.:55 {
  file db.example.org example.org
  cache 100
  whoami
}

db.example.org

$ORIGIN example.org.
@       3600 IN SOA sns.dns.icann.org. noc.dns.icann.org. 2017042745 7200 3600 1209600 3600
        3600 IN NS a.iana-servers.net.
        3600 IN NS b.iana-servers.net.

www     IN A     127.0.0.1
        IN AAAA  ::1

We are using dnsperf: https://github.com/DNS-OARC/dnsperf

And the following command:

  dnsperf -d test.txt -s 127.0.0.1 -p 55 -Q 10000000 -c 1 -l 10000000 -S .1 -t 8

test.txt:

www.example.com AAAA

Is there anything we could be missing?

Thanks!

johnbelamaric commented 2 years ago

Perhaps you are saturating the NIC throughput.

johnbelamaric commented 2 years ago

I guess with localhost that shouldn't be the case.

Tantalor93 commented 2 years ago

Hello, if you could collect profiling data (a CPU profile) exposed by the pprof plugin, it would greatly help the investigation @gpl

gpl commented 2 years ago

Attaching profiles for gomaxprocs 1,2,4,8,16. coredns-gomaxprocs.zip

Tantalor93 commented 2 years ago

Based on a quick look at the profiles, it seems that most of the CPU time spent serving DNS requests goes to writing responses to the client.

[pprof graph: CPU time concentrated in writing responses]

Are you running the dnsperf tool on the same machine the CoreDNS instance is running on? If so, CoreDNS might be affected by dnsperf itself, since they share resources such as UDP sockets, and the OS might have trouble servicing both, so a lot of time is spent in syscalls. But this is only a wild guess.

johnbelamaric commented 2 years ago

Could be something like that.

Generally, if giving more CPU doesn't fix it, it is because you are hitting other bottlenecks. The question is whether those are in the CoreDNS code (for example, some mutex contention or something) or in the underlying OS or hardware. In this case it looks like writing to the UDP socket. Look into tuning UDP performance on your kernel; you may want to look at your UDP write buffer sizes, for example.
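
For what it's worth, the per-socket counterpart of those kernel settings in Go is SetWriteBuffer; a minimal sketch (the port and the 4 MiB value here are only illustrative):

  package main

  import (
    "log"
    "net"
  )

  func main() {
    conn, err := net.ListenUDP("udp", &net.UDPAddr{Port: 55})
    if err != nil {
      log.Fatal(err)
    }
    defer conn.Close()
    // Request a larger kernel send buffer for this socket; the effective
    // size is still subject to the net.core.wmem_max cap.
    if err := conn.SetWriteBuffer(4 << 20); err != nil {
      log.Fatal(err)
    }
  }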

gpl commented 2 years ago

Hmm, I don't believe either of those is the issue here --

We had previously adjusted various kernel parameters and hadn't seen any significant change in performance; additionally, from our telemetry I don't believe we're seeing any issues on that front.

Notably, the following values were adjusted on all hosts involved in this test:

net.core.rmem_default=262144
net.core.wmem_default=262144
net.core.rmem_max=262144
net.core.wmem_max=262144

We've also adjusted net.ipv4.ip_local_port_range just in case, to 1024-65k.

The tests were also run from various combinations of hosts, and we observed the same results when the test client and the server were on different hosts (identical hardware).

lobshunter commented 2 years ago

With GOMAXPROCS set to 8-64, we observe the same CPU usage and throughput.

Does the same CPU usage mean CoreDNS uses up all 8-64 cores? If so, have you checked whether all of that CPU usage was from CoreDNS? For instance, other processes or system services can steal some of that CPU time.

Another idea is to measure CPU time in different categories (user, system, softirq, etc.). That can help find the bottleneck.

gpl commented 2 years ago

Sorry for the lack of clarity - CoreDNS doesn't consume more than 4-5 cores.

I rebuilt coredns with symbols and ran perf instead of pprof:

[perf flame graph: coredns-c02-flamegraph]

lobshunter commented 2 years ago

I tried off-CPU analysis; the off-CPU flame graph looks similar to perf's. With more than 4 CPUs assigned to CoreDNS, the time spent in serveUDP increased significantly. I haven't got any clue yet, though.

Lobshunter86 commented 1 year ago

I did more digging after that; it seems the bottleneck is the network I/O pattern.

CoreDNS starts 1 listener goroutine for each server instance and creates 1 new goroutine for each incoming request. So we have a single-producer (reads request packets), multi-consumer (handles requests and writes response packets) workflow.

With more CPUs assigned to the CoreDNS process, the consumers' processing speed can scale correspondingly, but the producer's cannot. And when the Corefile only uses some light plugins, the consumers' work is relatively simple, so handling a request doesn't need much CPU time. Under high load we hit the producer's limit, because it is a single goroutine and cannot utilize more than 1 CPU core.
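
A simplified sketch of that shape (just the pattern being described, not the actual miekg/dns code):

  package sketch

  import "net"

  // serveUDP is the single producer: one goroutine reads packets off the
  // socket and spawns a consumer goroutine per request. The read loop can
  // never use more than one core, regardless of GOMAXPROCS.
  func serveUDP(conn *net.UDPConn, handle func(req []byte, addr *net.UDPAddr)) error {
    buf := make([]byte, 65535)
    for {
      n, addr, err := conn.ReadFromUDP(buf)
      if err != nil {
        return err
      }
      req := make([]byte, n)
      copy(req, buf[:n])
      go handle(req, addr) // consumers scale with available cores
    }
  }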

I ran some tests with the following Corefile on my laptop:

.:55 {
  file db.example.org example.org
  cache 100
  whoami
}

.:56 {
  file db.example.org example.org
  cache 100
  whoami
}

tests:

  1. 1 dnsperf process, sent requests to :55, total QPS ~110k.
  2. 2 dnsperf processes, both sent requests to :55, total QPS ~110k.
  3. 2 dnsperf processes, sent requests to :55 and :56 respectively, total QPS ~190k (and CoreDNS's CPU usage increased significantly).

johnbelamaric commented 1 year ago

Interesting. Any proposal for improvement?

lobshunter commented 1 year ago

I could try to find a way. But I agree with the Redis team's idea: scaling horizontally is paramount, and CoreDNS scales horizontally pretty well. So it's not a critical issue that it doesn't scale vertically.

PS: @Lobshunter86 is me, too.

horahoradev commented 1 year ago

I'm also interested in this issue; I haven't contributed, but this sounds fun to work on. Could I claim this?

lobshunter commented 1 year ago

I'm also interested in this issue; I haven't contributed, but this sounds fun to work on. Could I claim this?

Please go ahead and have fun😉. I have been occupied at work recently.

horahoradev commented 1 year ago

I threw the UDP message read within miekg/dns into a goroutine pool. The results are OK (a rough sketch of the idea is at the end of this comment). Without my change:

  Queries sent:         13988687
  Queries completed:    13988288 (100.00%)
  Queries lost:         300 (0.00%)
  Queries interrupted:  99 (0.00%)

  Response codes:       NOERROR 13988288 (100.00%)
  Average packet size:  request 32, response 100
  Run time (s):         105.433637
  Queries per second:   132673.863845

  Average Latency (s):  0.000627 (min 0.000023, max 0.022370)
  Latency StdDev (s):   0.000151

CPU utilization ~420%

With my change:

  Queries sent:         5735429
  Queries completed:    5735336 (100.00%)
  Queries lost:         0 (0.00%)
  Queries interrupted:  93 (0.00%)

  Response codes:       NOERROR 5735336 (100.00%)
  Average packet size:  request 32, response 100
  Run time (s):         29.624372
  Queries per second:   193601.943697

  Average Latency (s):  0.000483 (min 0.000014, max 0.009152)
  Latency StdDev (s):   0.000336

CPU utilization ~560%

So notably the CPU utilization went up, but QPS went up ~50%, and avg latency went down semi-significantly. I'll have to do some profiling later. I wonder why the latency stddev went up :thinking: https://github.com/golang/go/issues/45886 should help if UDPConn's readmsg is the bottleneck
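
For reference, one possible shape of that reader-pool change (a hypothetical sketch, not the actual patch; `workers` and `handle` are illustrative names):

  package sketch

  import (
    "net"
    "sync"
  )

  // servePooled replaces the single reader goroutine with a fixed pool of
  // readers that all call ReadFromUDP on the same socket; *net.UDPConn is
  // safe for concurrent use by multiple goroutines.
  func servePooled(conn *net.UDPConn, workers int, handle func(req []byte, addr *net.UDPAddr)) {
    var wg sync.WaitGroup
    for i := 0; i < workers; i++ {
      wg.Add(1)
      go func() {
        defer wg.Done()
        buf := make([]byte, 65535)
        for {
          n, addr, err := conn.ReadFromUDP(buf)
          if err != nil {
            return
          }
          req := make([]byte, n)
          copy(req, buf[:n])
          handle(req, addr)
        }
      }()
    }
    wg.Wait()
  }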

lobshunter commented 1 year ago

In my understanding, https://github.com/golang/go/issues/45886 should improve the performance of long-lived UDP connections (i.e. reading a lot of data from the same UDP socket, like QUIC). Would it help a DNS workload, given that every DNS request/response belongs to a different socket?

rrrix commented 1 year ago

This CloudFlare blog post seems keenly relevant to this issue: Go, don't collect my garbage

The author describes a performance puzzle very similar to the one in the first post: namely, 1-4 cores work well, with quickly diminishing returns at higher concurrency. He achieved vastly improved performance by experimenting with Go garbage collection tuning via the GOGC environment variable (a.k.a. the runtime/debug.SetGCPercent function).

SetGCPercent sets the garbage collection target percentage: a collection is triggered when the ratio of freshly allocated data to live data remaining after the previous collection reaches this percentage. SetGCPercent returns the previous setting. The initial setting is the value of the GOGC environment variable at startup, or 100 if the variable is not set. This setting may be effectively reduced in order to maintain a memory limit. A negative percentage effectively disables garbage collection, unless the memory limit is reached. See SetMemoryLimit for more details.

Before GOGC tuning, # of goroutines vs. Ops/s: [chart]

After GOGC tuning, # of goroutines vs. Ops/s: [chart]

One caveat is that his performance benchmark only ran for 10 seconds, which may have skewed the results in unexpected ways.

The challenge is that this is highly hardware dependent, so there's no single "right answer" for the GOGC value that fits every user in every scenario.

@gpl perhaps performing some tuning of the GOGC environment variable in the same manner as the blog post above may yield positive results?
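
For illustration, the same knob can also be set from inside the process via runtime/debug (the value 400 below is only an example; the right number depends on workload and hardware):

  package main

  import (
    "fmt"
    "runtime/debug"
  )

  func main() {
    // Equivalent to starting the process with GOGC=400: a collection is
    // triggered only once the heap has grown 4x past the previous live set.
    prev := debug.SetGCPercent(400)
    fmt.Printf("GC percent was %d, now 400\n", prev)
  }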

P.S. Excellent additional/background reading on Go Garbage Collection: A Guide to the Go Garbage Collector

lobshunter commented 1 year ago

A memo: I found an interesting approach that uses SO_REUSEPORT and multiple net.ListenUDP calls. According to the author's benchmark, it outperforms the single-listen, multiple-ReadFromUDP solution.

I shall give it a try when I get time.
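
For reference, the general pattern looks roughly like this in Go (a sketch using net.ListenConfig with a Control hook; the port and socket count are illustrative):

  package main

  import (
    "context"
    "log"
    "net"
    "syscall"

    "golang.org/x/sys/unix"
  )

  func main() {
    lc := net.ListenConfig{
      // Set SO_REUSEPORT on every socket created through this config, so
      // several sockets can bind the same address and the kernel spreads
      // incoming packets across them.
      Control: func(network, address string, c syscall.RawConn) error {
        var serr error
        if err := c.Control(func(fd uintptr) {
          serr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_REUSEPORT, 1)
        }); err != nil {
          return err
        }
        return serr
      },
    }
    // Open several listeners on the same port; each would get its own
    // read loop in a real server.
    for i := 0; i < 4; i++ {
      pc, err := lc.ListenPacket(context.Background(), "udp", ":55")
      if err != nil {
        log.Fatal(err)
      }
      defer pc.Close()
    }
    select {} // keep the sockets open
  }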

iyashu commented 1 year ago

Yes @lobshunter, that is correct. I think the LWN article explains the improvements and a few caveats (esp. with TCP) of using the SO_REUSEPORT option. Last week, I validated the improvement by simply starting multiple servers on the same port (we already set that option at ListenPacket, as seen here) after making the following code changes:

diff --git a/core/dnsserver/register.go b/core/dnsserver/register.go
index 8de55906..ac581eca 100644
--- a/core/dnsserver/register.go
+++ b/core/dnsserver/register.go
@@ -3,6 +3,8 @@ package dnsserver
 import (
  "fmt"
  "net"
+ "os"
+ "strconv"
  "time"

  "github.com/coredns/caddy"
@@ -157,36 +159,43 @@ func (h *dnsContext) MakeServers() ([]caddy.Server, error) {
  }
  // then we create a server for each group
  var servers []caddy.Server
- for addr, group := range groups {
- // switch on addr
- switch tr, _ := parse.Transport(addr); tr {
- case transport.DNS:
- s, err := NewServer(addr, group)
- if err != nil {
- return nil, err
- }
- servers = append(servers, s)

- case transport.TLS:
- s, err := NewServerTLS(addr, group)
- if err != nil {
- return nil, err
- }
- servers = append(servers, s)
+ numSock, err := strconv.ParseInt(os.Getenv("NUM_SOCK"), 10, 64)
+ if err != nil {
+ numSock = 1
+ }
+ for i := 0; i < int(numSock); i++ {
+ for addr, group := range groups {
+ // switch on addr
+ switch tr, _ := parse.Transport(addr); tr {
+ case transport.DNS:
+ s, err := NewServer(addr, group)
+ if err != nil {
+ return nil, err
+ }
+ servers = append(servers, s)

- case transport.GRPC:
- s, err := NewServergRPC(addr, group)
- if err != nil {
- return nil, err
- }
- servers = append(servers, s)
+ case transport.TLS:
+ s, err := NewServerTLS(addr, group)
+ if err != nil {
+ return nil, err
+ }
+ servers = append(servers, s)

- case transport.HTTPS:
- s, err := NewServerHTTPS(addr, group)
- if err != nil {
- return nil, err
+ case transport.GRPC:
+ s, err := NewServergRPC(addr, group)
+ if err != nil {
+ return nil, err
+ }
+ servers = append(servers, s)
+
+ case transport.HTTPS:
+ s, err := NewServerHTTPS(addr, group)
+ if err != nil {
+ return nil, err
+ }
+ servers = append(servers, s)
  }
- servers = append(servers, s)
  }
  }

Essentially, I've just exposed an env var NUM_SOCK representing the number of sockets (and thereby servers) to use for serving requests. To validate the improvement, I used a Corefile similar to the one in the issue description above:

.:55 {
  file db.example.org example.org
  cache 100
  whoami
}

1. With a single listen socket, I'm able to achieve ~130K qps throughput from dnsperf on a private cloud instance.

$ NUM_SOCK=1 taskset -c 2-35 ./coredns-fix
.:55
CoreDNS-1.10.1
linux/amd64, go1.19.3
$ taskset -c 38-71 dnsperf -d test.txt -s 127.0.0.1 -p 55 -c 1000 -l 100000 -S .1 -T 16
  Queries sent:         5919568
  Queries completed:    5919470 (100.00%)
  Queries lost:         0 (0.00%)
  Queries interrupted:  98 (0.00%)

  Response codes:       NOERROR 5919470 (100.00%)
  Average packet size:  request 33, response 103
  Run time (s):         45.693927
  Queries per second:   129546.099200

  Average Latency (s):  0.000756 (min 0.000016, max 0.006743)
  Latency StdDev (s):   0.000400
CoreDNS CPU Utilization: 275%
DNS Perf CPU Utilization: 480%

2. With two listen sockets, I'm able to achieve ~235K qps throughput from dnsperf.

$ NUM_SOCK=2 taskset -c 2-35 ./coredns-fix
.:55
.:55
CoreDNS-1.10.1
linux/amd64, go1.19.3
$ ss -u -a | grep 55
UNCONN 0      0                      *:55                *:*
UNCONN 0      0                      *:55                *:*
$ taskset -c 38-71 dnsperf -d test.txt -s 127.0.0.1 -p 55 -c 1000 -l 100000 -S .1 -T 16
  Queries sent:         17760093
  Queries completed:    17759997 (100.00%)
  Queries lost:         0 (0.00%)
  Queries interrupted:  96 (0.00%)

  Response codes:       NOERROR 17759997 (100.00%)
  Average packet size:  request 33, response 103
  Run time (s):         75.404526
  Queries per second:   235529.588768

  Average Latency (s):  0.000411 (min 0.000018, max 0.006754)
  Latency StdDev (s):   0.000379
CoreDNS CPU Utilization: 570%
DNS Perf CPU Utilization: 780%

3. With 4 listen sockets, I'm able to achieve ~400K qps throughput from dnsperf.

$ NUM_SOCK=4 taskset -c 2-35 ./coredns-fix
.:55
.:55
.:55
.:55
CoreDNS-1.10.1
linux/amd64, go1.19.3
$ ss -u -a | grep 55
UNCONN 0      0                        *:55                *:*
UNCONN 0      0                        *:55                *:*
UNCONN 0      0                        *:55                *:*
UNCONN 0      0                        *:55                *:*
$ taskset -c 38-71 dnsperf -d test.txt -s 127.0.0.1 -p 55 -c 1000 -l 100000 -S .1 -T 16
  Queries sent:         20535534
  Queries completed:    20535443 (100.00%)
  Queries lost:         0 (0.00%)
  Queries interrupted:  91 (0.00%)

  Response codes:       NOERROR 20535443 (100.00%)
  Average packet size:  request 33, response 103
  Run time (s):         51.342591
  Queries per second:   399968.965337

  Average Latency (s):  0.000235 (min 0.000020, max 0.003655)
  Latency StdDev (s):   0.000197
CoreDNS CPU Utilization: 1371%
DNS Perf CPU Utilization: 1191%

So I think the bottleneck was indeed the throughput limitation of a single socket, and we are able to scale throughput almost linearly as we increase the number of listen sockets. I'll create a pull request after validating TCP traffic (non-TLS) when I get some more time. Thanks.

lobshunter commented 1 year ago

@iyashu Excellent productivity 👍.

crliu3227 commented 1 year ago

@iyashu Really looking forward to this PR

johnbelamaric commented 3 months ago

@iyashu any update here? this is pretty awesome

Shmillerov commented 1 month ago

I ran into the same problem and checked the solution suggested by @iyashu.

Yes, it works, but there are a few problems.

Performance degrades after configuration reloading

I used a similar Corefile, but with reload 3s.

.:55 {
  reload 3s
  file db.example.org example.org
  cache 100
  whoami
}

NUM_SOCK = 5 before reloading:

  Queries sent:         3055127
  Queries completed:    3055127 (100.00%)
  Queries lost:         0 (0.00%)

  Response codes:       NOERROR 3055127 (100.00%)
  Average packet size:  request 33, response 103
  Run time (s):         30.000282
  Queries per second:   101836.609403

  Average Latency (s):  0.000923 (min 0.000025, max 0.006609)
  Latency StdDev (s):   0.000226

NUM_SOCK = 5 after reloading:

  Queries sent:         2002002
  Queries completed:    2002002 (100.00%)
  Queries lost:         0 (0.00%)

  Response codes:       NOERROR 2002002 (100.00%)
  Average packet size:  request 33, response 103
  Run time (s):         30.001015
  Queries per second:   66731.142263

  Average Latency (s):  0.001337 (min 0.000019, max 0.056923)
  Latency StdDev (s):   0.001847

I also changed reload 3s to reload 4s, which shouldn't have affected the result. But it does.

I compared full goroutine stack dumps before and after reloading and I didn't find any differences. Yes, the goroutines have been restarted, but they are the same and their number is the same. There are still 5 goroutines running for UDP and 5 for TCP: goroutines-before-reload.txt goroutines-after-reload.txt

At the same time, with NUM_SOCK = 1 there is no such problem. The issue is reproducible only when multiple servers run on the same port. I also noticed that only the first reload affects performance; the second and subsequent reloads do not affect the result.

Performance degrades with NUM_SOCK > 8

I also did tests with different NUM_SOCK values on an environment with 96 CPUs and got the following results:

  NUM_SOCK  QPS     CoreDNS %CPU  DNSPerf %CPU
  1         29033   320           323
  2         50374   631           545
  3         70691   942           733
  4         95227   1128          880
  5         113599  1505          951
  6         192552  1226          999
  7         219938  1954          1023
  8         232307  1703          1009
  9         222798  2394          930
  10        211981  2573          886
  11        205937  2433          855
  12        192795  2369          823

The best result was achieved with NUM_SOCK = 8; after that the QPS began to decrease. Yes, 232307 QPS is much better than the initial 29033, but it uses only 17 cores out of 96, and increasing NUM_SOCK further leads to a decrease in performance. This means there is a bottleneck somewhere else.

I looked at the profiles with different NUM_SOCK, but did not see any obvious reasons why the performance is decreasing.

And the bottleneck is definitely not in my environment: I ran 2 replicas of CoreDNS with NUM_SOCK = 8 on the same environment, with 2 dnsperf tests in parallel, and got 280k+280k QPS, even though one replica with NUM_SOCK = 8 gave ~220k QPS. Which is actually surprising: where did another 50k (280k-230k) come from? :)

jameshartig commented 1 month ago

Rather than scaling NUM_SOCK within a single instance I would expect better performance with core pinning and a separate CoreDNS instance per core.

johnbelamaric commented 1 month ago

Rather than scaling NUM_SOCK within a single instance I would expect better performance with core pinning and a separate CoreDNS instance per core.

Sure. That's more management overhead though. And this PR wouldn't preclude doing that if you want to...

gpl commented 1 month ago

Rather than scaling NUM_SOCK within a single instance I would expect better performance with core pinning and a separate CoreDNS instance per core.

In our individual/specific case: Per 32c/64t core server, we'd need to run 32x the number of instances, which would open 32x the watches, occupy roughly 32x the memory, 32x the number of kubernetes pods, 32x the lb entries, etc. Across 100 machines, this would be 3200x the load on every component.

( It's not fully clear to me if the reuseport plugin proposed above would also have this amplification issue, though. )

Shmillerov commented 2 weeks ago

Hello :)

I described here https://github.com/coredns/coredns/issues/5595#issuecomment-2329195223 that we have an issue with performance after reloading. This problem is fixed in https://github.com/coredns/caddy/pull/6.

ss -ulpn | grep 54 helped us understand what the problem is. We found an issue in the reload logic that works badly when multiple servers run on the same port.

Single server: 29033 QPS. ss -ulpn returned 1 socket:

UNCONN 0      0                        *:55               *:*    users:(("main",pid=2381276,fd=3))

5 numsockets: 101836 QPS. ss -ulpn returned 5 sockets:

UNCONN 0      0                        *:55               *:*    users:(("main",pid=2393767,fd=3))            
UNCONN 0      0                        *:55               *:*    users:(("main",pid=2393767,fd=6))            
UNCONN 0      0                        *:55               *:*    users:(("main",pid=2393767,fd=7))            
UNCONN 0      0                        *:55               *:*    users:(("main",pid=2393767,fd=8))            
UNCONN 0      0                        *:55               *:*    users:(("main",pid=2393767,fd=9))

5 numsockets after reloading: 66731 QPS. ss -ulpn returned 1 socket with multiple fds for the same process:

UNCONN 0      0                        *:55               *:*    users:(("main",pid=2387921,fd=11),("main",pid=2387921,fd=10),("main",pid=2387921,fd=9),("main",pid=2387921,fd=8),("main",pid=2387921,fd=7))

You can see here that 5 numsockets after reloading works better than a single server, but it still uses a single socket. So I decided to check how we can improve performance without SO_REUSEPORT.

Tests

I did this on a "clean" DNS server without CoreDNS. I prepared a main.go and performed the following tests:

  1. Tests with a single socket, a single fd and multiple servers:

     1 server    2 servers   3 servers
     49327 QPS   47855 QPS   47066 QPS

So reading the same connection in parallel does not perform better.

  2. Tests with a single socket and multiple fds:

     1 fd        2 fd        3 fd        4 fd         5 fd         6 fd         7 fd
     49327 QPS   86485 QPS   99720 QPS   109206 QPS   110077 QPS   113787 QPS   114913 QPS

After 3-4 fds there is no significant further increase, but roughly 2x QPS is possible with this approach. From a code perspective it looks strange, because in Go we cannot create a socket file directly (that method is private), so we need to create a PacketConn, take its File, and create another PacketConn using the same File (a rough sketch of this is after the tables below).

  3. Tests with multiple sockets (SO_REUSEPORT). You know how this works, but let me show how it behaves in the "clean" DNS server without CoreDNS:

     1 socket    2 sockets   3 sockets    4 sockets    5 sockets    6 sockets    7 sockets    8 sockets
     49585 QPS   83933 QPS   114579 QPS   139543 QPS   167460 QPS   195156 QPS   201934 QPS   235415 QPS
  4. Tests with multiple sockets, each with multiple fds, i.e. the combination of 2 and 3. ss -ulpn | grep ":55" for 3 sockets with 3 fds looks like:

     UNCONN 0      0                        *:55               *:*    users:(("main",pid=2463974,fd=14),("main",pid=2463974,fd=12),("main",pid=2463974,fd=3))
     UNCONN 0      0                        *:55               *:*    users:(("main",pid=2463974,fd=13),("main",pid=2463974,fd=9),("main",pid=2463974,fd=6))
     UNCONN 0      0                        *:55               *:*    users:(("main",pid=2463974,fd=11),("main",pid=2463974,fd=10),("main",pid=2463974,fd=7))

     I thought I'd get more, but...

            2 sockets    3 sockets    4 sockets    5 sockets    6 sockets
     2 fd   150239 QPS   223003 QPS   214157 QPS   219193 QPS   213491 QPS
     3 fd   199485 QPS   199422 QPS   194780 QPS   -            -
     4 fd   187982 QPS   188690 QPS   -            -            -
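
Coming back to test 2, a rough sketch of how one socket can be read through several duplicated file descriptors in Go (illustrative, not my exact main.go; numFDs is a made-up parameter):

  package main

  import (
    "log"
    "net"
  )

  func main() {
    const numFDs = 3

    base, err := net.ListenUDP("udp", &net.UDPAddr{Port: 55})
    if err != nil {
      log.Fatal(err)
    }
    conns := []*net.UDPConn{base}
    for i := 1; i < numFDs; i++ {
      f, err := base.File() // dup(2)s the underlying descriptor
      if err != nil {
        log.Fatal(err)
      }
      pc, err := net.FilePacketConn(f) // dups again and wraps the copy
      if err != nil {
        log.Fatal(err)
      }
      f.Close() // the intermediate *os.File is no longer needed
      conns = append(conns, pc.(*net.UDPConn))
    }
    // Each conn (same socket, different fd) then gets its own read loop.
    log.Printf("opened %d fds on one socket", len(conns))
  }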

Conclusions

So based on tests 1 and 2, it looks like we have some problem with reading from the file. Parallel reading does not give an increase, but working with different file descriptors for the same file does. However, I do not know how concurrency works in that case. Do we have some locks while reading from the file in that case, and where?

Method 2 can be used if it is not possible to enable SO_REUSEPORT. In other cases the SO_REUSEPORT solution looks better from a code perspective and performs better. The combination of the two methods does not provide any improvement at all.

I am posting the results here for the record. Maybe someone will have ideas about what the problem might be and can figure out how to improve performance without running multiple servers on the same port and using SO_REUSEPORT.

5HT2 commented 20 hours ago

However, I do not know how concurrency works in that case. Do we have some locks while reading from file in that case and where?

Perhaps this stacktrace would be helpful @Shmillerov: https://github.com/coredns/coredns/issues/6573#issuecomment-2358900049, if I'm correctly understanding which locks you're referring to (or lack thereof? sometimes? I'd have to investigate more).

Shmillerov commented 11 hours ago

However, I do not know how concurrency works in that case. Do we have some locks while reading from file in that case and where?

Perhaps this stacktrace would be helpful @Shmillerov: #6573 (comment), if I'm correctly understanding which locks you're referring to (or lack thereof? sometimes? I'd have to investigate more).

Based on your stack trace, we have a lock for the FD: poll.(*FD).writeLock. You have a single socket file and a single FD, and the lock works fine in your case.

But in my test I have a single socket file and multiple FDs, and it looks like there are no locks between the FDs in that case. It's not clear how that approach behaves from a concurrency perspective.

By the way, we decided to use the numsockets approach. Based on this test https://github.com/coredns/coredns/pull/6882#discussion_r1818522596 we decided that it works pretty well. With numsockets and SO_REUSEPORT we have multiple sockets and a single FD per socket; the kernel distributes the load between these sockets, and we don't have any concurrency issues in that case.