NLnetLabs / unbound

Unbound is a validating, recursive, and caching DNS resolver.
https://nlnetlabs.nl/unbound
BSD 3-Clause "New" or "Revised" License

Unbound process sporadically returns TOO MANY Servfail and Read/Write errors at different load levels. #1105

Open maintain3r opened 1 month ago

maintain3r commented 1 month ago

- Unbound version installed: 1.13.1-1ubuntu5.5
- unbound runs as a regular service (not as a Docker container)
- no packet drops are detected on the unbound host
- verbosity level is set to 5

The tool used to test unbound: dnspyre. The command used:

```
dnspyre -c 100 -d 60s --max=20ms -s 172.31.28.217 https://raw.githubusercontent.com/Tantalor93/dnspyre/master/data/10000-domains
```

Interestingly, when I take the domain names that were failing and resolve them while the testing tool is not running, they resolve properly without an issue.

unbound.conf:

```
server:
    verbosity: 5
    statistics-cumulative: yes
    extended-statistics: yes
    num-threads: 4
    interface: 0.0.0.0
    port: 53
    prefer-ip6: no
    outgoing-range: 8192
    outgoing-port-permit: 5354
    so-rcvbuf: 8m
    so-sndbuf: 8m
    so-reuseport: yes
    ip-transparent: no
    ip-freebind: yes
    max-udp-size: 4096
    msg-cache-size: 256m
    msg-cache-slabs: 8
    num-queries-per-thread: 4096
    rrset-cache-size: 640m
    rrset-cache-slabs: 8
    cache-min-ttl: 300
    cache-max-ttl: 86400
    cache-max-negative-ttl: 300
    infra-host-ttl: 60
    infra-cache-slabs: 8
    infra-cache-numhosts: 100000
    do-ip4: yes
    do-ip6: no
    do-udp: yes
    do-tcp: yes
    use-systemd: no
    do-daemonize: no
    access-control: 192.168.0.0/16 allow
    access-control: 172.16.0.0/12 allow
    access-control: 10.0.0.0/8 allow
    access-control: 127.0.0.0/8 allow
    username: "unbound"
    directory: "/etc/unbound"
    use-syslog: no
    log-identity: "unbound"
    log-time-ascii: yes
    log-queries: no
    log-replies: yes
    log-tag-queryreply: yes
    pidfile: "/var/run/unbound.pid"
    root-hints: "/var/lib/unbound/root.hints"
    hide-identity: yes
    hide-version: yes
    hide-trustanchor: yes
    identity: ""
    version: ""
    harden-glue: yes
    qname-minimisation: yes
    use-caps-for-id: yes
    do-not-query-localhost: no
    prefetch: yes
    deny-any: yes
    rrset-roundrobin: yes
    minimal-responses: yes
    val-clean-additional: yes
    serve-expired: yes
    val-log-level: 2
    key-cache-size: 10m
    key-cache-slabs: 8
    neg-cache-size: 1m
    ratelimit: 0
    ip-ratelimit: 0

remote-control:
    control-enable: yes
    control-use-cert: no
    control-interface: 127.0.0.1
    control-port: 8953
    server-key-file: "/etc/unbound/unbound_server.key"
    server-cert-file: "/etc/unbound/unbound_server.pem"
    control-key-file: "/etc/unbound/unbound_control.key"
    control-cert-file: "/etc/unbound/unbound_control.pem"

forward-zone:
    name: "."
    forward-first: yes
    forward-addr: 169.254.169.253@53  # aws provided vpc dns server
    forward-addr: 1.1.1.1@53
    forward-addr: 8.8.8.8@53
```

Testing results:

```
Total requests:         280881
Read/Write errors:      244061
DNS success responses:  34141
DNS negative responses: 1900
DNS error responses:    779

DNS response codes: NOERROR: 35141, SERVFAIL: 779, NXDOMAIN: 900

DNS question types: A: 280881
```

Running dnspyre locally against 127.0.0.1 (unbound has a listener on this IP). Using 10 concurrent requests didn't change much; still too many errors.

```
root@ip-172-31-28-217:/etc/unbound# dnspyre -c 10 -d 60s --max=20ms -s 127.0.0.1 https://raw.githubusercontent.com/Tantalor93/dnspyre/master/data/10000-domains
Using 10000 hostnames
Benchmarking 127.0.0.1:53 via udp with 10 concurrent requests

Total requests:         12844
Read/Write errors:      1134
DNS success responses:  10610
DNS negative responses: 950
DNS error responses:    150

DNS response codes: NOERROR: 10960, SERVFAIL: 150, NXDOMAIN: 600

DNS question types: A: 12844
```

Unbound runs on Ubuntu 22.04.4 LTS; RAM: 4 GB; CPU: 2 cores (AWS t3.medium instance). Changing the instance type does not change much! CPU usage is ~30-40%.

wcawijngaards commented 1 month ago

The setting use-caps-for-id: yes could be the issue; try use-caps-for-id: no. If there is a fallback, it needs a lot of additional queries, and this option is not commonly used, so I think it causes load and possibly also failures.
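For context, use-caps-for-id implements the "DNS 0x20" scheme: the resolver randomizes the letter case of the outgoing query name and only accepts replies that echo the exact mixed-case name, as an extra anti-spoofing check. A minimal sketch of the idea (function names here are illustrative, not unbound's internals):

```python
import random

def encode_0x20(qname: str, rng: random.Random) -> str:
    """Randomize the case of each letter in a query name (digits and dots unchanged)."""
    return "".join(c.upper() if rng.random() < 0.5 else c.lower() for c in qname)

def answer_matches(sent_qname: str, echoed_qname: str) -> bool:
    """A reply is accepted only if it echoes the exact mixed-case name that was sent."""
    return sent_qname == echoed_qname

rng = random.Random(0)
sent = encode_0x20("example.com", rng)
# A faithful echo is accepted; the randomization never changes the name itself.
assert answer_matches(sent, sent)
assert sent.lower() == "example.com"
```

Some upstreams do not preserve query-name case, which is what forces the extra fallback queries mentioned above.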

With log-servfail: yes, unbound would print out the reason for each SERVFAIL that happens. That would give a clue pointing in the direction of the cause.
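For reference, that is a single server-section option in unbound.conf (available in the 1.13.1 version in use here):

```
server:
    # Log the reason for each SERVFAIL answer sent to clients.
    log-servfail: yes
```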

With num-threads: 4 but only 2 CPU cores on the host, I would expect num-threads: 2 to be the correct choice. I would not expect that to cause this outcome, but it may be interesting.
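As a sketch of that change: the unbound documentation suggests slab counts be a power of 2 close to num-threads, so reducing the threads would also suggest reducing the 8-slab settings in the posted config (values here are illustrative, not a tested recommendation):

```
server:
    num-threads: 2
    msg-cache-slabs: 2
    rrset-cache-slabs: 2
    infra-cache-slabs: 2
    key-cache-slabs: 2
```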

The so-rcvbuf and so-sndbuf settings of 8m are large, and I wonder if the 4 GB host runs out of memory on the many requests that you cause it to queue up for recursion. If the socket buffers exhaust memory, the recursor cannot create more socket buffers, and that could cause failures.
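A back-of-the-envelope way to look at the memory pressure: with so-reuseport: yes and num-threads: 4, unbound opens one listening socket per thread, each of which the kernel may grow to the configured 8m receive plus 8m send buffer, on top of the configured cache sizes (which unbound's documentation warns can roughly double in practice due to allocator overhead). A hedged sketch of that arithmetic, assuming a single listening interface as in the posted config:

```python
MB = 1024 * 1024

# Figures taken from the posted unbound.conf (assumption: one listening interface).
num_threads = 4                      # so-reuseport: yes -> one socket per thread
so_rcvbuf = 8 * MB
so_sndbuf = 8 * MB
caches = {
    "msg-cache-size": 256 * MB,
    "rrset-cache-size": 640 * MB,
    "key-cache-size": 10 * MB,
    "neg-cache-size": 1 * MB,
}

socket_buffers = num_threads * (so_rcvbuf + so_sndbuf)
configured_cache = sum(caches.values())
# unbound's docs note real usage can be roughly double the configured cache sizes.
pessimistic_cache = 2 * configured_cache

print(f"kernel listener buffers (max): {socket_buffers / MB:.0f} MB")    # 64 MB
print(f"configured caches:             {configured_cache / MB:.0f} MB")  # 907 MB
print(f"pessimistic cache usage:       {pessimistic_cache / MB:.0f} MB") # 1814 MB
```

Even the pessimistic total fits in 4 GB, so if memory is the culprit it would more plausibly come from the kernel buffers on the thousands of outgoing sockets the benchmark forces open, which this sketch does not count.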

maintain3r commented 1 month ago

Thanks @wcawijngaards, I'm going to try your suggestions and will get back with the results. For so-rcvbuf and so-sndbuf, what values should I use, and how do I calculate a proper value? Should I create a bigger instance with more RAM?

wcawijngaards commented 1 month ago

I do not know a value calculation for them. Perhaps leave them at the default, or use 64k for a smaller buffer size and less memory consumption, since the test involves opening thousands of sockets.

maintain3r commented 1 month ago

Taken from the unbound official doc page:

> Set so-rcvbuf to a larger value (4m or 8m) for a busy server. This sets the kernel buffer larger so that no messages are lost in spikes in the traffic. Adds extra 9s to the reply-reliability percentage. The OS caps it at a maximum, on linux unbound needs root permission to bypass the limit, or the admin can use sysctl net.core.rmem_max. On BSD change kern.ipc.maxsockbuf in /etc/sysctl.conf.
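Per that doc, on Linux the requested buffer size is silently capped by the kernel unless the cap is raised. A sketch of the corresponding /etc/sysctl.conf entries, assuming you keep the 8m values from the posted config (8388608 bytes = 8m):

```
# /etc/sysctl.conf -- raise the kernel caps so a non-root unbound
# can actually obtain the requested 8m socket buffers.
net.core.rmem_max = 8388608
net.core.wmem_max = 8388608
```

Apply with `sysctl -p`, or set the values at runtime with `sysctl -w net.core.rmem_max=8388608` (and likewise for wmem_max).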