Open maintain3r opened 3 months ago
The setting use-caps-for-id: yes could be the issue; try use-caps-for-id: no. If a fallback is triggered, it needs a lot of additional queries, and this option is not commonly used, so I think it causes load and possibly also failures.
With log-servfail: yes, it would print out which SERVFAILs happen. That would give a clue pointing toward the cause.
With num-threads: 4 but only 2 CPU cores on the host, I would expect num-threads: 2 to be the correct choice. I would not expect that to cause the outcome, but it may be interesting.
The so-rcvbuf and so-sndbuf settings of 8m are large, and I wonder if the 4G host runs out of memory on the many requests that you cause it to queue up for recursion. It may run out of memory for the socket buffers, so that the recursor cannot create more socket buffers, and this causes failures, perhaps.
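As a rough sketch, the first three suggestions above could be applied to the server: clause like this. The option names are taken from this thread; the values are only the ones proposed here as a diagnostic step, not a verified fix.

```conf
server:
    # Turn off 0x20 case randomisation to rule out its fallback queries as a source of load.
    use-caps-for-id: no
    # Log a line for each SERVFAIL, with the reason, to see what is actually failing.
    log-servfail: yes
    # Match the thread count to the 2 CPU cores of the host.
    num-threads: 2
```

After editing, running unbound-checkconf and restarting the service would be needed before repeating the dnspyre test.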
Thanks @wcawijngaards, I'm going to try your suggestions and will get back with the results. For so-rcvbuf and so-sndbuf, what should I use, and how do I calculate a proper value? Should I create a bigger instance with more RAM?
I do not know a value calculation for them. Perhaps leave them at default. Or 64k for less buffer size but also less memory consumption, since the test involves opening thousands of sockets.
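A minimal sketch of the smaller-buffer option mentioned above, assuming the rest of the server: clause stays unchanged. The 64k figure is the value suggested in this thread, not a calculated optimum.

```conf
server:
    # Smaller kernel socket buffers: less memory used, at the cost of less headroom in traffic spikes.
    so-rcvbuf: 64k
    so-sndbuf: 64k
    # Alternative: delete both lines to fall back to the defaults.
```

Comparing the dnspyre error counts with 8m, 64k, and the defaults would show whether the buffer size matters on this host at all.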
Taken from unbound official doc page: Set so-rcvbuf to a larger value (4m or 8m) for a busy server. This sets the kernel buffer larger so that no messages are lost in spikes in the traffic. Adds extra 9s to the reply-reliability percentage. The OS caps it at a maximum, on linux unbound needs root permission to bypass the limit, or the admin can use sysctl net.core.rmem_max. On BSD change kern.ipc.maxsockbuf in /etc/sysctl.conf.
Unbound version installed: 1.13.1-1ubuntu5.5
Unbound runs as a regular service (not as a Docker container).
No packet drops are detected on the Unbound host.
Verbosity level is set to 5.
The tool used to test unbound: dnspyre
The command used to test unbound:
    dnspyre -c 100 -d 60s --max=20ms -s 172.31.28.217 https://raw.githubusercontent.com/Tantalor93/dnspyre/master/data/10000-domains
Interestingly, when I take the domain names that were failing and try to resolve them while the testing tool is not running, they resolve properly without an issue.
unbound.conf:

    server:
        verbosity: 5
        statistics-cumulative: yes
        extended-statistics: yes
        num-threads: 4
        interface: 0.0.0.0
        port: 53
        prefer-ip6: no
        outgoing-range: 8192
        outgoing-port-permit: 5354
        so-rcvbuf: 8m
        so-sndbuf: 8m
        so-reuseport: yes
        ip-transparent: no
        ip-freebind: yes
        max-udp-size: 4096
        msg-cache-size: 256m
        msg-cache-slabs: 8
        num-queries-per-thread: 4096
        rrset-cache-size: 640m
        rrset-cache-slabs: 8
        cache-min-ttl: 300
        cache-max-ttl: 86400
        cache-max-negative-ttl: 300
        infra-host-ttl: 60
        infra-cache-slabs: 8
        infra-cache-numhosts: 100000
        do-ip4: yes
        do-ip6: no
        do-udp: yes
        do-tcp: yes
        use-systemd: no
        do-daemonize: no
        access-control: 192.168.0.0/16 allow
        access-control: 172.16.0.0/12 allow
        access-control: 10.0.0.0/8 allow
        access-control: 127.0.0.0/8 allow
        username: "unbound"
        directory: "/etc/unbound"
        use-syslog: no
        log-identity: "unbound"
        log-time-ascii: yes
        log-queries: no
        log-replies: yes
        log-tag-queryreply: yes
        pidfile: "/var/run/unbound.pid"
        root-hints: "/var/lib/unbound/root.hints"
        hide-identity: yes
        hide-version: yes
        hide-trustanchor: yes
        identity: ""
        version: ""
        harden-glue: yes
        qname-minimisation: yes
        use-caps-for-id: yes
        do-not-query-localhost: no
        prefetch: yes
        deny-any: yes
        rrset-roundrobin: yes
        minimal-responses: yes
        val-clean-additional: yes
        serve-expired: yes
        val-log-level: 2
        key-cache-size: 10m
        key-cache-slabs: 8
        neg-cache-size: 1m
        ratelimit: 0
        ip-ratelimit: 0

    remote-control:
        control-enable: yes
        control-use-cert: no
        control-interface: 127.0.0.1
        control-port: 8953
        server-key-file: "/etc/unbound/unbound_server.key"
        server-cert-file: "/etc/unbound/unbound_server.pem"
        control-key-file: "/etc/unbound/unbound_control.key"
        control-cert-file: "/etc/unbound/unbound_control.pem"

    forward-zone:
        name: "."
        forward-first: yes
        forward-addr: 169.254.169.253@53 # aws provided vpc dns server
        forward-addr: 1.1.1.1@53
        forward-addr: 8.8.8.8@53
Testing results

    Total requests: 280881
    Read/Write errors: 244061
    DNS success responses: 34141
    DNS negative responses: 1900
    DNS error responses: 779

    DNS response codes:
        NOERROR: 35141
        SERVFAIL: 779
        NXDOMAIN: 900

    DNS question types:
        A: 280881
Running dnspyre locally against 127.0.0.1 (unbound has a listener on this IP). Using 10 concurrent requests changed almost nothing; there are still too many errors.

    root@ip-172-31-28-217:/etc/unbound# dnspyre -c 10 -d 60s --max=20ms -s 127.0.0.1 https://raw.githubusercontent.com/Tantalor93/dnspyre/master/data/10000-domains
    Using 10000 hostnames
    Benchmarking 127.0.0.1:53 via udp with 10 concurrent requests

    Total requests: 12844
    Read/Write errors: 1134
    DNS success responses: 10610
    DNS negative responses: 950
    DNS error responses: 150

    DNS response codes:
        NOERROR: 10960
        SERVFAIL: 150
        NXDOMAIN: 600

    DNS question types:
        A: 12844
Unbound runs on Ubuntu 22.04.4 LTS
RAM: 4GB
CPU: 2 cores
Host type: AWS t3.medium
Changing the instance type does not change a lot!
CPU usage is ~30-40%