NLnetLabs / unbound

Unbound is a validating, recursive, and caching DNS resolver.
https://nlnetlabs.nl/unbound
BSD 3-Clause "New" or "Revised" License
3.16k stars 360 forks source link

log-servfail fails to log host SERVFAIL responses in Unbound 1.19.2 on Ubuntu 24.04.1 LTS #1193

Open shubhhq opened 3 days ago

shubhhq commented 3 days ago

Describe the bug On Ubuntu 24.04.1 LTS Noble, log-servfail is not working for SERVFAIL queries, while log-replies continues to function normally.

To reproduce

Expected behavior SERVFAIL responses should be logged when log-servfail is enabled.

System:

Configure line: --build=aarch64-linux-gnu --prefix=/usr --includedir=${prefix}/include --mandir=${prefix}/share/man --infodir=${prefix}/share/info --sysconfdir=/etc --localstatedir=/var --disable-option-checking --disable-silent-rules --libdir=${prefix}/lib/aarch64-linux-gnu --runstatedir=/run --disable-maintainer-mode --disable-dependency-tracking --with-pythonmodule --with-pyunbound --enable-subnet --enable-dnstap --enable-systemd --enable-cachedb --with-libhiredis --with-libnghttp2 --with-chroot-dir= --with-dnstap-socket-path=/run/dnstap.sock --disable-rpath --with-pidfile=/run/unbound.pid --with-libevent --enable-tfo-client --with-rootkey-file=/usr/share/dns/root.key --disable-flto --enable-tfo-server Linked libs: libevent 2.1.12-stable (it uses epoll), OpenSSL 3.0.13 30 Jan 2024 Linked modules: dns64 python cachedb subnetcache respip validator iterator TCP Fastopen feature available

BSD licensed, see LICENSE in source package for details. Report bugs to unbound-bugs@nlnetlabs.nl or https://github.com/NLnetLabs/unbound/issues



**Additional information**
1. log-replies works correctly for all verbosity levels.
2. SERVFAIL errors for CNAME records do generate error logs, but this behavior is inconsistent across our infrastructure.
3. The issue persists even after adjusting system parameters like net.core.wmem_max and net.core.rmem_max.
4. We've verified that log files are accessible and writable, since log-replies is working fine.
5. We've tested with different verbosity levels for logging.

We've attempted the following troubleshooting steps:

- Increased net.core.wmem_max and net.core.rmem_max values.
- Restarted the Unbound service after making changes.

We would appreciate any guidance on how to further diagnose or resolve this issue. Please let me know if you need any additional information or logs to assist in troubleshooting.
gthess commented 3 days ago

log-servfail does work for me here in both 1.19.2 and the latest version. Is this perhaps some particular servfail case that does not get logged? Could you share some more information about that and maybe the configuration file (or the relevant bits at least)?

shubhhq commented 3 days ago

Yes. On further debugging we found out that we have

    serve-expired: yes
    serve-expired-ttl: 3600            # one hour, in seconds - max time till expired records are served
    serve-expired-client-timeout: 2000  # 2 seconds, in milliseconds - Query timeout after which expired record will be served

this configuration set. On doing unbound-control set_option serve-expired no or making the serve-expired-client-timeout: 0 the servfails start getting logged again. So we know a mitigation but not the root cause.

Not sure if this is the expected behaviour. Can you help finding what's happening under the hood?

Sample config

remote-control:
    control-enable: yes
    control-port: 8953

server:
    do-ip6: no
    do-not-query-localhost: no
    interface: 127.0.0.1
    access-control: 0.0.0.0/0 allow
    username: "unbound"
    module-config: "subnetcache validator iterator"
    send-client-subnet: 10.0.0.0/8
    max-client-subnet-ipv4: 20
    rrset-roundrobin: no
    serve-expired: yes
    serve-expired-ttl: 3600            # one hour, in seconds - max time till expired records are served
    serve-expired-client-timeout: 2000  # 2 seconds, in milliseconds - Query timeout after which expired record will be served
    infra-host-ttl: 120                # 2 minutes - Limit the time till a nameserver can be on the blacklist
    infra-keep-probing: yes            # Test servers on the blacklist for reachability one at a time
    prefetch: yes
    cache-min-ttl: 1
    infra-cache-min-rtt: 250
    num-threads: 4
    msg-cache-slabs: 4
    rrset-cache-slabs: 4
    infra-cache-slabs: 4
    key-cache-slabs: 4
    rrset-cache-size: 1000m
    msg-cache-size: 2000m
    so-rcvbuf: 4m
    so-sndbuf: 4m
    so-reuseport: yes
    edns-buffer-size: 4096
    max-udp-size: 4096
    use-syslog: no
    logfile: /etc/unbound/not_working_unbound.log
    log-servfail: yes
    log-replies: yes
    qname-minimisation: no
gthess commented 3 days ago

Are you coming from a specific version prior to 1.19.2 that this worked?

shubhhq commented 3 days ago

Yes, it did on 1.17.1

gthess commented 3 days ago

When serve-expired is enabled it will try to replace a SERVFAIL answer with (expired) data from the cache. If that replace happens, there will be no logging output for the SERVFAIL that was replaced. But this behavior was there since 1.10.0 already.

Did you also change settings when moving versions?

I think there is a bug there (since 1.10.0 already) that if picking a recently cached record (probably the same SERVFAIL that we try to replace) the log-servfail option would not work; that is with serve-expired enabled.

shubhhq commented 2 days ago

However, in our case we've not been getting the cached response (there's nothing in cache) and getting SERVFAIL as a response, which doesn't feel right.

gthess commented 1 day ago

SERVFAILs can also get cached as a backoff mechanism if resolution fails. And there is no useful record in the cache IIRC. But the interesting question at the moment is if you also changed configuration between the versions, especially for the serve-expired options.