NLnetLabs / unbound

Unbound is a validating, recursive, and caching DNS resolver.
https://nlnetlabs.nl/unbound
BSD 3-Clause "New" or "Revised" License

memory corruption related core dumps #1017

Open kerolasa opened 7 months ago

kerolasa commented 7 months ago

**Describe the bug** The bug has already been discussed on the Unbound users mailing list:

https://lists.nlnetlabs.nl/pipermail/unbound-users/2024-February/008257.html

Short summary of the circumstances: the coredumps happen in locations that have bad network connectivity. I have a feeling cache handling has something to do with the issue. Assuming that is correct, it is worthwhile to know that the following settings are in use:

        serve-expired: yes
        serve-expired-ttl: 3600
        serve-expired-client-timeout: 500
        infra-keep-probing: yes

**To reproduce** Steps to reproduce the behavior:

  1. Not sure. Install a lot of Unbound instances in poorly connected locations, and wait long enough??

**Expected behavior** No coredumps.

**System:**

Configure line: --prefix=/usr --sysconfdir=/etc --disable-rpath --enable-dnscrypt --enable-dnstap --enable-pie --enable-relro-now --enable-systemd --enable-tfo-client --enable-tfo-server --with-libevent --with-libnghttp2 --with-pidfile=/run/unbound.pid --with-pythonmodule --with-pyunbound
Linked libs: libevent 2.1.12-stable (it uses epoll), OpenSSL 3.0.11 19 Sep 2023
Linked modules: dns64 python respip validator iterator
DNSCrypt feature available
TCP Fastopen feature available



**Additional information**

Collection of 117 gdb backtraces: [some-backtraces.tar.gz](https://github.com/NLnetLabs/unbound/files/14348477/some-backtraces.tar.gz)
wcawijngaards commented 7 months ago

Can I ask about the compilation? In particular, there is a coredump where the process_dnskey_response function calls key_cache_insert and that fails; that looks like a miscompilation to me. This happens to me when I switch between different versions in git and the dependency tracking is not good, so that stale object files are not recompiled, causing the compiler to link in code that refers to a different data layout. This kind of error looks like that. Is the code from a git checkout, and after a change in the code, was 'make clean' perhaps not run before the new compile? Or is there some other dependency tracking issue, where files are copied or modified so that the timestamps change, and there is a partial or previous compilation? In any case, a clean working directory, or 'make clean' before 'make', would remove the problem if dependency tracking is the issue. I have also seen similar miscompilation issues from using experimental, e.g. buggy, compiler options, like new optimizations. It could be good to use --disable-flto for that reason; the '-flto' option shows up in failures in a lot of reports, and one of the coredumps has the error that the code section in the core is the wrong size.
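The clean-rebuild suggestion above could be sketched like this (the source directory path is a placeholder, not a path from this report):

```shell
# Start from a pristine tree so no stale object files get linked in.
cd unbound               # placeholder path to the source checkout
make clean               # discard all previously compiled objects
./configure --disable-flto   # also rule out -flto miscompilation
make -j"$(nproc)"
```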

kerolasa commented 7 months ago

When we compile Unbound it is packaged, and the same package is used across systems here and there. There are many hundreds of active Unbound instances that do not show any symptoms that something in the binary would be wrong.

Oh, one thing more: we never run 'make clean' when building a production release, because there is nothing to clean. A Docker container performing the build starts from a clean slate: it downloads the release package, verifies its checksum, then compiles and packages. Only the package is kept; the rest goes to bit heaven.

To be transparent, here are our configure options:

        ./configure \
                --prefix=/usr \
                --sysconfdir=/etc \
                --disable-rpath \
                --enable-dnscrypt \
                --enable-dnstap \
                --enable-pie \
                --enable-relro-now \
                --enable-systemd \
                --enable-tfo-client \
                --enable-tfo-server \
                --with-libevent \
                --with-libnghttp2 \
                --with-pidfile=/run/unbound.pid \
                --with-pythonmodule \
                --with-pyunbound
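The clean-slate container build described above could be sketched roughly as follows; the version number and checksum-file naming are illustrative assumptions, not the actual pipeline:

```shell
# Illustrative sketch of a from-scratch release build; the version and
# the .sha256 file name are assumptions, not the real pipeline.
set -e
VERSION=1.19.1                                   # hypothetical release
curl -O "https://nlnetlabs.nl/downloads/unbound/unbound-${VERSION}.tar.gz"
curl -O "https://nlnetlabs.nl/downloads/unbound/unbound-${VERSION}.tar.gz.sha256"
sha256sum -c "unbound-${VERSION}.tar.gz.sha256"  # verify before building
tar xzf "unbound-${VERSION}.tar.gz"
cd "unbound-${VERSION}"
./configure --prefix=/usr --sysconfdir=/etc      # plus the options above
make -j"$(nproc)"
make install DESTDIR=/tmp/pkgroot                # package /tmp/pkgroot, discard the rest
```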
wcawijngaards commented 7 months ago

Could the program be run with address debugging? Perhaps that can catch the offending activity. There are two options. One is valgrind: run the program under valgrind. The other is to compile with libasan, the address sanitizer, with a configure line like

        CFLAGS="-fsanitize=address -g -O2 -DVALGRIND" CXXFLAGS="$CFLAGS" ./configure ...

I would then also pass --disable-flto to keep that optimization from interfering.
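A minimal sketch of the two options, assuming Unbound is started in the foreground with the usual config file path (the paths are illustrative):

```shell
# Option 1: run the existing binary under valgrind (slow, no rebuild needed).
valgrind /usr/sbin/unbound -d -c /etc/unbound/unbound.conf

# Option 2: rebuild with the address sanitizer baked in.
CFLAGS="-fsanitize=address -g -O2 -DVALGRIND" CXXFLAGS="$CFLAGS" \
        ./configure --disable-flto
make -j"$(nproc)"
# ASan aborts with a report at the first bad read/write, so run in the
# foreground and capture stderr:
./unbound -d -c /etc/unbound/unbound.conf 2> asan-report.txt
```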

The error can then perhaps be caught at the moment it writes wrongly, rather than later, when the data is already corrupted and a failure happens. The bad write could well be much more frequent than the core dumps, for instance when it overwrites something harmlessly. Seeing this kind of error at the time and place where it happens is a good way to find it; otherwise there are no clues as to where the issue is in the program code. Take care when starting the program: the debugging may make it sluggish.

The asan configure line includes the VALGRIND define, which makes Unbound use a hash function that does not cause false positives in the memory detector. Even though the address sanitizer is not valgrind, the false positive removal is convenient.