NLnetLabs / unbound

Unbound is a validating, recursive, and caching DNS resolver.
https://nlnetlabs.nl/unbound
BSD 3-Clause "New" or "Revised" License

unbound kind of "loops" (100% CPU time) and is no longer responsive #1113

Open mistersixt opened 1 month ago

mistersixt commented 1 month ago

[Attached screenshots: perf-top-output, top-with-threads]

Describe the bug

Since version 1.20.0, unbound starts using 100% CPU time after a few hours, sometimes even 200% and more, and DNS answers take a very long time. After some more time, DNS requests are no longer answered at all. A regular "kill" does not stop the process; it needs "kill -9". Attached are the output of "top" and of "perf top -p <pid-of-unbound>".

The first "unbound" entries in "perf top" show "lruhash_lookup" and "rbtree_find_less_equal".

The number of DNS requests is limited using iptables and hashlimit (150 per minute with a burst of 45).

This behaviour also occurs with the current master source, 1.20.1 right now.

To reproduce

Steps to reproduce the behavior:

  1. see above.

Expected behavior

Unbound should stay responsive at all times and not start looping after a few hours (it currently happens roughly every 6 to 12 hours).

System:

Version 1.20.1

Configure line: --with-libevent
Linked libs: libevent 2.1.12-stable (it uses epoll), OpenSSL 3.0.13 30 Jan 2024
Linked modules: dns64 respip validator iterator


wcawijngaards commented 1 month ago

It should not loop like that; I would like to know what unbound is looping over. The perf output suggests this is entirely within libevent and some anonymous functions, which I assume are inlined in libevent.

Is it possible to get an ordinary stack trace, for example with gstack <pid>, maybe several times to catch different parts of the loop? The lruhash lookup and rbtree find results are likely from the other threads, which are perhaps serving ordinary cache responses and lookups. It would be nice to be able to reproduce the issue, but I have no clue what the cause is.
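For anyone who wants to automate the "several times" part, here is a minimal sketch in Python. It assumes gstack is installed and on the PATH (it typically ships with gdb) and that you have permission to attach to the process. The helper names `build_command` and `sample_stacks` are made up for this illustration and are not part of unbound or gstack.

```python
import subprocess
import time


def build_command(pid, tool="gstack"):
    """Build the argument list for one stack dump (tool name is an assumption)."""
    return [tool, str(pid)]


def sample_stacks(pid, samples=5, interval=1.0, run=subprocess.run):
    """Collect `samples` stack dumps of `pid`, waiting `interval` seconds between them.

    `run` is injectable so the function can be exercised without a real gstack.
    """
    traces = []
    for i in range(samples):
        result = run(build_command(pid), capture_output=True, text=True)
        traces.append(result.stdout)
        if i < samples - 1:
            time.sleep(interval)
    return traces


if __name__ == "__main__":
    import sys

    # Usage: python sample_stacks.py <pid-of-unbound>
    for number, trace in enumerate(sample_stacks(int(sys.argv[1])), start=1):
        print(f"===== sample {number} =====")
        print(trace)
```

Comparing the dumps side by side should show which frames stay the same across samples, which is usually where the loop is.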

mistersixt commented 4 weeks ago

Hi,

after I had a very similar situation with Prosody (XMPP server) running on the very same server box, using 100% CPU time after a while and printing "too many open files" into the error log, I increased the "nofile" entry in /etc/security/limits.conf. Both unbound and Prosody have been running fine since (ulimit was showing 1024; I increased it to 50000).

I cannot tell whether this is related, but there does seem to be a connection somehow.

Kind regards, mistersixt.
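To check whether a running process is close to its open-file limit, something like the following Python sketch may help. It assumes a Linux system with /proc mounted, and reading another user's /proc/<pid>/fd usually requires root. The function names are hypothetical helpers written for this example.

```python
import os
import resource


def own_nofile_limit():
    """Return (soft, hard) RLIMIT_NOFILE for the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def parse_max_open_files(limits_text):
    """Extract (soft, hard) from the 'Max open files' row of /proc/<pid>/limits.

    Numeric values are returned as int; 'unlimited' is returned as None.
    Returns None if the row is not present.
    """
    def to_number(field):
        return int(field) if field.isdigit() else None

    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            fields = line.split()
            # fields: ['Max', 'open', 'files', <soft>, <hard>, 'files']
            return to_number(fields[3]), to_number(fields[4])
    return None


def open_fd_count(pid):
    """Count the file descriptors currently open in process `pid` (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


if __name__ == "__main__":
    import sys

    # Usage: python fd_check.py <pid-of-unbound>
    pid = int(sys.argv[1])
    with open(f"/proc/{pid}/limits") as handle:
        soft, hard = parse_max_open_files(handle.read())
    print(f"open fds: {open_fd_count(pid)}, soft limit: {soft}, hard limit: {hard}")
```

If the open descriptor count sits at or near the soft limit while the process is spinning, that would support the "too many open files" theory; if it is far below, the cause is probably elsewhere.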