NLnetLabs / unbound

Unbound is a validating, recursive, and caching DNS resolver.
https://nlnetlabs.nl/unbound
BSD 3-Clause "New" or "Revised" License

Timeouts to forward servers on BSD-based systems with ASLR #887


Jakker commented 1 year ago

Reported on the FreeBSD Bugzilla: on BSD systems with ASLR enabled there are unexpected timeouts when using forwarders. One wonders why this is not happening (or has not been noticed) on Linux-type systems.

There is not enough information (yet) to judge what causes the problem.

montykabinski commented 1 year ago

After an upgrade to FreeBSD 14.2-RELEASE this bug may cause seriously unreliable operation. The operator of the system will have no warning or any obvious way to understand why a formerly solid Unbound installation has suddenly become flaky.

elfctl -e +noaslr /usr/local/sbin/unbound does work around the problem, but is an obscure solution for a difficult-to-diagnose problem.
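
For reference, a minimal sketch of applying and verifying that workaround, assuming the dns/unbound port and the binary path above; the flag-listing invocation and the service name are assumptions, so adjust for your install (the base-system resolver uses the local_unbound service instead).

```sh
# Show the current ELF feature flags on the binary (noaslr should appear after editing).
elfctl /usr/local/sbin/unbound

# Disable ASLR for this one binary and restart the daemon so it takes effect.
# Service name assumes the dns/unbound port; the base system uses local_unbound.
elfctl -e +noaslr /usr/local/sbin/unbound
service unbound restart
```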

grahamperrin commented 1 year ago

14.2-RELEASE

13.2-RELEASE, yes?

(We have not yet reached release of 14.0)


Also, for reference: elfctl(1)

montykabinski commented 1 year ago

Yes, sorry, I fat-fingered it; I meant to type 13.2-RELEASE.

rcmcdonald91 commented 1 year ago

We are seeing this in pfSense as well. Let me know how I might be of assistance here.

One thing that would be interesting to test is whether this is unique to a specific version of Unbound. Has anyone tried replicating this with an older version of Unbound?

Has anyone that can replicate this tried on the latest 14-CURRENT?

Reproducing this internally has been difficult, so it would be great if those who can reproduce it reliably could test these additional cases so we can start narrowing this down.

mzary commented 1 year ago

After an upgrade to FreeBSD 14.2-RELEASE this bug may cause seriously unreliable operation. The operator of the system will have no warning or any obvious way to understand why a formerly solid Unbound installation has suddenly become flaky.

Please don't scare people. This issue is probably configuration specific. I am running a bunch of dns/unbound instances and have never seen a problem manifest this way. The PR on the FreeBSD Bugzilla mentions the forward-tls-upstream: yes option as triggering this issue.
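
For anyone trying to reproduce this, a rough sketch of the kind of forwarding-over-TLS configuration being discussed; the file path, certificate bundle, and Quad9 addresses are assumptions taken from later comments in this thread, not a confirmed trigger.

```sh
# Hypothetical unbound.conf fragment; paths assume the FreeBSD dns/unbound port.
# Pull it in with an `include:` line in the main unbound.conf.
cat > /usr/local/etc/unbound/forward-tls.conf <<'EOF'
server:
    tls-cert-bundle: "/usr/local/share/certs/ca-root-nss.crt"

forward-zone:
    name: "."
    forward-tls-upstream: yes
    forward-addr: 9.9.9.9@853#dns.quad9.net
    forward-addr: 149.112.112.112@853#dns.quad9.net
EOF
```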

Maltz42 commented 1 year ago

It is configuration specific, but it's a relatively common configuration in pfSense, at least, and there are several forum and Reddit posts about it on that platform. SSL/TLS enabled while Unbound is in forwarding mode causes the issue. It's reliably reproducible (and resolvable) for me by toggling SSL/TLS, even though the problem itself is intermittent. iOS clients seem more affected - they may more stubbornly cache DNS failures, like I've observed on macOS. (I don't have any macOS clients to properly test, though.)

"Seriously unreliable" is not a huge exaggeration, imo, especially given how hard this frequent-but-intermittent issue is to track down to the specific config that triggers it. My personal experience had certainly been "quite annoying" at best, and definitely unacceptable in a production environment, for a few weeks before I figured it out. I was running into a complete failure of an iOS app/connection at least daily, requiring toggling WiFi (or waiting a significant amount of time) to get it to re-check the DNS and finally connect. A simple reload wouldn't cut it.

[edit: I mixed up DNSSEC with SSL/TLS. Fixed.]

theCyberTech commented 1 year ago

This is also happening on OPNsense, with BSD version:

The Unbound version is 1.17.1.

None of the workarounds work, including rebooting.

Philip-NLnetLabs commented 1 year ago

I put a fix for this issue in the branch 'freebsd-aslr-issue'. Please give it a try.
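
For anyone who wants to test it, a rough sketch of building that branch from source; the prefix and build steps are assumptions, so match them to how your package or port is normally built, and prefer testing the freshly built binary over overwriting a packaged install.

```sh
# Build the fix branch for testing. Configure options are an assumption; match
# your usual port/package build where possible.
git clone https://github.com/NLnetLabs/unbound.git
cd unbound
git checkout freebsd-aslr-issue
./configure --prefix=/usr/local   # run `autoconf && autoheader` first if configure is absent
make
# Either run the freshly built ./unbound by hand for a quick test, or install it:
# make install
```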

RobbieTT commented 1 year ago

I originally posted these observations over on pfSense, but these timeouts are intertwined with ASLR, server selection and IPv6:

In an effort to explain why DoT is so slow on pfSense / Unbound, I have run multiple pcaps to try to understand how the Unbound resolver is handling forwarded queries to the servers set in 'General Setup'. The findings are illuminating and I now understand why slow queries are selected, compounded and compounded again by TLS to the point of failure, whilst faster name servers are ignored.

On this simple successful test I am using 4 name servers from dns.quad9.net. Two are IPv4 servers and 2 more are on IPv6:
◦ 9.9.9.9
◦ 149.112.112.112
◦ 2620:fe::fe
◦ 2620:fe::9

From these servers a typical fast name response [for my connection] is 7ms but can be as high as 12ms. Clearly if there is a problem with a name server the response can be much slower, up to 300ms or more.

In this single-lookup example I used kia.com (as something unlikely to be used and therefore cached). The sequence:
◦ pfSense sends a single query to just 1 of the 2 IPv4 servers - 149.112.112.112
◦ All other servers ignored
◦ Answered to unbound in 151ms
◦ pfSense / unbound then sends a single query to just 1 of the 2 IPv6 servers - 2620:fe::9
◦ All other servers ignored
◦ Answered to unbound in 297ms
◦ DNS answered to client in 448ms
◦ This is the sum of the 2 queries, 151 + 297ms, as they are asked and answered sequentially
◦ The IPv6 query does not start until the IPv4 query is fully answered
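
A rough sketch of observing the same behaviour without a full pcap session, assuming the resolver answers on 127.0.0.1 and that drill and tcpdump from the base system are used; the interface name and the test domain are placeholders.

```sh
# In one terminal: watch DoT traffic to the forwarders (em0 is a placeholder interface).
tcpdump -ni em0 'tcp port 853'

# In another terminal: time a single lookup for a name unlikely to be cached,
# sent to the local resolver. Per the observation above, the capture should show
# one IPv4 forwarder asked first and an IPv6 forwarder only after that completes.
time drill kia.com @127.0.0.1
```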

The forwarded query does not go to all servers; one is simply picked. It does not particularly matter how fast or slow a server is; as long as it is deemed valid and returning an answer in under 400ms it can be picked. If a server normally capable of returning an answer in 7ms is struggling, but still under 400ms, it will continue to be used. Multiples of this added latency will then pollute the back-and-forth of the DoT TCP and TLS handshakes, leading to a considerable delay or potentially a failure. I have no answer as to why the attempt at using an IPv6 server only starts once the IPv4 DoT sequence is completed. Hopefully someone with more Unbound insight can answer this element?

For those of us with upstream name servers normally operating in the 7 to 12ms range, the acceptance of up to 400ms seems ridiculous. The somewhat random choice of server does little for the client but clearly eases the load at the upstream provider. Not having an option in pfSense to ask all servers and utilise the fastest compounds matters further. Only starting an IPv6 query once IPv4 has completed is another unhealthy delay. Added all together, along with ASLR and the additional handshakes of TCP/TLS, we are left with a slow and potentially unreliable DoT capability.
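
As an aside, and not a fix for the ASLR problem itself: recent Unbound releases have server-selection knobs that bias queries toward the fastest measured servers, which may soften the "anything under 400ms is acceptable" behaviour described above. A hedged sketch below; the file path is an assumption, and I make no claim about how much this helps in the DoT-forwarding case.

```sh
# Hypothetical tuning fragment for the server: clause; include it from unbound.conf.
cat > /usr/local/etc/unbound/fast-servers.conf <<'EOF'
server:
    # Out of 1000 queries, send this many to the set of fastest measured servers.
    fast-server-permil: 900
    # Size of that fastest-server set.
    fast-server-num: 2
EOF
```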

[Screenshot attached: 2023-05-16 at 10:24:21]

Philip-NLnetLabs commented 1 year ago

I originally posted these observations over on pfSense, but these timeouts are intertwined with ASLR, server selection and IPv6:

Does your issue improve if you disable ASLR? Did you try the patch that was added to this issue? Your issue does not seem to be ASLR related. Please create a separate issue if this is the case. It is likely that this issue will be closed soon because the original problem is solved.
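
For anyone testing this suggestion, a rough sketch of temporarily disabling ASLR system-wide on FreeBSD for comparison; the sysctl name is from memory, so verify it on your release, and the per-binary elfctl approach mentioned earlier remains the more targeted option.

```sh
# Temporarily disable ASLR for newly started 64-bit processes (verify the exact
# name with `sysctl kern.elf64.aslr` on your release). Not persistent across reboot.
sysctl kern.elf64.aslr.enable=0
service unbound restart

# Re-enable afterwards.
sysctl kern.elf64.aslr.enable=1
```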

RobbieTT commented 1 year ago

Does your issue improve if you disable ASLR? Did you try the patch that was added to this issue? Your issue does not seem to be ASLR related.

The issues are intertwined, as it is the totality of the delays that triggers the timeout failure. The patch was applied in pfSense above and it does make a measurable difference, as it reduces the overall sum. In equal measure, I can remove either the IPv4 or the IPv6 name servers and the ASLR problem goes away, with or without the patch, as the total time before being served an answer can be cut roughly in half. Add in the final variable of a slow server being selected and the totality may or may not be enough to trigger the timeout. Those with a slower RTT to a name server will have an additional sum.

The timeout error was the symptom and is seemingly caused by the sum of:

Timeout error (time) = sequential queries of IPv4 then IPv6 servers + SSL/TLS handshake + ASLR overhead + query to a single slow IPv4 name server + query to a single slow IPv6 name server + normal RTT to the name server

Not all users have all elements of the maths above, so they may never have experienced the timeout error, with or without the ASLR changes. The ASLR patch helps, but the remaining factors may still trigger the timeout error as it is a cumulative problem. As an example, simply disabling IPv6 name servers, or issuing queries to IPv4 and IPv6 name servers concurrently, would have 'resolved' the original issue for many users and effectively masked the ASLR issue.

My concern is that the original timeout issue remains a latent fault, as ASLR was just one factor in the timeout maths. I'm not sure how we address the multiple factors cohesively if they are split out too cleanly.

Edit: With the commit I have added the IPv4 / IPv6 sequential response issue to #899:

https://github.com/NLnetLabs/unbound/issues/899

☕️

Jakker commented 8 months ago

It turns out that the issue with ASLR is caused by a problem in the clang sanitizers when ASLR is enabled. There is a FreeBSD errata notice out, which can be found here.