ShipeiXu opened this issue 5 months ago
There are several IPv6 addresses. Since my machine does not have public IPv6 connectivity, the UDP connect fails. In that case the RTO of the IPv6 addresses should also grow, but in practice it does not grow.
As I understand it, the servers are intentionally dropping packets, which triggers the timeout logic in Unbound. Unbound then applies an exponential backoff timer, waiting longer and longer between packets until the servers' timeout reaches the configured upper limit. At that point Unbound considers the servers offline and does not waste traffic on them.
`infra-keep-probing: yes` allows Unbound to send probes to a down server in case it is up and useful again. The probes are guaranteed to happen at least every `infra-host-ttl` (900 seconds by default). (A probe can also happen earlier, but if the server is still down the next probe will likely be at `infra-host-ttl`.)
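For reference, a minimal `unbound.conf` fragment exercising these options might look like the sketch below; the values are illustrative, not recommendations:

```
server:
    # Keep probing hosts that are marked down (default: no).
    infra-keep-probing: yes
    # Lifetime of infra cache entries; lowering this makes probes
    # of expired records happen sooner (default: 900 seconds).
    infra-host-ttl: 900
```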
Now, with that out of the way: upstream nameservers failing to communicate and introducing resolver timeouts is not Unbound's problem. There is also a relevant RFC about that: https://datatracker.ietf.org/doc/rfc8906/.
The exponential backoff logic happens when a request times out; if your system does not support IPv6, the query is likely not getting out at all. In that case `do-ip6: no` (or `prefer-ip4: yes` if you want to serve IPv6 clients) helps with upstream server selection.
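A sketch of the two alternatives for an IPv4-only host, as suggested above (pick one, not both):

```
server:
    # Do not use IPv6 at all for queries.
    do-ip6: no
    # Or, to keep serving IPv6 clients while preferring IPv4 upstream:
    # prefer-ip4: yes
```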
I would like to close this as a non-issue, but I'll leave it open in case I misunderstood something from your text :)
You're right, I can follow the logic, but I still expect better behavior. I tested BIND and Knot Resolver; they hold up better against this kind of attack in the same scenario. As a DNS resolver, we cannot control the requests we receive. When a large number of clients send queries for *.taobao.com, Unbound marks all NSes of taobao.com as timed out, and this happens very quickly. Is exponential backoff really best practice? I'm skeptical of this implementation. BIND seems to use logarithmic backoff: even if the authoritative server loses a lot of packets, SERVFAIL does not happen quickly.
The `infra-keep-probing: yes` configuration does not behave as I expected. I set `infra-cache-max-rtt` to 2000 to reproduce the problem faster. When all nameservers have timed out and I query a normal domain name again, Unbound always responds with SERVFAIL. I don't understand what role `infra-keep-probing` plays.
According to my understanding, Unbound should initiate an additional probe after each request it receives.
`infra-keep-probing: yes` does not probe on every request. You can try bringing `infra-host-ttl` down to see it probing once the infra record has expired.
If servers time out, Unbound first backs off aggressively to give them time to come back up or to deal with a potential traffic spike. If they still time out after the configured maximum (12 seconds by default), they are considered down and may be probed in the future. This is fine for a server under pressure: it stays out of Unbound's selection and may be reprobed based on configuration.
This case, however, applies to all misbehaving upstream nameservers. Fixing it just for them brings unneeded server-selection shenanigans and retries to Unbound for all other nameservers that are simply down.
I understand the "attack" you are talking of, but I don't see how this is Unbound's problem.
@ShipeiXu see https://github.com/NLnetLabs/unbound/issues/908 where `do-ip6: no` serves as a kind of workaround, and https://github.com/NLnetLabs/unbound/issues/362 where we also have a very long discussion regarding the Unbound logic.
The infra cache's exponential RTO backoff is not a good algorithm. I insist that this should be a logarithmic backoff: one that gets infinitely close to the maximum RTO but never reaches it.
With exponential backoff Unbound tries to be generous to poorly connected nameservers by doubling the timeout while waiting for an answer. It also makes Unbound aggressive in dropping non-responding nameservers from server selection, by reaching the top configured timeout faster. Non-responding nameservers are either under load and can't keep up, or simply broken. For the former it is good that Unbound stops contacting them and prefers other nameservers. For the latter it is good that Unbound stops contacting them because they are broken and waste Unbound's time.
Also, a server's timeout needs to reach the top configured timeout (after several timeouts), since that is the criterion for declaring a nameserver non-responsive and removing it from selection. Unbound can then spend its time on responsive nameservers.
There are two distinct cases in your issue though: a) a server explicitly drops packets based on qname, qtype or similar instead of replying with an appropriate answer like REFUSED for example, and b) all the nameservers for a delegation are considered down.
For a) there is nothing for Unbound to do. There is an RFC that clearly states this is wrong behavior. The faulty behavior is with the upstream.
For b) maybe Unbound needs to do something different and allow more attempts (rather than the current none) to such a delegation, but this needs some thinking because it can have unexpected results in certain scenarios.
We have plans to augment server selection for configured forward/stub zones in the future and we can also revisit server selection for common nameservers.
Describe the bug
When the authoritative server does not respond for a specific domain name, Unbound imposes a very severe penalty on that authoritative server, and the process of marking the authoritative NSes as timed out is very fast because it is exponential. E.g. taobao.com: when I send Unbound a query for taobao.com, Unbound polls the four NS servers of taobao.com (ns4.taobao.com., ns5.taobao.com., ns6.taobao.com., ns7.taobao.com.) and all requests time out, causing the RTO to double. When I query again, the RTO doubles again. Soon taobao.com is marked as timed out in the infra cache, and normal queries under the taobao.com domain also quickly get SERVFAIL. This is a loophole: we cannot control client-side requests, but client requests must not affect Unbound's normal service.
To reproduce
Steps to reproduce the behavior:
Expected behavior
Like BIND: the RTO grows, but does not reach the upper limit any time soon. When the infra record's TTL expires, the RTO is reset to a smaller level.
System: `unbound -V` output:

```
Configure line: --with-libnghttp2 --prefix=/usr/unbound/ --enable-subnet --with-pthreads --with-libevent --enable-dnstap --enable-cachedb --enable-ipsecmod --enable-ipset --enable-linux-ip-local-port-range --enable-dnscrypt --enable-systemd --with-pythonmodule
Linked libs: libevent 2.0.21-stable (it uses epoll), OpenSSL 1.0.2k-fips 26 Jan 2017
Linked modules: dns64 python cachedb ipsecmod subnetcache ipset respip validator iterator
DNSCrypt feature available
BSD licensed, see LICENSE in source package for details.
Report bugs to unbound-bugs@nlnetlabs.nl or https://github.com/NLnetLabs/unbound/issues
```
```c
void rtt_lost(struct rtt_info* rtt, int orig)
{
	/* exponential backoff */
}
```