NLnetLabs / unbound

Unbound is a validating, recursive, and caching DNS resolver.
https://nlnetlabs.nl/unbound
BSD 3-Clause "New" or "Revised" License

exceeded the maximum nameserver nxdomains #362


gzzhangxinjie commented 3 years ago

In my unbound server's log I found entries like:

error: SERVFAIL <a.b.example.com A IN>: exceeded the maximum nameserver nxdomains

(I have replaced the real domain with a.b.example.com.)

My questions are:

  1. What situation causes this error?
  2. We found that not only a.b.example.com fails to resolve, but xx.example.com fails as well. Why?
catap commented 4 months ago

@CompuRoot that seems to be another side of the same issue.

Inside the last huge log I discovered that the timeout between retries keeps increasing; I saw values around 20 seconds.

Probably on a larger setup the backoff timeout between retries is so large that it seems like it's stuck.

Frankly speaking, I'd prefer a smaller upper bound for the timeout, something like 1-2 seconds, which should keep it at about 32 requests per minute.

catap commented 4 months ago

After reading the code, I have some ideas:

infra-keep-probing: yes
infra-cache-max-rtt: 2000

Here I prevent unbound from marking an NS as broken, and decrease the maximum timeout between attempts to 2 seconds.

catap commented 4 months ago
infra-cache-max-rtt: 2000

That seems to make things worse; I reverted it.

wcawijngaards commented 4 months ago

The issue seems to be that the upstream servers are not responding particularly well, and unbound is very cautious in dealing with that. It is caused by other servers being under pressure, and unbound very nicely backs off. The back off is exponential and goes to 24h easily, and then it stops querying the upstream until it is up again. But in this case, you want unbound to keep trying. I am not sure which options make it do that; infra-keep-probing is a good one. Dropping infra-host-ttl to a very small value, e.g. infra-host-ttl: 5, could work, but I am not sure how useful it is. The number of retries is not hardcoded; it can be changed with outbound-msg-retry.
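
For reference, a minimal sketch of how those knobs fit together in an unbound.conf server: section; the values shown are illustrative examples, not recommendations from this comment:

    server:
        # keep sending an occasional probe to hosts that were marked down
        infra-keep-probing: yes
        # forget cached per-host RTT/backoff state quickly (default is 900 seconds)
        infra-host-ttl: 60
        # number of retries per upstream query (default is 5)
        outbound-msg-retry: 5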

catap commented 4 months ago

@wcawijngaards here is the issue: unbound is the default DNS resolver that ships with OpenBSD, for example. So a lot of users use it as a DNSBL resolver for their mail setup.

When someone builds a mail setup, there are usually two different strategies for using a DNSBL:

  1. On an incoming connection, make a blocking lookup, and if DNS doesn't work, reject the connection as a temporary error;
  2. On internal delivery, make the lookup, and if DNS doesn't work, keep the email in the queue.

With both strategies the default behaviour leads to delayed email. And if unbound introduces a 24h delay because one of the used lists is temporarily unavailable, that is quite sad.

If unbound needs a few minutes to resolve a domain, that isn't an issue. With (1), a spammer will probably abandon his attempt, which is good, and a legitimate mail server will retry delivery in a couple of minutes (like 15) when the DNS records are ready to be used. With (2), blocking while waiting for DNS is internally cheaper: just wait until it's done, and when it's done the mail continues.

So, from my point of view, to use unbound in an email setup it should:

  1. never mark an upstream as down; infra-keep-probing: yes should do this;
  2. increase the number of retries to something large, like outbound-msg-retry: 128;
  3. decrease the maximum retransmit timeout to 15 seconds, like infra-cache-max-rtt: 15000.

Am I right that these settings are enough to make unbound never give up on an upstream, and keep retrying to get a record from it for dozens of minutes?

wcawijngaards commented 4 months ago

The settings look like they follow the ideas. A config to make unbound send lots and lots of traffic to a server that is not responding well is not something that I know of. Mostly, there has been effort to make unbound send less traffic for servers that are in trouble.

gthess commented 4 months ago

What should happen if one of the services on the list is permanently broken?

catap commented 4 months ago

What should happen if one of the services on the list is permanently broken?

It depends on the email setup :)

Someone may have a timeout for the DNSBL response inside the software they use, like SpamAssassin or rspamd, configured as the DNS request timeout.

Anyway, that is the expected place to configure such settings, not hidden unbound settings that lead to a 24h delay.

catap commented 4 months ago

Mostly, there has been effort to make unbound send less traffic for servers that are in trouble.

The issue is that a 24h ban of a DNSBL server makes unbound absolutely useless.

Anyway, DNSBL servers are quite special and designed to be ready for such spikes of traffic.

Frankly speaking, I guess my 10% UDP packet loss to that server is a form of ignoring requests under overload. I've run a VM in the same Amazon EC2 region that b.barracudacentral.org uses and ran a UDP ping via hping from one of my mail servers, and it didn't see any packet loss.

CompuRoot commented 4 months ago

What should happen if one of the services on the list is permanently broken?

Most email systems don't depend on just one single RBL list, but query multiple independent sources to get a picture of the sender. Usually email administrators assign each RBL a weight (how much they trust it) and combine the decisions about the sender into a cumulative spam score. If some RBLs go "out of business" because unbound stops querying them, the cumulative spam score is reduced and scoundrels have a better chance to sneak into the inbox instead of ending up in the spam folder or being rejected.

wcawijngaards commented 4 months ago

From my reading of the code the setting infra-cache-max-rtt: 100 (msec) should do what you ask. Even in the presence of packet loss, the timeout is never made very large. That makes Unbound continue to send queries there, they are not turned away. If you are located further away from the other server, perhaps 500.
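
As a concrete sketch, that suggestion would look like the following server: snippet; 100 and 500 are the millisecond values mentioned above, so treat them as illustrative:

    server:
        # cap the retransmit timeout so the backoff never grows very large;
        # use e.g. 500 instead if the upstream servers are further away
        infra-cache-max-rtt: 100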

catap commented 4 months ago

Well, after reading the rbldnsd sources, I am starting to understand the behaviour on the other side.

rbldnsd is a quite trivial server which doesn't support multiple processes or threads. It is a single process that simply reads one UDP request from the network buffer, parses it, and replies if the request is valid.

Any spike of traffic to the DNSBL overflows the network buffers on the server, and the unbound user may enjoy a penalty for that.

Seems stupid, yeah.

catap commented 4 months ago

From my reading of the code the setting infra-cache-max-rtt: 100 (msec) should do what you ask. Even in the presence of packet loss, the timeout is never made very large. That makes Unbound continue to send queries there, they are not turned away. If you are located further away from the other server, perhaps 500.

I've tried running infra-cache-max-rtt: 2000 as I mentioned in https://github.com/NLnetLabs/unbound/issues/362#issuecomment-2078274111, but it doesn't help, because the limit of 5 attempts is still there.

Now I've been running the settings from https://github.com/NLnetLabs/unbound/issues/362#issuecomment-2079302241 for an hour with no issues. Yet?

wcawijngaards commented 4 months ago

Glad to hear there are no issues. Perhaps combine the settings, eg, with number of retries and the max rtt, if there seems to be a need for better settings.

catap commented 4 months ago

Glad to hear there are no issues. Perhaps combine the settings, eg, with number of retries and the max rtt, if there seems to be a need for better settings.

Let's wait a couple of days before saying it helps.

gthess commented 4 months ago

I would not use a high number of retries. If the server has a problem, your client does not help. Do you need this in order for Unbound to not reply with SERVFAIL and eventually get an answer?

catap commented 4 months ago

I would not use a high number of retries. If the server has a problem, your client does not help. Do you need this in order for Unbound to not reply with SERVFAIL and eventually get an answer?

I agree that it is bad practice, but I do not understand why this is a unique issue for unbound, when @CompuRoot points out that PowerDNS doesn't have such behaviour.

My point is to increase the number of retries to ride out a temporary overload of that server. Thus, I run it as:

    infra-keep-probing: yes
    outbound-msg-retry: 128
    infra-cache-max-rtt: 15000

That makes 4 requests per minute (one attempt per 15-second timeout) when the server is overloaded, which seems quite gentle, doesn't it?

CompuRoot commented 4 months ago

Do you need this in order for Unbound to not reply with SERVFAIL and eventually get an answer?

Yes, that's what other DNS clients do without any special tweaking

CompuRoot commented 4 months ago

Seems stupid, yeah.

AFAIK, barracuda at least is running something else for sure, and I also saw them shuffling the A records of the RBL NS, probably to mitigate DDoS.

gthess commented 4 months ago

That makes 4 requests per minute (one attempt per 15-second timeout) when the server is overloaded, which seems quite gentle, doesn't it?

For one Unbound yes. Now imagine people copying your configuration all over the place. A single Unbound will be full of said queries since it has to retry 128 times over a long period of time.

Furthermore, how long is the asking client going to wait for an answer from Unbound?

So the issue is not the initial SERVFAIL; it is that Unbound keeps a long history remembering that the upstream is bad. This can be solved by configuring the infra-cache with infra-keep-probing: yes and a small value for infra-cache-ttl.
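
One way to see what that history currently contains is to inspect the infra cache from the command line; dump_infra is an existing unbound-control command that prints the cached per-host state (round-trip time estimates, timeouts, lameness flags):

    unbound-control dump_infra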

What would also make sense to me in this case is for a SERVFAIL from Unbound to be treated as a temporary failure for a nameserver on a list, and for the email process to continue after some retries with some kind of agreement from the rest of the services. Does one service being unavailable (in this case barracuda) kill the whole thing?

catap commented 4 months ago

What would also make sense to me in this case is for a SERVFAIL from Unbound to be treated as a temporary failure for a nameserver on a list, and for the email process to continue after some retries with some kind of agreement from the rest of the services. Does one service being unavailable (in this case barracuda) kill the whole thing?

Kill? No. It just delays the delivery of the message, probably by something between 15 minutes and an hour.

Right now many people treat email as something that should be delivered instantly; they send codes to confirm your identity, or similar things with a validity of around 15 minutes.

And such unbound behaviour makes an email setup useless for that case.

Frankly speaking, it takes quite some time to discover this behaviour of unbound, which seems unusual compared with other DNS clients.

catap commented 4 months ago

And here it is again:

Apr 26 16:05:06 mx1 unbound: [69813:0] error: SERVFAIL <49.103.126.176.in-addr.arpa. PTR IN>: all servers for this domain failed, at zone 103.126.176.in-addr.arpa. no server to query no addresses for nameservers
catap commented 4 months ago

It seems that the settings in https://github.com/NLnetLabs/unbound/issues/362#issuecomment-2079418010 aren't enough.

Inside the mail server log I have:

Apr 26 16:05:06 mx1 smtpd[98788]: d8302d771f1fbe0a smtp connected address=176.126.103.49 host=<unknown>
Apr 26 16:06:21 mx1 smtpd[57137]: dnsbl: d8302d771f1fbe0a DNS error 2 on b.barracudacentral.org
Apr 26 16:06:21 mx1 smtpd[98788]: d8302d771f1fbe0a smtp disconnected reason=disconnect

That means it replied SERVFAIL within 1 minute 20 seconds, which is too fast, at least from how I understand those settings.

CompuRoot commented 4 months ago

Does one service being unavailable (in this case barracuda) kill the whole thing?

No, it doesn't. As I said previously, the cumulative spam weight combined from all the other RBLs gets lowered, and since some RBLs are valued much higher than others, this can affect reception greatly: delivery gets delayed by greylisting and, depending on the organization's requirements, mail is either passed to the spam folder with a delay or the sender is rejected. But these are all unrelated details; there are no universal setups. Some require rejecting, others require keeping mail at least in the spam box, and others are sensitive to any incoming traffic and don't want to miss anything; but due to a failure with an RBL, those people get viruses and scams in that case.

gthess commented 4 months ago

And here it is again:

Apr 26 16:05:06 mx1 unbound: [69813:0] error: SERVFAIL <49.103.126.176.in-addr.arpa. PTR IN>: all servers for this domain failed, at zone 103.126.176.in-addr.arpa. no server to query no addresses for nameservers

This is another problem, related to the actual problem of this issue:

dig ns2.as210546.net

; <<>> DiG 9.18.24 <<>> ns2.as210546.net
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 5790
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;ns2.as210546.net.      IN  A

;; AUTHORITY SECTION:
net.            861 IN  SOA a.gtld-servers.net. nstld.verisign-grs.com. 1714142545 1800 900 604800 86400

;; Query time: 0 msec
;; SERVER: 127.0.0.53#53(127.0.0.53) (UDP)
;; WHEN: Fri Apr 26 16:43:22 CEST 2024
;; MSG SIZE  rcvd: 118

ns2.as210546.net is the nameserver for 103.126.176.in-addr.arpa and it does not exist, nor does the ns1 variant. So no timeouts are involved for this one.

catap commented 4 months ago

The back off is exponential and goes to 24h easily and then stops querying the upstream, until it is up again.

BTW, I wonder how such a long time was legitimized. If unbound is installed in my WiFi router at home and decides to ban some upstream server, or the only upstream server... what should I do?

catap commented 4 months ago

This is another problem, related to the actual problem of this issue

Oops, wrong log line.

Interestingly, I don't have any other error log lines from unbound near that time.

wcawijngaards commented 4 months ago

Well, it probes the upstream infrequently to see if it is up, I think once every 15 minutes or so by default, with a single query at a time. And this is very nice and the result of Unbound having support to protect servers from large traffic amounts, and of making it not send queries to servers that are down. But here is a counter case where the packet loss should be dealt with by more pressure, with more packets sent out. But perhaps things are working, given the lack of logs and other issues? Otherwise I would suggest something like the settings from earlier, but with lower values, like: infra-keep-probing: yes, outbound-msg-retry: 12, infra-cache-max-rtt: 380. The maximum number of packets is not that big anyway; with 12 it then tries other hosts that could perhaps help, and 380 msec because it is much smaller but likely works as a timeout, chosen very similar to the unknown server timeout but slightly bigger.

catap commented 4 months ago

Run on settings:

    infra-keep-probing: yes
    outbound-msg-retry: 12
    infra-cache-max-rtt: 380
catap commented 4 months ago

@wcawijngaards your settings almost immediately lead to errors in the unbound log:

Apr 26 17:06:30 mx1 unbound: [37827:0] error: SERVFAIL <40.90.88.45.dnsbl.spfbl.net. A IN>: all servers for this domain failed, at zone dnsbl.spfbl.net. from 54.233.253.229 nodata answer
Apr 26 17:06:30 mx1 unbound: [37827:0] error: SERVFAIL <40.90.88.45.dnswl.spfbl.net. A IN>: all servers for this domain failed, at zone dnswl.spfbl.net. from 54.233.253.229 nodata answer
Apr 26 17:06:30 mx1 unbound: [37827:0] error: SERVFAIL <40.90.88.45.truncate.gbudb.net. A IN>: all servers for this domain failed, at zone gbudb.net. from 168.215.181.5 nodata answer
Apr 26 17:06:30 mx1 unbound: [37827:0] error: SERVFAIL <40.90.88.45.bl.0spam.org. A IN>: all servers for this domain failed, at zone 0spam.org. from 208.92.158.10
Apr 26 17:06:30 mx1 unbound: [37827:0] error: SERVFAIL <40.90.88.45.rbl.0spam.org. A IN>: all servers for this domain failed, at zone 0spam.org. from 208.92.158.10 got NXDOMAIN
Apr 26 17:06:30 mx1 unbound: [37827:0] error: SERVFAIL <40.90.88.45.wl.0spam.org. A IN>: all servers for this domain failed, at zone 0spam.org. no server to query nameserver addresses not usable
Apr 26 17:06:30 mx1 unbound: [37827:0] error: SERVFAIL <40.90.88.45.dnsbl.dronebl.org. A IN>: all servers for this domain failed, at zone dnsbl.dronebl.org. from 66.70.190.122 got NXDOMAIN
Apr 26 17:06:30 mx1 unbound: [37827:0] error: SERVFAIL <40.90.88.45.bl.spameatingmonkey.net. A IN>: exceeded the maximum nameserver nxdomains
Apr 26 17:06:30 mx1 unbound: [37827:0] error: SERVFAIL <40.90.88.45.mail-abuse.blacklist.jippg.org. A IN>: all servers for this domain failed, at zone blacklist.jippg.org. from 35.79.8.25 nodata answer
Apr 26 17:06:31 mx1 unbound: [37827:0] error: SERVFAIL <40.90.88.45.psbl.surriel.com. A IN>: exceeded the maximum nameserver nxdomains
Apr 26 17:06:32 mx1 unbound: [37827:0] error: SERVFAIL <40.90.88.45.b.barracudacentral.org. A IN>: all servers for this domain failed, at zone b.barracudacentral.org. upstream server timeout
wcawijngaards commented 4 months ago

I guess the previous settings were better, if not actually problem free? I wonder why.

catap commented 4 months ago

I guess the previous settings were better, if not actually problem free? I wonder why.

If you wish, I can capture a debug log again for some time, like 30-40 minutes, and send it to you directly via email. I can include a tcpdump of the traffic as well.

catap commented 4 months ago

I'd like to confirm that the settings:

    infra-keep-probing: yes
    outbound-msg-retry: 128
    infra-cache-max-rtt: 15000

reduce the number of DNSBL errors from 73 to 3 in 24h. I still have two errors from b.barracudacentral.org and one from zen.spamhaus.org, but it's much better than before.

Forza-tng commented 4 months ago

I'd like to confirm that the settings:

  infra-keep-probing: yes
  outbound-msg-retry: 128
  infra-cache-max-rtt: 15000

reduce the number of DNSBL errors from 73 to 3 in 24h. I still have two errors from b.barracudacentral.org and one from zen.spamhaus.org, but it's much better than before.

Are you also using qname-minimisation: no with this config?

catap commented 4 months ago

On Wed, 08 May 2024 20:58:05 +0100, Forza @.***> wrote:

Are you also using qname-minimisation: no with this config?

Nope, I've commented it out and it runs with the default setting.

Right now I run it with these settings:

    infra-keep-probing: yes
    outbound-msg-retry: 16
    infra-cache-min-rtt: 2000
    infra-cache-max-rtt: 15000

which leads to a few dozen errors per day.

-- wbr, Kirill

Forza-tng commented 4 months ago

On Wed, 08 May 2024 20:58:05 +0100, Forza @.***> wrote: Are you also using qname-minimisation: no with this config? Nope, I've commented it out and it runs with the default setting. Right now I run it with these settings:

infra-keep-probing: yes
outbound-msg-retry: 16
infra-cache-min-rtt: 2000
infra-cache-max-rtt: 15000

which leads to a few dozen errors per day.

Thanks, I came here too because I see these errors. My use case is similar, a VPS email server with rspamd, exim and unbound. Even with those settings, I get some errors like that. It is disconcerting.

I am using unbound 1.19.3 with these settings:

    qname-minimisation: no
    infra-keep-probing: yes
    infra-cache-max-rtt: 15000
    outbound-msg-retry: 128
    infra-cache-min-rtt: 2000

Update: I switched to unbound 1.20.0 and restarted the server. Within a few seconds there was another error:

2024-05-09T07:51:45.912+00:00 info unbound: [6211:0] info: start of service (unbound 1.20.0).
2024-05-09T07:51:57.790+00:00 info unbound: [6211:0] info: generate keytag query _ta-4f66. NULL IN
2024-05-09T07:52:02.803+00:00 err unbound: [6211:3] error: SERVFAIL <c.2.8.0.0.0.0.0.0.0.0.0.0.0.0.0.f.1.e.f.0.0.4.f.1.1.1.0.1.0.a.2.list.dnswl.org. A IN>: exceeded the maximum nameserver nxdomains
❯ unbound-host -d c.2.8.0.0.0.0.0.0.0.0.0.0.0.0.0.f.1.e.f.0.0.4.f.1.1.1.0.1.0.a.2.list.dnswl.org
[1715241617] libunbound[6493:0] notice: init module 0: validator
[1715241617] libunbound[6493:0] notice: init module 1: iterator
[1715241617] libunbound[6493:0] info: resolving c.2.8.0.0.0.0.0.0.0.0.0.0.0.0.0.f.1.e.f.0.0.4.f.1.1.1.0.1.0.a.2.list.dnswl.org. A IN
[1715241617] libunbound[6493:0] info: priming . IN NS
[1715241617] libunbound[6493:0] info: response for . NS IN
[1715241617] libunbound[6493:0] info: reply from <.> 2001:500:2::c#53
[1715241617] libunbound[6493:0] info: query response was ANSWER
[1715241617] libunbound[6493:0] info: priming successful for . NS IN
[1715241617] libunbound[6493:0] info: response for c.2.8.0.0.0.0.0.0.0.0.0.0.0.0.0.f.1.e.f.0.0.4.f.1.1.1.0.1.0.a.2.list.dnswl.org. A IN
[1715241617] libunbound[6493:0] info: reply from <.> 2001:7fe::53#53
[1715241617] libunbound[6493:0] info: query response was REFERRAL
[1715241617] libunbound[6493:0] info: response for c.2.8.0.0.0.0.0.0.0.0.0.0.0.0.0.f.1.e.f.0.0.4.f.1.1.1.0.1.0.a.2.list.dnswl.org. A IN
[1715241617] libunbound[6493:0] info: reply from <org.> 2001:500:b::1#53
[1715241617] libunbound[6493:0] info: query response was REFERRAL
[1715241617] libunbound[6493:0] info: response for c.2.8.0.0.0.0.0.0.0.0.0.0.0.0.0.f.1.e.f.0.0.4.f.1.1.1.0.1.0.a.2.list.dnswl.org. A IN
[1715241617] libunbound[6493:0] info: reply from <dnswl.org.> 2a01:4f8:c0c:4526::2#53
[1715241617] libunbound[6493:0] info: query response was REFERRAL
[1715241617] libunbound[6493:0] info: resolving a.ns.dnswl.org. AAAA IN
[1715241617] libunbound[6493:0] info: resolving a.ns.dnswl.org. A IN
[1715241617] libunbound[6493:0] info: response for a.ns.dnswl.org. AAAA IN
[1715241617] libunbound[6493:0] info: reply from <dnswl.org.> 94.130.169.93#53
[1715241617] libunbound[6493:0] info: query response was nodata ANSWER
[1715241617] libunbound[6493:0] info: response for a.ns.dnswl.org. A IN
[1715241617] libunbound[6493:0] info: reply from <dnswl.org.> 2a01:7e00:e000:293::b:2000#53
[1715241617] libunbound[6493:0] info: query response was nodata ANSWER
[1715241617] libunbound[6493:0] info: response for a.ns.dnswl.org. AAAA IN
[1715241617] libunbound[6493:0] info: reply from <dnswl.org.> 178.79.182.197#53
[1715241617] libunbound[6493:0] info: query response was ANSWER
[1715241617] libunbound[6493:0] info: response for a.ns.dnswl.org. A IN
[1715241617] libunbound[6493:0] info: reply from <dnswl.org.> 173.255.240.115#53
[1715241617] libunbound[6493:0] info: query response was ANSWER
[1715241617] libunbound[6493:0] info: response for c.2.8.0.0.0.0.0.0.0.0.0.0.0.0.0.f.1.e.f.0.0.4.f.1.1.1.0.1.0.a.2.list.dnswl.org. A IN
[1715241617] libunbound[6493:0] info: reply from <list.dnswl.org.> 139.162.192.198#53
[1715241617] libunbound[6493:0] info: query response was NXDOMAIN ANSWER
[1715241617] libunbound[6493:0] info: resolving 2.list.dnswl.org. A IN
[1715241617] libunbound[6493:0] info: response for c.2.8.0.0.0.0.0.0.0.0.0.0.0.0.0.f.1.e.f.0.0.4.f.1.1.1.0.1.0.a.2.list.dnswl.org. A IN
[1715241617] libunbound[6493:0] info: reply from <list.dnswl.org.> 139.162.192.198#53
[1715241617] libunbound[6493:0] info: query response was NXDOMAIN ANSWER
Host c.2.8.0.0.0.0.0.0.0.0.0.0.0.0.0.f.1.e.f.0.0.4.f.1.1.1.0.1.0.a.2.list.dnswl.org not found: 3(NXDOMAIN).
catap commented 4 months ago

@Forza-tng I suggest switching qname-minimisation: no off; it should reduce errors a bit.

It lets me avoid a days-long ban of an NS by unbound, and I can live with that.

Yes, it lets some spam sneak through, but that is my trade-off: responsive delivery of mail vs. less spam in the INBOX.

Forza-tng commented 4 months ago

@Forza-tng I suggest switching qname-minimisation: no off; it should reduce errors a bit.

It lets me avoid a days-long ban of an NS by unbound, and I can live with that.

Yes, it lets some spam sneak through, but that is my trade-off: responsive delivery of mail vs. less spam in the INBOX.

Is there a difference between no and off?

I am currently testing this configuration:

    qname-minimisation: no
    infra-keep-probing: yes
    infra-cache-max-rtt: 15000
    infra-cache-min-rtt: 1000
    outbound-msg-retry: 128
    max-sent-count: 64

Unfortunately the errors still happen only a few seconds after a restart :(


2024-05-09T10:00:36.543+00:00 info unbound: [9477:0] info: start of service (unbound 1.20.0).
2024-05-09T10:00:55.150+00:00 info unbound: [9477:0] info: generate keytag query _ta-4f66. NULL IN
2024-05-09T10:00:58.038+00:00 err unbound: [9477:1] error: SERVFAIL <mta-2d57c43b.ip4.emsmtp.us.xxxxxxxxxr.dbl.dq.spamhaus.net. A IN>: exceeded the maximum nameserver nxdomains
catap commented 4 months ago

On Thu, 09 May 2024 10:53:34 +0100, Forza @.***> wrote:

Is there a difference between no and off?

I mean to switch it off by commenting out that line, so the default setting, which is yes, is used.
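
For clarity, a minimal sketch of what that looks like in unbound.conf (the built-in default for qname-minimisation is yes):

    server:
        # qname-minimisation: no   (commented out, so the default of yes applies)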

I am currently testing this configuration:

    qname-minimisation: no
    infra-keep-probing: yes
    infra-cache-max-rtt: 15000
    infra-cache-min-rtt: 1000
    outbound-msg-retry: 128
    max-sent-count: 64

If you find something that works better, can you share it here?

-- wbr, Kirill

Forza-tng commented 4 months ago

So far qname-minimisation with yes or no doesn't seem to affect the issue.

I'm just hoping that the error is not preserved for a long time as it does increase the amount of spam reaching my inbox.

catap commented 4 months ago

On Thu, 09 May 2024 13:22:48 +0100, Forza @.***> wrote:

So far qname-minimisation with yes or no doesn't seem to affect the issue.

Same, but I do have quite a few errors to measure it by: dozens per day.

I'm just hoping that the error is not preserved for a long time as it does increase the amount of spam reaching my inbox.

Same. But the DNSBL is checked on each connection to the SMTP server, and I use a few dozen DNS blacklists and whitelists. So unbound runs a few thousand lookups per day on my setup, and about 1% of

Just the numbers:

    7 b.barracudacentral.org
    8 bl.spamcop.net
    7 dnsbl.sorbs.net
    8 spam.dnsbl.sorbs.net

while on the same day the servers had 605 incoming connections, and each was tested against 36 DNS black- or whitelists.

So they ran 21,780 queries and only 30 of them failed, or 0.14%.

Interestingly, I see issues only from these DNS servers, and all the others work quite stably. And the issue appears on both of my mail servers, which are hosted in the EU but in different countries.

-- wbr, Kirill

catap commented 4 months ago

@wcawijngaards and I just found another use case which leads to banning an upstream NS without good reason.

I run unbound as a local resolver on my laptop with what I assume is the default OpenBSD config:

server:
    interface: 127.0.0.1
    interface: ::1

    access-control: 0.0.0.0/0 refuse
    access-control: 127.0.0.0/8 allow
    access-control: 10.36.25.0/24 allow
    access-control: ::0/0 refuse
    access-control: ::1 allow

    hide-identity: yes
    hide-version: yes

    auto-trust-anchor-file: "/var/unbound/db/root.key"
    val-log-level: 2

    aggressive-nsec: yes

    local-zone: "qjz9zk." always_null

remote-control:
    control-enable: yes
    control-interface: /var/run/unbound.sock

I connect to the internet via WiFi, but the last mile isn't stable and can be offline for minutes.

My browser opens tons of tabs, and if I restart it while the internet doesn't work, unbound bans some popular NS servers for the domains in my tabs, because they appear to be failing due to the network issue between the router and the internet.

After the internet is reconnected, I don't have access to parts of it because unbound has banned some NS, and I have to restart unbound.

Frankly speaking, I guess a real user of an embedded device would power-cycle the device instead of waiting until unbound unbans the NS.

So this behaviour is wrong for both: servers with a stable connection, and laptops with unstable connections.

Forza-tng commented 4 months ago

Perhaps a solution is to add a config option to control the logic behind the SERVFAIL, or to allow it to be disabled?

catap commented 4 months ago

Perhaps a solution is to add a config option to control the logic behind the SERVFAIL, or to allow it to be disabled?

From my point of view this should be disabled, because I still don't see any use case where it doesn't create misunderstandings, and digging down to the root cause of this issue isn't easy, which is the conclusion of this thread.

mnordhoff commented 4 months ago

Any authoritative server that goes down now has two problems: Whatever caused it to go down initially, and then the subsequent DDoS from resolvers aggressively retrying queries. Temporarily reducing traffic to failing or down servers is believed to improve the health of the DNS as a whole, by reducing wasteful traffic and making it more possible for servers -- especially overloaded ones -- to get back on their feet.

Additionally, if a zone has some servers that are down and some that are up, preferring those that are up makes resolution faster and more reliable.

The trade-off is that it takes resolvers a few minutes to recover from outages.

If every resolver retried more aggressively, it would make some outages worse, last longer, and cause struggling servers to fail entirely.

Nonetheless, this is already configurable (or partly configurable?) in Unbound.

CompuRoot commented 4 months ago

Temporarily reducing traffic to failing or down servers is believed to improve the health of the DNS as a whole

Which official documentation/RFC believes that?
It isn't the client's job to help a downed server "get back on its feet." That is the job of the server and of the infrastructure where those down servers are hosted. It's their job to mitigate DDoS and to choose appropriate hardware and architecture that handles the load and avoids going down; it is not a client's problem at all.
If unbound were the only resolver on the planet, then there might be a reason to believe in that theory, but there are plenty of other DNS clients/resolvers that don't follow such an undocumented practice of stopping the processing of clients' queries in the belief that it helps weak DNS servers.
If someone really wanted to DDoS a DNS server, they definitely wouldn't use unbound as the tool to defeat a weak server; there are much simpler and more effective solutions for that.

If every resolver retried more aggressively, it would make some outages worse, last longer, and cause struggling servers to fail entirely.

"If every" - is the key, unbound is not alone, it took voluntarily obligation to be a honest gentlemen, even so nobody(its users) asked for it, while other resolvers still do their job.

If there would be a single chain of client/server that operates by the same supervision globally than it might make a sense, but unbound can't control and help to the all DNS world.

Nonetheless, this is already configurable (or partly configurable?) in Unbound.

There is no clear definition and no setting that can explicitly disable the behavior which marks a server as SERVFAIL for a long period of time even after it has already "gotten back on its feet". The solution found in this thread is a workaround; it's still a cat-and-mouse game that requires spending time keeping an eye on the logs and tweaking unbound settings whenever it reaches a limit and stops resolving.

catap commented 4 months ago

The solution found in this thread is a workaround; it's still a cat-and-mouse game that requires spending time keeping an eye on the logs and tweaking unbound settings whenever it reaches a limit and stops resolving.

And it seems incomplete. I still have enough errors to continue the investigation, because I doubt that a few percent of failed requests is acceptable, and this seems to be a fairly unique issue for unbound.

catap commented 4 months ago

Here is why I think it is a bug in unbound. Let's assume that I have this config:

    infra-keep-probing: yes
    outbound-msg-retry: 16
    infra-cache-min-rtt: 1000
    infra-cache-max-rtt: 1000

and after restarting unbound to enable this config and the debug log, I ran 4 queries to dnsbl.sorbs.net:

mx2$ time host 229.237.206.109.dnsbl.sorbs.net. 127.0.0.1; time host 229.237.206.109.dnsbl.sorbs.net. 127.0.0.1 
;; connection timed out; no servers could be reached
    0m10.01s real     0m00.01s user     0m00.00s system
Using domain server:
Name: 127.0.0.1
Address: 127.0.0.1#53
Aliases: 

Host 229.237.206.109.dnsbl.sorbs.net. not found: 3(NXDOMAIN)
    0m04.47s real     0m00.00s user     0m00.01s system
mx2$ time host 232.237.206.109.dnsbl.sorbs.net. 127.0.0.1; time host 232.237.206.109.dnsbl.sorbs.net. 127.0.0.1 
Using domain server:
Name: 127.0.0.1
Address: 127.0.0.1#53
Aliases: 

Host 232.237.206.109.dnsbl.sorbs.net. not found: 3(NXDOMAIN)
    0m06.27s real     0m00.00s user     0m00.00s system
Using domain server:
Name: 127.0.0.1
Address: 127.0.0.1#53
Aliases: 

Host 232.237.206.109.dnsbl.sorbs.net. not found: 3(NXDOMAIN)
    0m00.00s real     0m00.00s user     0m00.00s system
mx2$

As you can see, the second query always works, and the first one never works.

The debug log is relatively small, so I've attached it here: unbound.log

wcawijngaards commented 4 months ago

Yes, the behaviour is caused by a desire to be good to infrastructure elements, and the options to tweak this are being used to make changes here. Unbound is meant to not aggressively requery for information, and to not overload servers that are down.

There is a standard about it: RFC 1536 (https://datatracker.ietf.org/doc/rfc1536/) talks several times about fast retransmission being problematic. Perhaps other resolver software should also fail here.

In addition, RFC 4697 (https://datatracker.ietf.org/doc/rfc4697/) discusses several issues with aggressive requerying. This is not wanted.

RFC 8767 (https://datatracker.ietf.org/doc/rfc8767/) really talks about something else, expired responses, but in section 6 the discussion veers to say that fast retries can cause congestive collapse due to rapid refreshes. It suggests a 30-second timeout; this is in unbound as serve-expired-ttl as a result.

There is also RFC 8906 (https://datatracker.ietf.org/doc/rfc8906/), which talks about the fact that failure to communicate causes classification problems. And also in this issue tracker, misclassification is causing some trouble, e.g. the issue of the laptop and the connection failure. And it also seems that the mail server information provider's server fails to respond, and that gives trouble in that unbound kindly backs off.

So it is actually fairly important, in terms of keeping DNS service working, to have a gentle algorithm.

The suggested settings here are not that bad, actually, but they do seem to occasionally not work. For the laptop example, unbound is unaware of the network change and cannot really detect it quickly. A command from unbound-control, like the command to flush the infra cache, could be used to set things up after a network change.
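
For example, after the network comes back, the accumulated per-host backoff state can be cleared with the existing unbound-control command for the infra cache:

    # drop all cached RTT/backoff/lameness state for upstream servers
    unbound-control flush_infra all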

For the mail server issues, the problem seems to be compounded by NXDOMAIN answers that the upstream must be giving, since those are quoted in the logs. That is apart from the failures to communicate, which Unbound's current algorithm interprets not as a signal to retry rapidly but as a reason to stay away from the server. But the keep-probing option is there to somewhat alleviate that.

wcawijngaards commented 4 months ago

Regarding the sorbs.net query logs: it seems like the query to sorbs.net does not get answered. It then retries, again and again, and then gets this answer:

[1715521765] unbound[46551:0] info: reply from <sorbs.net.> 108.59.170.201#53
[1715521765] unbound[46551:0] info: incoming scrubbed packet: ;; ->>HEADER<<- opcode: QUERY, rcode: REFUSED, id: 0
;; flags: qr ; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0 
;; QUESTION SECTION:
dnsbl.sorbs.net.    IN  A

;; ANSWER SECTION:

;; AUTHORITY SECTION:

;; ADDITIONAL SECTION:
;; MSG SIZE  rcvd: 33

Then after more retries it gets a reply from <sorbs.net.> 72.12.198.241#53 with a long list of NS records for dnsbl.sorbs.net, from rbldns1.sorbs.net. to rbldns18, together with 18 IPv4 addresses. Unbound tries to send to several addresses from the list, but receives a number of timeouts and no response. Then it gets a response:

[1715521777] unbound[46551:0] info: reply from <dnsbl.sorbs.net.> 72.12.198.248#53
[1715521777] unbound[46551:0] info: incoming scrubbed packet: ;; ->>HEADER<<- opcode: QUERY, rcode: NXDOMAIN, id: 0
;; flags: qr aa ; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0 
;; QUESTION SECTION:
229.237.206.109.dnsbl.sorbs.net.    IN  A

;; ANSWER SECTION:

;; AUTHORITY SECTION:
dnsbl.sorbs.net.    3600    IN  SOA rbldns0.sorbs.net. dns.isux.com. 1715521681 7200 7200 604800 3600

;; ADDITIONAL SECTION:
;; MSG SIZE  rcvd: 105

This is an NXDOMAIN response for the target query, from the upstream server. It then attempts to perform DNSSEC validation. This works and the query is answered.

Then the query comes in for 232..

At this point it receives NXDOMAIN for the 109.dnsbl.sorbs.net A query due to query minimisation. It attempts to perform NXDOMAIN validation for query minimisation, but this fails; this could be a bug, perhaps unrelated, as it picks up the cached NXDOMAIN in reply to that.

But that does not seem to fail the resolution here; it continues to work.

The reason the first lookup in the list of command lines failed is that it took a long time, due to the number of timeouts, caused by the upstream server not responding.