NLnetLabs / unbound

Unbound is a validating, recursive, and caching DNS resolver.
https://nlnetlabs.nl/unbound
BSD 3-Clause "New" or "Revised" License

exceeded the maximum nameserver nxdomains #362

Open gzzhangxinjie opened 3 years ago

gzzhangxinjie commented 3 years ago

In my unbound server, I found the run log like

error: SERVFAIL <a.b.example.com A IN>: exceeded the maximum nameserver nxdomains

(I used a.b.example.com in place of the real domain.)

The questions are:

  1. What situation causes this error?
  2. We found that not only does a.b.example.com fail to resolve; xx.example.com fails as well. Why?

catap commented 5 months ago

@wcawijngaards regarding the laptop issue: I connect to a WiFi network, but the router reaches the internet over an LTE uplink which isn't stable. I have no idea which component should send a command to invalidate Unbound's infra cache. A similar setup can be found in a hotel, or anywhere in the world where WiFi was just installed and somehow works.

So it is actually fairly important, in terms of keeping DNS service working, to have a gentle algorithm.

Probably, but such an algorithm shouldn't produce WTF moments for users that are quite tricky to dig into and understand. For example, RFC 1536 points at BIND as having a good algorithm; frankly speaking, BIND is a kind of reference implementation for DNS, and BIND doesn't have these issues. Let me quote the points this RFC presents as a good algorithm:

A GOOD IMPLEMENTATION:

   BIND (we looked at versions 4.8.3 and 4.9) implements a good
   retransmission algorithm which solves or limits all of these
   problems.  The Berkeley stub-resolver queries servers at an interval
   that starts at the greater of 4 seconds and 5 seconds divided by the
   number of servers the resolver queries. The resolver cycles through
   servers and at the end of a cycle, backs off the time out
   exponentially.

   The Berkeley full-service resolver (built in with the program
   "named") starts with a time-out equal to the greater of 4 seconds and
   two times the round-trip time estimate of the server.  The time-out
   is backed off with each cycle, exponentially, to a ceiling value of
   45 seconds.
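The quoted schedule can be sketched numerically; this is an illustration of RFC 1536's description (assuming the exponential back-off simply doubles, which the RFC does not specify), not actual BIND code:

```python
# Illustration of the retry schedule RFC 1536 attributes to BIND
# (assumed doubling back-off; not BIND source code).

def stub_interval(num_servers: int) -> float:
    """Berkeley stub resolver: greater of 4s and 5s divided by
    the number of servers queried."""
    return max(4.0, 5.0 / num_servers)

def named_timeouts(rtt_estimate_s: float, cycles: int, ceiling_s: float = 45.0):
    """Full-service resolver (named): start at the greater of 4s and
    2 * RTT estimate, back off exponentially, cap at 45s."""
    timeout = max(4.0, 2.0 * rtt_estimate_s)
    schedule = []
    for _ in range(cycles):
        schedule.append(min(timeout, ceiling_s))
        timeout *= 2.0
    return schedule

print(stub_interval(3))        # one of three servers -> 4.0 s
print(named_timeouts(0.5, 6))  # [4.0, 8.0, 16.0, 32.0, 45.0, 45.0]
```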

This "good algorithm" looks much more aggressive and resistant to UDP packet loss than Unbound's.

Have I missed something?

wcawijngaards commented 5 months ago

More aggressive? But it is not. It is both slower and tries less. Unbound's default values are much faster and also have a higher maximum. Also, this is for a very old version, really.

In addition, to be nice to upstream servers that are down, Unbound detects downtime for upstream servers.

Forza-tng commented 5 months ago

Isn't this the point?

The time-out is backed off with each cycle, exponentially, to a ceiling value of 45 seconds.

I take that to mean BIND will retry the same server after at most 45s?

catap commented 5 months ago

In addition for being nice to upstream down servers unbound detects downtime for upstream servers.

Which leads to a unique issue: this one.

catap commented 5 months ago

Isn't this the point?

The time-out is backed off with each cycle, exponentially, to a ceiling value of 45 seconds.

I take that to mean BIND will retry the same server after at most 45s?

and never reaches 24h, as was stated earlier in the thread. Let me quote it:

The back off is exponential and goes to 24h easily and then stops querying the upstream, until it is up again.

wcawijngaards commented 5 months ago

Look, why are you quoting BIND's behaviour here?

wcawijngaards commented 5 months ago

So you talk about why it should be fixed, and I quoted a reference, but a very old version of BIND is quoted there. So, aggressive retries are considered a problem, right? Do you just want to make this some sort of email discussion thread? Perhaps other DNS forums would be better for that.

Apart from that, it would be nice to actually fix bugs in unbound.

In this case the upstream server does not respond, or only very rarely. Perhaps set qname-minimisation: no and turn off DNSSEC. I am serious about it, even if you do not want that, because fewer lookups mean fewer SERVFAIL opportunities.

But it is true that the current Unbound timeout algorithm is entirely focused on not harming infrastructure. That means shutting off. And in this case that is not desired; the idea from spam filters is apparently to just blast at the upstream servers. In which case it is nice for Unbound to not work, since that means it is the nicest server of the bunch.

So that brings us to the misclassification issue: the failure to respond should perhaps not be classified as a lookup failure, as downtime. For that, settings like min-rtt: 1000 and max-rtt: 1000 should make Unbound ignore that signal; perhaps set min-rtt: 100 and max-rtt: 100, or even lower, min-rtt: 20 and max-rtt: 20, for very fast retries. Make sure the max is larger than the ping time, otherwise replies cannot make it back to the server within the time interval. That would make Unbound retry much more.

But then resolution failures still happen, I guess because of the number of timeouts, e.g. failed lookup attempts for the query. But I do not see that as a problem that happens in the logs so far.
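For concreteness, the min-rtt/max-rtt knobs referred to above are infra-cache-min-rtt and infra-cache-max-rtt in unbound.conf (the option names used in the configs later in this thread); a sketch with the 1000 ms values from this comment, to be tuned to your network:

```
server:
    # pin the RTT band so unanswered queries are retried quickly
    # instead of the server being marked as down
    infra-cache-min-rtt: 1000
    infra-cache-max-rtt: 1000
```

As noted above, the max must stay above the real round-trip time to the upstream servers, or replies arrive after Unbound has already given up on them.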

wcawijngaards commented 5 months ago

So it looks like the 'exceeded..' message for nxdomains for nameserver lookups, could actually be servfails due to timeouts and not actually nxdomains for those servers. Or perhaps they are. And then nxdomain is just a distraction where the actual issue, for that error message, is packet drop and nameserver address lookup.

But the code with that error message was introduced to stop a denial-of-service vulnerability. And that stays relevant; I do not want to remove that protection for a case where the upstream just fails to respond a lot of the time.

catap commented 5 months ago

Look, why are you quoting BIND's behaviour here?

Because it was explained in the link that you pointed to as a good algorithm.

Anyway, here we have an issue with an algorithm, and it causes problems in two setups.

  1. An email server which uses Unbound as a local resolver for DNS lookups against white and black lists. As was pointed out at the beginning, Unbound may ban an upstream server for up to 24h, which may mean no email is delivered during that time.
  2. A personal laptop connected to the internet via WiFi on an unstable network, or nearly the same setup where Unbound runs on a router connected via a modem with an unstable network.

It is a noble goal to not harm infrastructure, but I am not sure that a cost of "please wait minutes or more" is wise. For example, in case (2) with a router, the end user decides to reboot it by pulling power, which invalidates all caches; that means Unbound will make more queries than it would have if the cache had stayed intact.

For case (1) it means putting some magic constant into the config, which will probably stay there forever, or moving away from Unbound to another resolver which isn't installed by default by the OS.

Also, Unbound might be used as an embedded resolver by different software, and such default behaviour isn't what anyone expects, let's be honest. A good example of such software is OpenBSD's unwind: https://man.openbsd.org/unwind.8

catap commented 5 months ago

So it looks like the 'exceeded..' message for nxdomains for nameserver lookups, could actually be servfails due to timeouts and not actually nxdomains for those servers. Or perhaps they are. And then nxdomain is just a distraction where the actual issue, for that error message, is packet drop and nameserver address lookup.

But the code with that error message was introduced to stop a denial-of-service vulnerability. And that stays relevant; I do not want to remove that protection for a case where the upstream just fails to respond a lot of the time.

Am I reading it right that NXDOMAIN doesn't necessarily mean NXDOMAIN, but may be something else?

wcawijngaards commented 5 months ago

Yes, and in particular only for this error message, as it was initially added due to a denial-of-service vulnerability involving only nxdomains. It means that a lookup for a nameserver address did not result in an address.

catap commented 5 months ago

Well, with settings:

    qname-minimisation: no
    aggressive-nsec: no

    infra-keep-probing: yes
    infra-cache-min-rtt: 1000
    infra-cache-max-rtt: 1000

    outbound-msg-retry: 10
    max-sent-count: 128

I can't reproduce the issue where the first request after a restart leads to SERVFAIL.

Forza-tng commented 5 months ago

@wcawijngaards At least for me, the unexpected behaviour was the very long lockout time, not that there was a SERVFAIL if an NS didn't respond to every query.

I can imagine that very busy email servers could potentially cause a storm of DNS lookups, and minimising that is a good thing. The reality is though that we have huge amounts of spam and many antispam services rely on dns lookups like we have in this thread.

CompuRoot commented 5 months ago

But the code with that error message was introduced to stop a denial-of-service vulnerability. And that stays relevant; I do not want to remove that protection for a case where the upstream just fails to respond a lot of the time.

May I ask what you think about this situation: you get an SUV to drive WHENEVER you want, and one day the car stops and shows a message on the head unit: "Sorry, you can't take this road; it is too busy and has been hit so hard by OTHER trucks that I want to help it get back on its feet. Wait until it is paved with perfect shining asphalt; that's the only road I will ALLOW you to drive."

catap commented 5 months ago

The changes from https://github.com/NLnetLabs/unbound/issues/362#issuecomment-2107802231 brought me down to a few errors in 18h. So, I've increased things a bit more and deployed:

    qname-minimisation: no
    aggressive-nsec: no

    infra-keep-probing: yes
    infra-cache-min-rtt: 1000
    infra-cache-max-rtt: 1000

    outbound-msg-retry: 32
    max-sent-count: 128

gthess commented 5 months ago

With infra-cache-max-rtt: 1000 and max-sent-count: 128 we are looking at a possible resolution time of ~128 seconds where Unbound could get the answer in the last try or generate a SERVFAIL because upstream is not responding. What does the client of said query do during those 128 seconds?
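The ~128-second figure is simply the per-try timeout cap times the try budget; a back-of-the-envelope sketch (illustrative arithmetic, not Unbound internals):

```python
# Worst case: every allowed try times out at the capped RTT.
infra_cache_max_rtt_ms = 1000  # per-try timeout cap from the config above
max_sent_count = 128           # tries allowed per query

worst_case_s = infra_cache_max_rtt_ms * max_sent_count / 1000
print(worst_case_s)  # -> 128.0 seconds before an answer or SERVFAIL
```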

catap commented 5 months ago

@gthess the issue is that it never runs for 2 minutes. I have had 5 errors in the last 24h; here is an example:

May 15 10:52:25 mx2 smtpd[41279]: 13152876d9fd9318 smtp connected address=185.228.233.91 host=hielton.wiki
May 15 10:52:49 mx2 smtpd[63549]: dnsbl: 13152876d9fd9318 DNS error 2 on dnsbl.sorbs.net
May 15 10:52:49 mx2 smtpd[41279]: 13152876d9fd9318 smtp disconnected reason=disconnect

which means that Unbound spent no more than 24 seconds before returning SERVFAIL, which shouldn't be possible with these settings, should it?

The unbound statistics from that machine:

thread0.num.queries=27964
thread0.num.queries_ip_ratelimited=0
thread0.num.queries_cookie_valid=0
thread0.num.queries_cookie_client=0
thread0.num.queries_cookie_invalid=0
thread0.num.cachehits=13528
thread0.num.cachemiss=14436
thread0.num.prefetch=0
thread0.num.queries_timed_out=0
thread0.query.queue_time_us.max=0
thread0.num.expired=0
thread0.num.recursivereplies=14436
thread0.requestlist.avg=15.9454
thread0.requestlist.max=73
thread0.requestlist.overwritten=0
thread0.requestlist.exceeded=0
thread0.requestlist.current.all=0
thread0.requestlist.current.user=0
thread0.recursion.time.avg=0.671651
thread0.recursion.time.median=0.0493207
thread0.tcpusage=0
total.num.queries=27964
total.num.queries_ip_ratelimited=0
total.num.queries_cookie_valid=0
total.num.queries_cookie_client=0
total.num.queries_cookie_invalid=0
total.num.cachehits=13528
total.num.cachemiss=14436
total.num.prefetch=0
total.num.queries_timed_out=0
total.query.queue_time_us.max=0
total.num.expired=0
total.num.recursivereplies=14436
total.requestlist.avg=15.9454
total.requestlist.max=73
total.requestlist.overwritten=0
total.requestlist.exceeded=0
total.requestlist.current.all=0
total.requestlist.current.user=0
total.recursion.time.avg=0.671651
total.recursion.time.median=0.0493207
total.tcpusage=0
time.now=1715764794.756138
time.up=89389.421574
time.elapsed=89389.421574

gthess commented 5 months ago

SERVFAIL can mean a lot of things. It indicates that Unbound itself is SERVFAILing and cannot continue resolving for reasons. One of the reasons could be the inability to reach upstream servers, but with your configuration maybe that is not the case for this example. If you have logs for that we can have a closer look.

Trying to get rid of all SERVFAILs from Unbound is not realistic because the probability for the Internet to be broken at a given time is high.

The only issue I see in your case is Unbound blacklisting specific domains as unresolvable over a period of upstream breakage. That is solved with a lower value of infra-cache-max-rtt (lower than the default 120000, I would use 5000) and infra-keep-probing. You would still get SERVFAILs if the upstream is not responding but they should be handled by the client.
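As an unbound.conf fragment, the suggestion would look roughly like this (a sketch; 5000 ms is the value suggested above, against the 120000 ms default):

```
server:
    # keep probing upstreams that have been marked down
    infra-keep-probing: yes
    # cap the back-off well below the 120000 ms default so broken
    # upstreams are not written off for long stretches
    infra-cache-max-rtt: 5000
```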

But my question still holds. For the "upstream unavailable" scenario, what would the spam client do with such a long delay of response with your configuration?

catap commented 5 months ago

SERVFAIL can mean a lot of things. It indicates that Unbound itself is SERVFAILing and cannot continue resolving for reasons. One of the reasons could be the inability to reach upstream servers, but with your configuration maybe that is not the case for this example. If you have logs for that we can have a closer look.

I agree with that. Also, dnsbl.sorbs.net, which is the one failing, uses 15 (!) different nameservers. My configuration uses a lot of different DNS black and white lists (about 40, I guess) and this one produces the majority of issues. In the last 24h only this one led to SERVFAIL; all the others were OK.

So, I have enabled logging; let's see how it turns out.

Trying to get rid of all SERVFAILs from Unbound is not realistic because the probability for the Internet to be broken at a given time is high.

Sure, but if the internet were broken it should fail on different lists, not on one specific list. Also, I run two MXes in different countries, which use the same software and have the same issues with that DNS list.

Thus, I may switch to powerdns-recursor instead of Unbound on the same machines, to compare how it behaves.

The only issue I see in your case is Unbound blacklisting specific domains as unresolvable over a period of upstream breakage. That is solved with a lower value of infra-cache-max-rtt (lower than the default 120000, I would use 5000) and infra-keep-probing. You would still get SERVFAILs if the upstream is not responding but they should be handled by the client.

But my question still holds. For the "upstream unavailable" scenario, what would the spam client do with such a long delay of response with your configuration?

The client is an OpenSMTPD filter which blocks the SMTP session. A delay of a minute or two is OK for normal SMTP servers; they wait. If they need to wait too long, they give up and retry delivering the email in 5-15-60 minutes.

gthess commented 5 months ago

The client is an OpenSMTPD filter which blocks the SMTP session. A delay of a minute or two is OK for normal SMTP servers; they wait. If they need to wait too long, they give up and retry delivering the email in 5-15-60 minutes.

So I believe just tweaking infra-cache-max-rtt and infra-keep-probing should work and prevent Unbound from marking the upstreams as unresponsive. And then let OpenSMTPD retry in 5-15-60 as configured.

Then it is the upstream's responsibility to keep their service running.

If you want to tweak further with outbound-msg-retry and max-sent-count to keep the delay to a maximum that could work in this case but I expect slowness in other generic resolution attempts.

catap commented 5 months ago

Then it is the upstream's responsibility to keep their service running.

I still think that it is an Unbound bug we're facing here. I run a local DNSBL on the same machine, and one of the servers hit SERVFAIL resolving against the local machine.

From mail server point of view it was:

May 15 13:26:46 mx2 smtpd[41279]: 131528ddc2a95092 smtp connected address=192.124.216.130 host=free.gbnhost.com
May 15 13:27:09 mx2 smtpd[63549]: dnsbl: 131528ddc2a95092 DNS error 2 on wl.local
May 15 13:27:10 mx2 smtpd[63549]: dnsbl: 131528ddc2a95092 DNS error 2 on rbl.0spam.org
May 15 13:27:10 mx2 smtpd[41279]: 131528ddc2a95092 smtp disconnected reason=disconnect

let me get the logs between that time: unbound.log.gz

Thus, the second strange thing is that it gave up in less than 30 seconds, even with such aggressive settings. Odd, isn't it?

gthess commented 5 months ago

I may have found something; can you try with infra-cache-max-rtt: 2000 to see if it is still happening while I dig around?

catap commented 5 months ago

@gthess sure, just deployed.

CompuRoot commented 5 months ago

Then it is the upstream's responsibility to keep their service running.

That's what I have been trying to say multiple times: it is their responsibility to handle load, and if their DNS servers fall to their knees, it is their problem; it is not up to a DNS client to help them. They charge for their service, as well as getting a free graph of communications, and they should improve the service on their own.

Unbound can't save the world by playing nice with remote servers. It takes tens of lines of code to make a custom DNS client that definitely will never play nice with remote DNS servers, yet only Unbound tries to ban its own users in an attempt to save providers who won't invest in their infrastructure and implement DDoS mitigation. My point is: there is no need to help others whom Unbound can't control.

If one sends a delivery boy to a store to get food and that store is temporarily closed, it isn't the delivery boy's responsibility to give up and stop checking whether the store has opened; he should keep knocking on the door until he gets what he was sent for... otherwise other delivery boys will be hired.

catap commented 5 months ago

I may have found something; can you try with infra-cache-max-rtt: 2000 to see if it is still happening while I dig around?

Another strange thing just happened (with the requested changes):

May 15 14:39:40 mx2 smtpd[41279]: 131528fdc9d216fa smtp connected address=146.185.239.142 host=nourishnotion.online
May 15 14:39:40 mx2 smtpd[63549]: dnsbl: 131528fdc9d216fa DNS error 2 on b.barracudacentral.org
May 15 14:39:40 mx2 smtpd[41279]: 131528fdc9d216fa smtp disconnected reason=disconnect

Here all logs for that second: unbound.log.gz

catap commented 5 months ago

Unbound can't save the world by playing nice with remote servers. It takes tens of lines of code to make a custom DNS client that definitely will never play nice with remote DNS servers, yet only Unbound tries to ban its own users in an attempt to save providers who won't invest in their infrastructure and implement DDoS mitigation. My point is: there is no need to help others whom Unbound can't control.

Things are worse here. As a programmer, I expect that when I call gethostbyname() I'll get a response. It might block for a while, but the response will come. And I won't try to second-guess the resolver logic and retry one more time; I make that call assuming the resolver did the best it could.

Nginx, Varnish, and OpenBSD's relayd all use gethostbyname() to look up upstream IP addresses on start. And if they can't, they simply won't start.

Current Unbound logic means that if I use it on my server, then after a reboot nginx/varnish/relayd/anything else won't start because Unbound decided to give up.

mnordhoff commented 5 months ago

Have any of y'all contacted sorbs.net about their seemingly broken and inconsistently configured DNS?

https://dnsviz.net/d/sorbs.net/ZkS6vA/dnssec/ https://dnsviz.net/d/dnsbl.sorbs.net/ZkS83w/dnssec/

catap commented 5 months ago

Have any of y'all contacted sorbs.net about their seemingly broken and inconsistently configured DNS?

https://dnsviz.net/d/sorbs.net/ZkS6vA/dnssec/ https://dnsviz.net/d/dnsbl.sorbs.net/ZkS83w/dnssec/

DNSSEC is usually broken for DNS list zones, for example https://dnsviz.net/d/combined.mail.abusix.zone/dnssec/ and it works well.

CompuRoot commented 5 months ago

Have any of y'all contacted sorbs.net about their seemingly broken and inconsistently configured DNS?

https://dnsviz.net/d/sorbs.net/ZkS6vA/dnssec/ https://dnsviz.net/d/dnsbl.sorbs.net/ZkS83w/dnssec/

Again, it is THEIR problem! The problem we are discussing here is the client, Unbound, which voluntarily took on the role of judge and world-saver by refusing to do what it must do: resolve, regardless of how bad the partners it has to deal with are.

BTW, barracuda looks fine here: https://dnsviz.net/d/barracudacentral.org/ZakLGg/dnssec/ but it is still periodically banned by Unbound with SERVFAIL

mnordhoff commented 5 months ago

DNSSEC is usually broken for DNS list zones, for example https://dnsviz.net/d/combined.mail.abusix.zone/dnssec/ and it works well.

DNSSEC should not be broken for any zone. Of course, many zones have it off. Despite the URL, DNSViz shows many problems that are not related to DNSSEC.

CompuRoot commented 5 months ago

Current unbound logic means that if I use it on my server, after reboot nginx/varnish/relayd/something else won't start because Unbound decided to gave up.

Not only servers, but also border gateways/firewalls/routers. As an example, the popular projects pfSense/OPNsense use Unbound as the default resolver. Implementing DNS-based filters there leads to constant banning of resources that shouldn't be banned, and the root of the problem is the same: the DNS client logic.

catap commented 5 months ago

Not only servers, but also border gateways/firewalls/routers. As an example, the popular projects pfSense/OPNsense use Unbound as the default resolver. Implementing DNS-based filters there leads to constant banning of resources that shouldn't be banned, and the root of the problem is the same: the DNS client logic.

Do not forget that there is also libunbound, which is used as an embedded resolver, which makes the impact much larger.

catap commented 5 months ago

@mnordhoff meanwhile I have another example, where the .local zone returns an error:

May 15 16:26:31 mx2 smtpd[41279]: 1315294d21d24725 smtp connected address=168.245.125.197 host=o1361.shared.klaviyomail.com
May 15 16:26:43 mx2 smtpd[63549]: dnsbl: 1315294d21d24725 DNS error 2 on bl.local
May 15 16:26:43 mx2 smtpd[63549]: dnsbl: 1315294d21d24725 DNS error 2 on bl.spamcop.net
May 15 16:27:03 mx2 smtpd[63549]: dnsbl: 1315294d21d24725 DNS error 2 on dnsbl.dronebl.org
May 15 16:27:04 mx2 smtpd[63549]: dnsbl: 1315294d21d24725 DNS error 2 on mail-abuse.blacklist.jippg.org
May 15 16:27:05 mx2 smtpd[63549]: dnsbl: 1315294d21d24725 DNS error 2 on v4.pofon.foobar.hu
May 15 16:27:07 mx2 smtpd[41279]: 1315294d21d24725 smtp tls ciphers=TLSv1.3:TLS_AES_256_GCM_SHA384:256
May 15 16:27:11 mx2 smtpd[41279]: 1315294d21d24725 smtp message msgid=f4ec4f16 size=44081 nrcpt=1 proto=ESMTP
May 15 16:27:11 mx2 smtpd[41279]: 1315294d21d24725 smtp envelope evpid=f4ec4f1616ed2a9e ...
May 15 16:27:12 mx2 smtpd[41279]: 1315294d21d24725 smtp disconnected reason=quit

It is running right now with infra-cache-max-rtt: 2000. The logs: unbound.log.gz

gthess commented 5 months ago

My reading of the code was incorrect previously. Setting infra-cache-max-rtt to low values is asking for trouble since this is the cap of the dynamic RTT calculation; servers reaching it are deemed not usable. These are the strange results you are seeing in the latest logs. For your case I would try infra-host-ttl: 5 so that servers are retried roughly when the 2-second timeout (together with the previous incremental tries) is reached. So I would use:

    qname-minimisation: no
    aggressive-nsec: no

    infra-keep-probing: yes
    # infra-cache-min-rtt: 1000
    infra-cache-max-rtt: 2000
    infra-host-ttl: 5  # this is in seconds

    # outbound-msg-retry: 32
    # max-sent-count: 128

If you are only dealing with timeouts there is no point in changing outbound-msg-retry and max-sent-count since timeouts are tracked independently of those counters and have way lower values than what is configured here (those counters are not solely for timeouts).

You can try the above configuration and if it works try to bump infra-host-ttl to higher values as 5 seconds is quite aggressive. You lose information from all your upstreams and Unbound has to discover RTTs again.

catap commented 5 months ago

@gthess before going your way I'd like to give Unbound the benefit of the doubt, so I've installed powerdns-recursor on both tested servers. I've used the default config from OpenBSD with one change:

forward-zones=local=127.0.0.2

to allow using the local rbldnsd.

The first impression, compared with Unbound: it simply works. I didn't need to spend time debugging to figure out that I should add something similar to this to the config:

    domain-insecure: "local."
    private-domain: "local."
    do-not-query-localhost: no

So, let's see how it works in real life.

catap commented 5 months ago

The first impression, compared with Unbound: it simply works. I didn't need to spend time debugging to figure out that I should add something similar to this to the config:

  domain-insecure: "local."
  private-domain: "local."
  do-not-query-localhost: no

So, let's see how it works in real life.

and within a couple of minutes it started to complain about DNSSEC for that domain. I'm too lazy to dig into it tonight, so I'm deploying the settings from https://github.com/NLnetLabs/unbound/issues/362#issuecomment-2115159305

catap commented 5 months ago

You can try the above configuration and if it works try to bump infra-host-ttl to higher values as 5 seconds is quite aggressive. You lose information from all your upstreams and Unbound has to discover RTTs again.

The suggested configuration cut the number of errors roughly in half. Deployed:

    aggressive-nsec: no
    qname-minimisation: no

    infra-keep-probing: yes
    infra-cache-max-rtt: 2000
    infra-host-ttl: 10

catap commented 5 months ago

No more DNS errors on my setup. I've increased infra-host-ttl to 15.

catap commented 5 months ago

No more DNS errors on my setup. I've increased infra-host-ttl to 15.

which brought the errors back. I'll run it with infra-host-ttl: 10 for a while without touching it.

catap commented 5 months ago

Something is still puzzling me:

May 20 22:32:26 mx2 smtpd[35363]: e1179a8125981021 smtp connected address=199.185.178.25 host=mail.openbsd.org
May 20 22:32:26 mx2 smtpd[92952]: dnsbl: e1179a8125981021 DNS error 2 on dnsbl.spfbl.net
May 20 22:32:30 mx2 smtpd[35363]: e1179a8125981021 smtp tls ciphers=TLSv1.3:TLS_AES_256_GCM_SHA384:256

Why does Unbound reply almost immediately, and not make any retries?

gthess commented 5 months ago

If all upstreams are known to be bad, Unbound returns a SERVFAIL straight away; this is where the low infra-host-ttl helps, by forcing Unbound to refresh data more frequently. With infra-cache-max-rtt set at 2 seconds it is easy for an upstream to reach that limit; maybe you should consider raising that value. You can check with unbound-control dump_infra what Unbound has stored for each known upstream: rtt is the measured value, rto is the value with Unbound's exponential back-off applied.

catap commented 5 months ago

It seems all the wow effect was due to less traffic over the weekend. Right now the errors are back. And it is much worse than with the settings:

    infra-keep-probing: yes
    infra-cache-max-rtt: 2000
    infra-host-ttl: 10
    qname-minimisation: no
    aggressive-nsec: no

Worse means that before I had something between 10-20 errors per 24h; right now, for 18h of logs, I have more than 50, and all of them show "spike" behaviour. It isn't one error, it is a few, like 2-5 within about a minute. For example, the errors for the last 30 minutes look like:

May 21 11:01:41 mx1 smtpd[44773]: dnsbl: 0ef9c3c7d738da68 DNS error 2 on bl.spamcop.net
May 21 11:03:27 mx1 smtpd[44773]: dnsbl: 0ef9c3ccca541c17 DNS error 2 on bl.spamcop.net
May 21 11:03:43 mx1 smtpd[44773]: dnsbl: 0ef9c3d1c13e0ff7 DNS error 2 on b.barracudacentral.org
May 21 11:03:44 mx1 smtpd[44773]: dnsbl: 0ef9c3ccca541c17 DNS error 2 on b.barracudacentral.org

and the SMTP session was:

May 21 11:03:38 mx1 smtpd[27969]: 0ef9c3d1c13e0ff7 smtp connected address=162.243.37.58 host=upt.c.new-york-f6bb7d19
May 21 11:03:43 mx1 smtpd[44773]: dnsbl: 0ef9c3d1c13e0ff7 DNS error 2 on b.barracudacentral.org
May 21 11:03:43 mx1 smtpd[27969]: 0ef9c3d1c13e0ff7 smtp tls ciphers=TLSv1.3:TLS_AES_256_GCM_SHA384:256
May 21 11:03:43 mx1 smtpd[27969]: 0ef9c3d1c13e0ff7 smtp disconnected reason=quit

This means that no more than 5 seconds after the connect Unbound decided it should return SERVFAIL, which seems quite odd, doesn't it?

Anyway, an output of dump is here:

168.215.181.4 gbudb.net. expired rto 2000
113.52.8.50 dnsbl.sorbs.net. expired rto 2000
185.136.98.88 catap.net. ttl 10 ping 0 var 72 rtt 288 rto 288 tA 0 tAAAA 0 tother 0 ednsknown 1 edns 0 delay 0 lame dnssec 0 rec 0 A 0 other 0
185.136.99.88 catap.net. ttl 4 ping 0 var 72 rtt 288 rto 288 tA 0 tAAAA 0 tother 0 ednsknown 1 edns 0 delay 0 lame dnssec 0 rec 0 A 0 other 0
211.136.17.105 chinamobile.com. expired rto 2000
194.134.35.168 dnsbl.sorbs.net. expired rto 2000
192.55.83.30 net. ttl 10 ping 1 var 74 rtt 297 rto 297 tA 0 tAAAA 0 tother 0 ednsknown 1 edns 0 delay 0 lame dnssec 0 rec 0 A 0 other 0
185.136.96.88 catap.net. ttl 4 ping 0 var 56 rtt 224 rto 224 tA 0 tAAAA 0 tother 0 ednsknown 1 edns 0 delay 0 lame dnssec 0 rec 0 A 0 other 0
74.208.146.124 dnsbl.sorbs.net. expired rto 2000
168.215.181.6 gbudb.net. expired rto 2000
199.30.59.107 justspam.org. expired rto 2000
87.106.246.154 dnsbl.sorbs.net. expired rto 2000
89.150.195.2 dnsbl.sorbs.net. expired rto 2000

So, I've deployed:

    qname-minimisation: no
    aggressive-nsec: no

    infra-keep-probing: yes
    infra-cache-max-rtt: 2000

    outbound-msg-retry: 32
    max-sent-count: 128

gthess commented 5 months ago

This means that no more than 5 seconds after the connect Unbound decided it should return SERVFAIL, which seems quite odd, doesn't it?

I don't think it is odd: the max rtt is 2 seconds, and with the retries on timeout the cap is reached and the server is considered down. So I guess what happened is that queries went out with timeouts of 376 ms (the default when Unbound has no information about a server), 752 ms, and 1504 ms. They all timed out, and the next timeout would have been over the configured max (2000), so it was capped at 2000. Then the server is considered down for further queries until it is reprobed, at the earliest after infra-host-ttl.
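That progression can be sketched as a plain doubling capped at infra-cache-max-rtt (an illustration of the numbers above, not the actual Unbound code):

```python
def timeout_schedule(start_ms: int, cap_ms: int, tries: int):
    # Double the timeout after each failed try, capping at the
    # configured infra-cache-max-rtt.
    timeout, schedule = start_ms, []
    for _ in range(tries):
        schedule.append(min(timeout, cap_ms))
        timeout *= 2
    return schedule

print(timeout_schedule(376, 2000, 4))  # -> [376, 752, 1504, 2000]
```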

That is why I believe setting outbound-msg-retry and max-sent-count would not make any significant changes for timeouts.


The errors are not a problem as I understand it: each upstream server will be retried at least every 10 seconds (infra-host-ttl). For those 10 seconds, servers that are marked as down will likely stay down for Unbound.

OpenSMTPD will then get to retry after 5-15-60 as you shared. Likely, in the next 5 minutes when the first retry happens, the upstream could be up again.

So a low infra-host-ttl value makes sure that Unbound does not consider a server down for long. With the default configuration (900 seconds) I guess the servers were considered down for the first two retries of OpenSMTPD, and then the third, longer one, after an hour, would eventually succeed or not.

catap commented 5 months ago

The errors are not a problem as I understand it: each upstream server will be retried at least every 10 seconds (infra-host-ttl). For those 10 seconds, servers that are marked as down will likely stay down for Unbound.

OpenSMTPD will then get to retry after 5-15-60 as you shared. Likely, in the next 5 minutes when the first retry happens, the upstream could be up again.

The errors are the problem. Let me try again.

DNS is a critical part of the anti-spam infrastructure. It is used to check the sender's IP and domain against black and white lists.

An MTA / LDA uses gethostbyname() to perform such queries, which should return NXDOMAIN if the record doesn't exist in the list, or an IP address otherwise. Different lists have different policies regarding the returned IP addresses, but that's not important here.
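For illustration, an RBL lookup of this kind can be sketched as follows (the zone name is a placeholder, and is_listed is a hypothetical helper, not code from any MTA; EAI_NONAME is the getaddrinfo-level signal for NXDOMAIN):

```python
import socket

def rbl_query_name(ip, zone="dnsbl.example.org"):
    """Build the DNSBL query name: reversed IPv4 octets plus the list zone."""
    return ".".join(reversed(ip.split("."))) + "." + zone

def is_listed(ip, zone="dnsbl.example.org"):
    """True if listed, False on NXDOMAIN; a temporary resolver failure
    (the SERVFAIL case discussed in this thread) raises instead."""
    try:
        socket.gethostbyname(rbl_query_name(ip, zone))
        return True
    except socket.gaierror as e:
        if e.errno == socket.EAI_NONAME:  # NXDOMAIN: not listed
            return False
        raise  # temporary failure: no listing decision is possible
```

The raise branch is the ambiguous case: the filter gets neither "listed" nor "not listed", only an error.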

What is important is that if the DNS resolver decides to return an error, it leads to an edge case with two possible outcomes, depending on the system design.

  1. Email is temporarily rejected and the remote server will try to deliver it again. This may happen in 15 minutes, or in 1 hour, or tomorrow; nobody knows how the remote system is configured.
  2. Email is delivered with a false positive or a false negative, depending on which requests had errors: the blacklist or the whitelist ones.

Today's email is often used to log in or to confirm some action with a code or link that has a short TTL, like 15 minutes, and (1) makes that almost impossible to use.

On the other hand, (2) leads to sneaky spam and fraud emails, which at the very least annoy users or create potential security issues.

So, DNS errors are an issue here.

Everyone understands that DNS might be unstable and a very small level of errors can happen, but that should be an exceptional case, not usual and expected behavior.

Frankly speaking, the current Unbound logic makes gethostbyname() absolutely untrustworthy: on an application error, one has to re-run it multiple times to be absolutely sure that the remote server really doesn't work and that the DNS resolver hasn't simply decided to rest.

Such behavior forces an extraordinary amount of software to be redesigned to be compliant with Unbound's behavior, which seems quite weird, especially since gethostbyname() may block the application for quite a long time to get a valid response.

catap commented 5 months ago

That is why I believe setting outbound-msg-retry and max-sent-count would not make any significant changes for timeouts.

Anyway, as soon as I deployed it, no more errors happened. And right now I've added infra-host-ttl: 0 and have:

    qname-minimisation: no
    aggressive-nsec: no

    infra-keep-probing: yes
    infra-cache-max-rtt: 2000
    infra-host-ttl: 0

    outbound-msg-retry: 32
    max-sent-count: 128
CompuRoot commented 5 months ago

OpenSMTPD will then get to retry after 5-15-60 as you shared.

How can a receiver ask for a retry? All of this antispam filtering happens during the handshake with the remote MTA. The problem isn't on sending but on receiving email by the MTA. The receiver uses DNS to classify the remote connection, comparing it against multiple antispam databases to decide whether the connection should be rejected immediately, accepted and marked as spam, or classified as legitimate email and delivered to the inbox. 5-15-60 is related to "greylisting": replying to the remote with error 451, which delays email delivery for no reason because the DNS resolver failed.

catap commented 5 months ago

5-15-60 is related to "greylisting": replying to the remote with error 451, which delays email delivery for no reason because the DNS resolver failed

Not only greylisting: such an error can also come from a filter doing RBL lookups and rejecting with a 451 error code if DNS had issues, for example.

Either way, it leads to greylisting-like delays that are not under anyone's control and are consequences of the Unbound logic.

catap commented 5 months ago

Just a bit of statistics. The settings:

    qname-minimisation: no
    aggressive-nsec: no

    infra-keep-probing: yes
    infra-cache-max-rtt: 2000
    infra-host-ttl: 0

    outbound-msg-retry: 32
    max-sent-count: 128

lead to 9 DNS errors in 2298 SMTP connections. By comparison, the suggested settings:

    infra-keep-probing: yes
    infra-cache-max-rtt: 2000
    infra-host-ttl: 10
    qname-minimisation: no
    aggressive-nsec: no

lead to more than 5x the number of DNS errors for nearly the same number of SMTP connections.

It isn't ideal, and it means I have ~0.4% false-positive or false-negative antispam filtering results, but it's better than before, indeed.
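For reference, the arithmetic behind these figures (assuming "more than 5x" means at least 45 errors over a similar connection count):

```python
# Error rates from the statistics quoted above.
errors, connections = 9, 2298
print(f"{errors / connections:.2%}")      # ~0.39%, the quoted ~0.4%
print(f"{5 * errors / connections:.2%}")  # ~1.96% with the suggested settings
```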

So, if @gthess or @halderen has further suggestions, I'd love to test them.

gthess commented 5 months ago

OK, the part I didn't get the first time is that I thought the retries were coming from the local OpenSMTPD checking for spam; thanks for clarifying.

So if I understand correctly, you would like Unbound to get stuck on non-responding upstreams until they become available again, so that such errors are likely not produced while all the upstreams are unavailable. Then I think leaving infra-cache-max-rtt at the default value (12 seconds), or even higher if you want, raising infra-cache-min-rtt to 2000 (so that Unbound waits at least 2 seconds for a single timeout), and using infra-host-ttl: 5 and infra-keep-probing: yes would achieve the same.

Setting infra-host-ttl: 0 will always make Unbound start from scratch when talking to servers that are a bit slower than the default 376 ms to respond.

So you could try and see :)

Btw I still believe that DNS errors are OK and should be handled; if a service is down there is no way around it. Now you have configured Unbound to get stuck on non-responsive upstreams in the hope of them coming back up while you hold on to a query.

(Also, heads up when you update to 1.20.0, as new options were introduced that may break your solution, namely discard-timeout and wait-limit. You may want to turn those off or tweak them if your Unbound is not an open-ish resolver.)

Back to the issue: the only problem I see with this email scenario is that Unbound may keep a whole zone down for long because all the upstreams are unresponsive. Maybe we need a change for when all the upstreams of a certain zone are considered down, but I have to think about it.

catap commented 5 months ago

So if I understand correctly, you would like Unbound to get stuck on non-responding upstreams until they become available again, so that such errors are likely not produced while all the upstreams are unavailable.

Well, on the other hand, it's better to give up on such a request after a large timeout, like a couple of minutes; otherwise it is a kind of DDoS against Unbound.

Setting infra-host-ttl: 0 will always make Unbound start from scratch when talking to servers that are a bit slower than the default 376 ms to respond.

Yeah, it was a move to make it dumb, and for now it has the best outcome, BTW.

So you could try and see :)

Moved to settings:

    qname-minimisation: no
    aggressive-nsec: no

    infra-keep-probing: yes
    infra-cache-min-rtt: 2000
    infra-host-ttl: 5

Let's see how it goes.

Btw I still believe that DNS errors are OK and should be handled; if a service is down there is no way around it. Now you have configured Unbound to get stuck on non-responsive upstreams in the hope of them coming back up while you hold on to a query.

I've just checked the Linux and OpenBSD man pages for gethostbyname and gethostbyaddr. Such a call may return TRY_AGAIN with the explanation:

This is usually a temporary error and means that the local server did not receive a response from an authoritative server. A retry at some later time may succeed.

So, the question is about the fraction of these errors. The case of a DNS resolver for antispam has one limitation: there is no "some later time". A usual SMTP session lasts a couple of minutes, and I simply can't add any retry inside the code, because I probably haven't got any time left to do it.
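For completeness, the retry pattern the man pages suggest would look roughly like this (a hypothetical sketch, not code from any MTA; EAI_AGAIN is the getaddrinfo-level counterpart of TRY_AGAIN). Note it can block for up to the whole budget, which is exactly the time an SMTP session may not have:

```python
import socket
import time

def lookup_with_deadline(name, budget_s=30.0, pause_s=2.0):
    """Retry a lookup on temporary failures until the time budget runs out."""
    deadline = time.monotonic() + budget_s
    while True:
        try:
            return socket.gethostbyname(name)
        except socket.gaierror as e:
            # Give up on permanent errors, or when another pause would
            # overrun the budget; otherwise wait and retry.
            if e.errno != socket.EAI_AGAIN or time.monotonic() + pause_s > deadline:
                raise
            time.sleep(pause_s)
```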

So, if we have a few errors per day, I guess it's better to ignore them and accept a very small fraction of false-positive or false-negative decisions.

If larger quantities come from one DNS zone, it's better to call that zone broken and switch it off for a while or completely.

Thus, I see one more case where such behaviour of Unbound leads to an issue.

Let's assume we run Nginx, Varnish, or any other proxy server configured with an upstream identified by a DNS name. Such a proxy server resolves DNS on start, and if DNS fails, it won't start. And this isn't an edge case; I'd call it a rather common case, because guys from the "bot protection industry" usually ask to add their upstream to the reverse proxy as a DNS name.

(Also heads up when you update to 1.20.0 as new options were introduced that may fail your solution namely discard-timeout and wait-limit. You may want to turn those off or tweak them if your Unbound is not an open-ish resolver.)

Thanks, noted in the comments of my config. OpenBSD has upgraded Unbound to 1.19.3, so I'll probably face that issue in a year or two.

Thus, I really think that the kind of Unbound setup we're discussing here should at least be documented, to save people from spending a lot of their and your time figuring out what is going on, and probably to keep some users on Unbound.

Anyway, I plan to document my mail setup and keep it updated as soon as it is stable enough, but I really think such a setup isn't unique and should be documented in Unbound as well.