NLnetLabs / unbound

Unbound is a validating, recursive, and caching DNS resolver.
https://nlnetlabs.nl/unbound
BSD 3-Clause "New" or "Revised" License
3.06k stars 349 forks source link

configured forward zone occasionally not resolved through defined forward-addr server #451

Open bjacke opened 3 years ago

bjacke commented 3 years ago

I have a forwarding zone defined like this:

domain-insecure: "private" forward-zone: name: "example.private" forward-addr: 10.0.0.10 forward-addr: 10.0.0.20

this is working well most of the time but sometimes DNS queries from the example.private zone trigger unbomund to resolve the name not via the forwarding DNS servers but they are being queried upstream resulting in NXDOMAIN replies. After an hour or so the forwarder is queried correctly again. A network trace also shows that the forwarder was not asked at all when the NXDOMAIN issue starts to pop up. This happened with unbomun from Deban stetch and also with Debian buster, which ships unbound 1.9.0. I couldn't find any bug report or fix in a later version addressing the issue that I see.

wcawijngaards commented 3 years ago

If you had forward-first enabled, then this would be exactly that option. Otherwise I also do not see immediately a bugfix for this, could you try with a more up to date version, and could you provide a verbosity 4 logs of what happens when it goes wrong?

The fact that it waits for an hour is likely because the forwarder addresses become unreachable, and are marked as such in the infra cache. It then waits for the infra cache entries to time out before asking those forwarders again. This means the forwarders are failing to reply for long periods. When all the forwarders fail it should reply with that it failed, servfail, to the client. The client may then try again to a different DNS resolver, by the way.

bjacke commented 3 years ago

forward-first was not used there. We've set forward-no-cache: yes there for now, hoping that this mitigates the problem so that the NXDOMAIN will at least not persist for a longer time. We'll increase the verbosity level and I will keep you updated if we find out anything new here.

ciroiriarte commented 3 years ago

Hello!,

I believe I'm seeing the same issue with OPNSense virtual appliance and unbound 1.13.1. I've setup what they call "domain override" which basically forwards all the queries for a domain to specific servers.

Then I dropped that configuration and moved to a stub domain configuration (as an advanced/by-hand setup not using their GUI):

`server:

Allow private IPs in the answers

    private-domain: "ipa.domain.lab."
    # Skip DNSSEC validation for this domain
    domain-insecure: "ipa.domain.lab."

stub-zone:

propper delegation

    name: "ipa.domain.lab."
    stub-addr: 10.X.Y.21
    stub-addr: 10.X.Y.22
    stub-addr: 2803:A:B:C:1:ffff:0:21
    stub-addr: 2803:A:B:C:1:ffff:0:22
    stub-first: no

`

The behaviour didn't change, it works until it doesn't. If I restart the service, it works again for some time.

https://github.com/opnsense/core/issues/5121

ciroiriarte commented 3 years ago

forward-first was not used there. We've set forward-no-cache: yes there for now, hoping that this mitigates the problem so that the NXDOMAIN will at least not persist for a longer time. We'll increase the verbosity level and I will keep you updated if we find out anything new here.

did the workaround work for you?, just found "stub-no-cache". Will try that for my stub configuration and see if it helps (giving up cache & hitting always the authoritative)

bjacke commented 3 years ago

did the workaround work for you?, just found "stub-no-cache". Will try that for my stub configuration and see if it helps (giving up cache & hitting always the authoritative)

it was at a customer's site. I didn't hear back from the problem. So I can only assume that forward-no-cache fixed it.

wcawijngaards commented 3 years ago

I notice the stub and forward both used. If you used a stub zone, and unbound received a delegation, NS records, from the server, unbound would then use those NS records to fetch data from, for the duration of that TTL. But if you use a forward zone, unbound continues to ask those forward servers for the information. Use of a stub when you wanted a forward? But then another user says forward-no-cache fixes it. What the no-cache options do is force unbound to use recursion instead of the cache every time.

vegaaz commented 2 years ago

Hi I am experiencing the same issue. The problem occurs in pfSense Version 2.6.0.

Trying the 'forward-no-cache' option... Thanks for the advices!

Edit: The 'forward-no-cache' option did not fix the behavior on my site. Trying as 'stub-zone'... :)

msoltyspl commented 2 years ago

We noticed probably the same issue with unbound while trying to upgrade from 1.6.0 (old version in debian stretch - where with identical config we have never had any issues) to 1.13.1 (version in debian bullseye at the moment) - and were blocked in the process, as we started getting random NXDOMAIN replies after unbound suddenly decides to ignore forward-addr: and resolve the names itself.

Relevant part of configuration (also no stub-zones under that name):

server:
  domain-insecure: "packet.service.internal"
forward-zone:
  name: "packet.service.internal"
  forward-addr: 10.190.10.1
  forward-addr: 10.190.0.2

Example of such behavior:

Jul 21 15:22:14 unbound[1009:2] query: 10.13.37.254 ipa01.packet.service.internal. A IN
Jul 21 15:22:14 unbound[1009:2] debug: subnet[module 0] operate: extstate:module_state_initial event:module_event_new
Jul 21 15:22:14 unbound[1009:2] info: subnet operate: query ipa01.packet.service.internal. A IN
Jul 21 15:22:14 unbound[1009:2] debug: validator[module 1] operate: extstate:module_state_initial event:module_event_pass
Jul 21 15:22:14 unbound[1009:2] info: validator operate: query ipa01.packet.service.internal. A IN
Jul 21 15:22:14 unbound[1009:2] debug: iterator[module 2] operate: extstate:module_state_initial event:module_event_pass
Jul 21 15:22:14 unbound[1009:2] info: resolving ipa01.packet.service.internal. A IN
Jul 21 15:22:14 unbound[1009:2] info: finishing processing for ipa01.packet.service.internal. A IN
Jul 21 15:22:14 unbound[1009:2] debug: validator[module 1] operate: extstate:module_wait_module event:module_event_moddone
Jul 21 15:22:14 unbound[1009:2] info: validator operate: query ipa01.packet.service.internal. A IN
Jul 21 15:22:14 unbound[1009:2] info: validate(nxdomain): sec_status_secure
Jul 21 15:22:14 unbound[1009:2] info: validation success ipa01.packet.service.internal. A IN
Jul 21 15:22:14 unbound[1009:2] debug: subnet[module 0] operate: extstate:module_wait_module event:module_event_moddone
Jul 21 15:22:14 unbound[1009:2] info: subnet operate: query ipa01.packet.service.internal. A IN
Jul 21 15:22:14 unbound[1009:2] reply: 10.13.37.254 ipa01.packet.service.internal. A IN NXDOMAIN 0.000000 0 131

It works fine for a bit after restarting unbound:

Jul 21 15:08:35 unbound[1009:0] query: 10.13.37.254 ovirt.infra.packet.service.internal. A IN
Jul 21 15:08:35 unbound[1009:0] debug: subnet[module 0] operate: extstate:module_state_initial event:module_event_new
Jul 21 15:08:35 unbound[1009:0] info: subnet operate: query ovirt.infra.packet.service.internal. A IN
Jul 21 15:08:35 unbound[1009:0] debug: validator[module 1] operate: extstate:module_state_initial event:module_event_pass
Jul 21 15:08:35 unbound[1009:0] info: validator operate: query ovirt.infra.packet.service.internal. A IN
Jul 21 15:08:35 unbound[1009:0] debug: iterator[module 2] operate: extstate:module_state_initial event:module_event_pass
Jul 21 15:08:35 unbound[1009:0] info: resolving ovirt.infra.packet.service.internal. A IN
Jul 21 15:08:35 unbound[1009:0] info: processQueryTargets: ovirt.infra.packet.service.internal. A IN
Jul 21 15:08:35 unbound[1009:0] info: sending query: ovirt.infra.packet.service.internal. A IN
Jul 21 15:08:35 unbound[1009:0] debug: sending to target: <packet.service.internal.> 10.190.10.1#53
Jul 21 15:08:35 unbound[1009:0] debug: iterator[module 2] operate: extstate:module_wait_reply event:module_event_reply
Jul 21 15:08:35 unbound[1009:0] info: iterator operate: query ovirt.infra.packet.service.internal. A IN
Jul 21 15:08:35 unbound[1009:0] info: response for ovirt.infra.packet.service.internal. A IN
Jul 21 15:08:35 unbound[1009:0] info: reply from <packet.service.internal.> 10.190.10.1#53
Jul 21 15:08:35 unbound[1009:0] info: query response was ANSWER
Jul 21 15:08:35 unbound[1009:0] info: finishing processing for ovirt.infra.packet.service.internal. A IN
Jul 21 15:08:35 unbound[1009:0] debug: validator[module 1] operate: extstate:module_wait_module event:module_event_moddone
Jul 21 15:08:35 unbound[1009:0] info: validator operate: query ovirt.infra.packet.service.internal. A IN
Jul 21 15:08:35 unbound[1009:0] debug: subnet[module 0] operate: extstate:module_wait_module event:module_event_moddone
Jul 21 15:08:35 unbound[1009:0] info: subnet operate: query ovirt.infra.packet.service.internal. A IN
Jul 21 15:08:35 unbound[1009:0] reply: 10.13.37.254 ovirt.infra.packet.service.internal. A IN NOERROR 0.158096 0 186

The logs above are at verbosity: 3 with most log-* settings enabled.

I have verbosity:5 logs as well (though it's nearly 200mb of text from 4 minutes of unbound's activity), but maybe that would be of some help.

Notable thing - when I was trying to reproduce it on a separate machine and I was the only client using unbound on it, I was never able to reproduce it. When I swapped that machine to server as a primary dns - it happenes within a few seconds usually. So it would suggest it's somehow related to the queries and/or the amount of them.

I've peeked at changelog for versions between 1.13.1 and 1.16.1 but it doesn't look like there was anything specifically mentioning this issue.

Any ideas ?

wcawijngaards commented 2 years ago

The option harden-below-nxdomain is changed to default yes, if you turn it off, then NXDOMAIN for a higher label does not create an nxdomain for responses below that label. The higher NXDOMAIN then has to be dnssec signed. This is supposed to be fixed so that nxdomain synthesis does not happen above the stub or forward definition, with a fix listed in the changes for 1.13.2.

msoltyspl commented 2 years ago

Hmm, looking at other queries that we had in the logs that probably was the issue (e.g. .service.internal. - as these went straight to root servers).

I've tested manually compiled 1.16.1 - so far no issues. I'll check 1.13.1 with the above option turned off and report later.

Thanks for info.

msoltyspl commented 2 years ago

So to sum up I've tested 1.6.0 (early stretch), 1.13.1 (bullseye), 1.16.1 (manually compiled):

1.6.0 and 1.16.1 work fine, regardless if harden-below-nxdomain is or isn't enabled; 1.13.1 will prioritize higher label over forwarders.

Thanks for help.

For the record, not sure if my issue is the same as the one reported here - @bjacke would have to double check. Also interestingly @vegaaz mentioned pfSense 2.6.0 which IIRC has unbound in 1.13.2 (so version where that was fixed, as per changelog).

catap commented 7 months ago

Unbound-1.18.0 from OpenBSD. Config:

    domain-insecure: "sa31-home.catap.net."

stub-zone:
    name: "sa31-home.catap.net."
    stub-addr: 172.31.2.1

after reboot it does't work. After restart of unbound it works for some time, after that it responses SERVFAIL.

Inside logs it has as littel as:

Feb 25 23:57:20 matebook unbound: [78268:0] info: resolving XXX.sa31-home.catap.net. A IN
Feb 25 23:57:20 matebook unbound: [78268:0] info: use stub sa31-home.catap.net. NS IN
Feb 25 23:57:20 matebook unbound: [78268:0] info: use stub sa31-home.catap.net. NS IN
wcawijngaards commented 7 months ago

Servfail means something was wrong trying to fetch the answer, the logs should contain information, at say verbosity 4 or 5. Or enable log-servfail: yes, that prints one line, and also at lower verbosity levels. The information should indicate what went wrong, like the upstream server fails to respond. If the logs do not have information, perhaps the syslog is getting throttled due to log line count volume, it is possible to log to file to get also the filtered loglines, with like logfile: "/root/unbound.log".

catap commented 7 months ago

@wcawijngaards thanks for suggesting option log-servfail: yes, it allows to dig deper. Here the results:

~ $ host diskstation.sa31-home.catap.net 172.31.2.1
Using domain server:
Name: 172.31.2.1
Address: 172.31.2.1#53
Aliases: 

diskstation.sa31-home.catap.net has address 172.31.2.23
~ $ host diskstation.sa31-home.catap.net 127.0.0.1  
Using domain server:
Name: 127.0.0.1
Address: 127.0.0.1#53
Aliases: 

Host diskstation.sa31-home.catap.net not found: 2(SERVFAIL)
~ $ grep diskstation.sa31-home.catap.net /var/log/daemon                                                                                  
Feb 26 23:00:24 matebook unbound: [67157:0] error: SERVFAIL <diskstation.sa31-home.catap.net. A IN>: all the configured stub or forward servers failed, at zone sa31-home.catap.net. no server to query nameserver addresses not usable have no nameserver names
~ $ doas rcctl restart unbound                           
unbound(ok)
unbound(ok)
~ $ host diskstation.sa31-home.catap.net 127.0.0.1       
Using domain server:
Name: 127.0.0.1
Address: 127.0.0.1#53
Aliases: 

diskstation.sa31-home.catap.net has address 172.31.2.23
Host diskstation.sa31-home.catap.net not found: 2(SERVFAIL)
Host diskstation.sa31-home.catap.net not found: 2(SERVFAIL)
~ $ grep diskstation.sa31-home.catap.net /var/log/daemon 
Feb 26 23:00:24 matebook unbound: [67157:0] error: SERVFAIL <diskstation.sa31-home.catap.net. A IN>: all the configured stub or forward servers failed, at zone sa31-home.catap.net. no server to query nameserver addresses not usable have no nameserver names
Feb 26 23:01:49 matebook unbound: [54626:0] error: SERVFAIL <diskstation.sa31-home.catap.net. AAAA IN>: all the configured stub or forward servers failed, at zone sa31-home.catap.net. from 172.31.2.1 nodata answer
Feb 26 23:01:49 matebook unbound: [54626:0] error: SERVFAIL <diskstation.sa31-home.catap.net. MX IN>: all the configured stub or forward servers failed, at zone sa31-home.catap.net. no server to query nameserver addresses not usable have no nameserver names
~ $

to get it, I simple reboot the machine.

If I restart unbound after reboot, it works. For some time. After some time, it starts to reply SERVFAIL again to exited domain, and restart helps.

BTW, can I some how disable SERVFAIL on not-found replies?

wcawijngaards commented 7 months ago

I do not know about options to disable servfail. But for the error, so it says that the stub addr, that is 172.31.2.1 is failing. With verbosity 4 or 5 it would log in detail what happens when a query is sent there. From the commandline, it looks like the host command did send a query there, and also for some time after start. So perhaps enable the verbosity and then when it fails, see what it says. It must fail for some reason when sending packets to that IP address, like timeout or an error code from the system network call. This is quite similar to sending packets from the commandline, and also it works initially and then fails, so not sure what the failure reason is that happens after some time.

catap commented 7 months ago

I doubt that the remove host is an issue because it is reproduced each reboot, and restart after everything is booted fixes an issue.

Before I thought about remote host issue, and try to debug it quite hard, but I stop dig in that derection as soon as I understand that it is reproduced by each reboot, and fixed by restart unbound.

I took a liberty to sent to your email which you use in git commits the whole log with verbose 5 after reboot. It contains a kind of private information which I don't like to share publicly, I hope it ok. So, this log contains:

If it doesn't help I ready no capture traffic by tcpdump between this machine and DNS server, but let move step by step.

wcawijngaards commented 7 months ago

Thanks for the detailed logs in email!

The server configured for sa31-home.catap.net with ip 172.31.2.1, responds with lame answers. Unbound stops sending queries there and then all you get is servfail. What happens is that it responds with a delegation to catap.net. for queries that are sent to it. So the upstream server says it is not actually hosting the zone in question. For the query for 'diskstation.sa31-home.catap.net.' it answers later on. For local.sa31.. it gave the delegation. Then for the AAAA record for diskstation it gives the delegation again. It also responds with the 'ra' flag, so it is a recursor with a cache it seems.

So perhaps what happens is that after a while the delegation is asked, or returned, and then the information that the server is lame is cached by unbound. Unbound then refuses to send queries to the lame upstream. What could be wrong is that the server is configured as a stub-zone, but actually it is a cache, and should be configured as forward-zone with a forward-addr. If then the information falls out of the cache, because it times out, then as a stub zone it would respond with a higher referral; but as a forward-zone unbound would set the RD flag, causing it to fetch the content in cache again. Another issue could be that it is really there for catap.net and not for the sa31-home name, and then it should be configured as catap.net. With that change the delegation would not be upwards, and it would not be marked as lame. Most likely it is a cache and the forward clause is needed. Otherwise, if the server responds differently for different subzones, eg. local and diskstation, some of which exist or not, that would need separate entries as forward or stub zones, because of its upwards referrals. But that seems to say it does not host catap.net either, so my guess is that it is a cache and needs to be configured as a forward-zone.

catap commented 7 months ago

@wcawijngaards well, on another end DNS is Unifi GW which runs Dnsmaq which is configured as forward DNS proxy to upstream provider and announces local network DNS records from its /etc/hosts which is updated via DHCP server.

Used test domain diskstation.sa31-home.catap.net. is NAS which isn't rebooted and connected via Ethernets.

I still assume that something goes wrong with unbound, and I will make a few more tests with capturing traffic, that may explain wtf is going on.

Regarding stub vs forward zone: probably documentation wasn't clear which one should I use here. Frankly speaking it seems quite, mmm, similar and confusing.

wcawijngaards commented 7 months ago

For that sort of upstream, forward-zone is the one needed. The statement is to send the query to another server that performs recursive lookups. And that is what the DNS proxy to the upstream provider performs. So it is certainly the forward zone. The stub zone config is for talking to the authorities directly, eg. without a proxy or sending traffic to the ISP resolver.

catap commented 7 months ago

well, here I use local unbound to make recursive lookup and would like to use this private DNS to some private network.

Frankly speaking I may simple add NS records to global DNS to make setup simpler.

BTW replace stub- to forward- seems to fix all issues. Also, if I switch on system level to 1.1.1.1 DNS server, I can't reproduce the issue anzmore with stub- configuration.

bjacke commented 7 months ago

it is not ideal to discuss different problems in one bug report. The original report was about a simpe forward zone definition. This bug report turns out to become a discussion about differnt kind of similar issues while there is no news about the problem that was reported initially. I'm also surprised that such a standard feature like forward zones can be broken in a popular DNS server like unbound without the issue getting addressed. People who use their AD domain with the forward zone in unbound will not be able to authenticate their clients for an hour any more unless they find our about the forward-no-cache: yes option. But probably that option also just hides the problem because "the bogous answer is not cached".

catap commented 7 months ago

@bjacke as you said issue is quite similar and I applogize that a kind of hijack your issue.

From another hand, it is much better to keep unresolved issue, than a bot which closes it in 30 days of inactivity.

traeu commented 2 months ago

Are there any news about the original issue? Like @ciroiriarte, I experience this issue with latest OPNSense firewall version (it uses unbound 1.20.0_1) I reported my experiences in detail at https://github.com/opnsense/core/issues/7639 and would love to see a solution!

bjacke commented 2 months ago

I don't unserstand why people think that running a DNS server on a firewall device might be a good idea. Expect for some home router devices, this doesn't make any sense from a security point of view. If you look at the OPNSense folls, they even use a DNS server which has major functionality bug, which is known since some years and which doesn't get addressed. People should simply ignore the DNS server on their firewall device and install a dedicated stable DNS server like Bind in their LAN. Seriously, Unbound disqualified itself with this long-standing bug. I've seen a number of customers in the last couple of years, that I pointed to this bug report and moved them to Bind.

fichtner commented 2 months ago

@bjacke

If you look at the OPNSense folls, they even use a DNS server which has major functionality bug, which is known since some years and which doesn't get addressed.

Care to share your insights and provide relevant references?

Thanks, Franco

catap commented 2 months ago

Care to share your insights and provide relevant references?

A good example is https://github.com/NLnetLabs/unbound/issues/362 where quite a few interesting details of it's behaviour were discussed.

fichtner commented 2 months ago

I can't find a reference to "OPNsense" on #362, maybe I misread that previous rant. ;)

catap commented 2 months ago

It never says anything about OPNsense, but it had qutie a few insides about logic of Unbound and point of view of it's team which explains quite a lot if you use it as DNS resolver.

fichtner commented 2 months ago

Fair enough. I didn't ask or quote you to begin with.

Cheers, Franco