Closed dseomn closed 3 years ago
It seems that using a fallback_upstream on localhost is irrelevant. After changing to --fallback_upstream=8.8.8.8:53
, I still had the same issue. However, changing systemd-resolved
to point (indirectly) at unbound instead of dnss (thereby breaking the loop if dnss was trying to use getaddrinfo()
) seems to have fixed the problem. So I think dnss is completely ignoring the fallback configuration, maybe?
Since systemd-resolved had errors for .
and google
, I just tried running dnss with --fallback_domains=. google. dns.google.
, and the error went away. Is fallback_domains
based on exact matches, or does it include subdomains? If it includes subdomains, then I just sent everything to the fallback_upstream
which is pointless. But if it doesn't include subdomains, that might indicate that the problem is in interaction between the system resolver and dnss.
Thank you for the detailed bug report!
It's not immediately clear to me what's going on, since dnss has no special casing for systemd-resolved, fallback_upstream
is used verbatim to resolve the domains in fallback_domains
(but no subdomains, just exact matches).
I will take a closer look at this this week, and try to reproduce it.
In the meantime, if you can, you can use -monitoring_listen_addr
to start a monitoring http server, and navigate to it. There you should be able to see the traces of the requests, including fallback requests, and those might give us a better clue of what's going on. Running dnss with -v=3
will log more detailed information (to the logs and the monitoring traces).
Thanks again!
Here's the table from http://localhost:9981/debug/requests?fam=dnsserver&b=0&exp=1 taken right after a SERVFAIL
response. Sorry for the formatting, I couldn't think of an easy way to get that to render right on GitHub.
When | Elapsed (s) | . |
---|---|---|
2021/03/03 18:58:36.453939 | 4.001256 | Handler |
18:58:36.453947 | . 8 | ... from:[::1]:34608 id:28789 |
18:58:36.453960 | . 13 | ... Q: (google. DNSKEY IN) |
18:58:36.453992 | . 32 | ... DoH POST https://dns.google/dns-query |
18:58:40.454911 | 4.000920 | ... POST failed: Post https://dns.google/dns-query: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers) |
2021/03/03 18:58:36.451806 | 0.000734 | Handler |
18:58:36.451813 | . 7 | ... from:[::1]:27757 id:11695 |
18:58:36.451830 | . 16 | ... Q: (dns.google. DS IN) |
18:58:36.452393 | . 564 | ... used fallback upstream ([::1]:14653) |
18:58:36.452407 | . 13 | ... ;; opcode: QUERY, status: NOERROR, id: 11695 ;; flags: qr rd ra ad cd; |
18:58:36.452424 | . 17 | ... dns.google. 3600 IN DS 56044 8 2 1B0A7E90AA6B1AC65AA5B573EFC44ABF6CB2559444251B997103D2E40C351B08 |
18:58:36.452451 | . 27 | ... dns.google. 3600 IN RRSIG DS 8 2 3600 20210322234431 20210228234431 57286 google. Ll+Fc9J8fUZ245tfppW846OXRZCsVrET+5x4cIQHXu4UtTkDwPKPcIrEn4gdtrYqvYmUezxLFMbyOjkJpCf5VoeCmcN/Y/p5eW722vRufc4dsceT9EC5M9uqofJ5hfoIzOjAPC27dtqP04qfPu4SMTfLkzjlynQDtTlnPNQ7gxk= |
2021/03/03 18:58:36.449200 | 0.001100 | Handler |
18:58:36.449210 | . 10 | ... from:[::1]:58097 id:5630 |
18:58:36.449239 | . 29 | ... Q: (dns.google. DNSKEY IN) |
18:58:36.450031 | . 793 | ... used fallback upstream ([::1]:14653) |
18:58:36.450051 | . 19 | ... ;; opcode: QUERY, status: NOERROR, id: 5630 ;; flags: qr rd ra cd; |
18:58:36.450080 | . 29 | ... dns.google. 86400 IN DNSKEY 256 3 8 AwEAAZxHngRtF/7L/WllES2Blv7IAE5I7EUNzuYh77xkOoZNtVP9oyIl/VscbW5mWMIL78LTxVooIVyJH5jTIhsYwo8xO2UjdgkXxBak+W6Z3o0iAiFfUHhAZa+UuXeAiAoOfCDLy2bkuL5sl8ZKGmO8+xDxkOTPZXG7jo8mpmjYaq/3 |
18:58:36.450106 | . 26 | ... dns.google. 86400 IN DNSKEY 257 3 8 AwEAAbwcDvL6jj3SGrhFmpiqahBVGlEIrt5OoVmgCINPJozYpXozSsb0yVkK86vOGcaYZXFoaUisWp0s3GJ5pYD74CyHc1s0CIbGdzNOXthhH2rAhR2YgsnKJUmv8Odo1bFbTy9vzrWd3qQ/Mgp9fOFTE7Dkys2c9bpJmvw0WScdW4xIxTXHilwIhqmrxeNjyRlgb3C8r2NWxjj4sqcATN9BSgk4fLvvB7UneWqWgM8kSd29bdL907nsg7eSuzETwTAHkU2mGHdC5i3kvRDgTqPC2zwk+tF5Rau6x78XkZJm5pcyD0AYsmusJqjGa60FFe759CZQ0Y8K7cZPvom/cozR3lM= |
18:58:36.450149 | . 43 | ... dns.google. 86400 IN RRSIG DNSKEY 8 2 86400 20210402170952 20210303170952 56044 dns.google. hCC6mAhmPe4lkfExd5uLEJvUrEPrxQa2ruBjbu+2RwsWV6Ez/E9Vq+EDnD5ULS88bU6BI5RiwJ1lrpO/X5P4bDB9motXJ7cxQQDMMsq9pPBqgkrR4zzonwFLyHhIBw6uDbKx8DbJu99ILIgaTJaAEDb8JQRTSshOosxFpSovtByyqfUsUleCN9e6ts08D8JCBUgQEeLHR9v8Kjbm1Z1MaHlLczV+Ct8DQ+Z6TjUwFRzrH6rewDNPfDzVgGcdGoPvgqvYEp1u7BE7I+12z6iW9jrZL305dBsrpy+UnxV8U/7WMgqELoJXv+LqYjZj13rbo9AnRK6ndfG21EJWo8uFlQ== |
2021/03/03 18:58:35.907520 | 0.539393 | Handler |
18:58:35.907536 | . 16 | ... from:[::1]:30984 id:12199 |
18:58:35.907565 | . 29 | ... Q: (dns.google. AAAA IN) |
18:58:36.446543 | .538978 | ... used fallback upstream ([::1]:14653) |
18:58:36.446579 | . 37 | ... ;; opcode: QUERY, status: NOERROR, id: 12199 ;; flags: qr rd ra ad cd; |
18:58:36.446608 | . 29 | ... dns.google. 900 IN AAAA 2001:4860:4860::8844 |
18:58:36.446640 | . 32 | ... dns.google. 900 IN AAAA 2001:4860:4860::8888 |
18:58:36.446691 | . 51 | ... dns.google. 900 IN RRSIG AAAA 8 2 900 20210402170952 20210303170952 1773 dns.google. MGGNXtS05I8OGA7Isg6PH+N7bvIfsLO/jCRA1qzhV8ep8l9t8sh0hZo/Y9H41S3Df1jtMVAqnmsorMiqZ9kkMfjG9BEoFNbrhy1E93QqRSi3EYW+sJ6Qi9sCi7u0jqH/jsTzYXG74Io0ezR2gTifDEZpDfJyValleT03TcMvPLQ= |
2021/03/03 18:58:35.904871 | 0.542443 | Handler |
18:58:35.904883 | . 11 | ... from:[::1]:19787 id:17363 |
18:58:35.904935 | . 53 | ... Q: (dns.google. A IN) |
18:58:36.447125 | .542190 | ... used fallback upstream ([::1]:14653) |
18:58:36.447141 | . 16 | ... ;; opcode: QUERY, status: NOERROR, id: 17363 ;; flags: qr rd ra ad cd; |
18:58:36.447156 | . 15 | ... dns.google. 900 IN A 8.8.8.8 |
18:58:36.447169 | . 14 | ... dns.google. 900 IN A 8.8.4.4 |
18:58:36.447198 | . 28 | ... dns.google. 900 IN RRSIG A 8 2 900 20210402170952 20210303170952 1773 dns.google. MbxtIILgzvZ0jOdKA4Oan9nRucHQ2KEj2hANZzMyATzxPNp7ZxEpf6xUUpxBT1e404jPcBfKfdcJPT7fT3AKzQzxzDmWPuPU8J6E3mvhNQQg0CGQknaTWXtMTwrm9qMUDTB8GsmkX/MfcwFdkCIX/dyZyMSYxBQJuHJD2RfOfPw= |
2021/03/03 18:58:35.898867 | 4.001653 | Handler |
18:58:35.898912 | . 44 | ... from:[::1]:36245 id:14133 |
18:58:35.898948 | . 36 | ... Q: (google.com. A IN) |
18:58:35.899187 | . 239 | ... DoH POST https://dns.google/dns-query |
18:58:39.900213 | 4.001027 | ... POST failed: Post https://dns.google/dns-query: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers) |
I also tried adding DNSSEC=no
to /etc/systemd/resolved.conf
and that fixed the issue. So I think what's happening is that dnss is using the system resolver for dns.google.
. systemd-resolved then queries dnss for dns.google.
A
and AAAA
which succeeds, but it also queries dnss for DNSSEC-related records on google.
and .
which fails because those aren't in fallback_domains
and dnss can't POST
to https://dns.google/dns-query
because it's still waiting for a response from systemd-resolved.
So if I'm understanding this correctly, I think that means there are 4 options:
--fallback_domains='. google. dns.google.'
so that systemd-resolved can successfully do DNSSEC verification before dnss is able to POST
to https://dns.google/dns-query
.DNSSEC=no
so that it only looks up A
and AAAA
on dns.google.
without trying to query for google.
or .
.fallback_upstream
directly.https_upstream
separately from the hostname used for TLS SNI and the HTTP Host header. Maybe something like -https_upstream=https://dns.google/dns-query -https_upstream_address=8.8.8.8,8.8.4.4
. Then the fallback flags wouldn't be necessary.Do you think either (3) or (4) are desirable? Any idea how easy they would be? I don't really know any go, but I could potentially give it a shot if you want.
I think you're right. I spent a bit looking at the code and can confirm that dnss is using the system's default resolver to find upstream. Normally that request is either directed to dnss, and the fallback mechanism gets engaged, or it's a normal DNS lookup that returns a usable value; but in this case due to the systemd-resolved and dnssec interactions the loop isn't working properly. I don't know why it occasionally works, but I suspect that's due to some systemd-resolved behaviour, and it doesn't change the nature of the problem anyway.
As for the options, (3) is probably cleaner, reasonably safe to change, and should work in a wide range of circumstances. It's also the least surprising, I think. It's not straightforward to do cleanly due to how the code is structured, but it's doable.
For (4) I think it's also doable but much more brittle and prone to accidental issues on a sensitive area, so I rather avoid it. Note that on domains with certificates with IP fields, dnss works just fine. That's how for example https://1.1.1.1/dns-query
can be used with dnss, because there is a valid certificate for it.
Until I have a change ready to try (which will take me a couple of days) as a workaround (1) should work just fine, and you won't lose dnssec resolution for most domains.
I'll let you know once there's a patch to try :)
Thanks a lot again for such detailed bug report and all the help in figuring out what's going on!
If you want, you can give commit 7d3ca40b6f a try, which is on the next
branch.
It's a work in progress so it will be amended, but I did some ad-hoc tests and it seems to work fine, and in a way that should solve your issue since systemd-resolved should be totally bypassed now.
I will polish and extend it in the next few days, and will let you know once there's a more tested version; but just in case you want to give it a try early.
Thanks!
I don't know why it occasionally works, but I suspect that's due to some systemd-resolved behaviour, and it doesn't change the nature of the problem anyway.
Oh sorry, I think I figured that out already but didn't say anything because it didn't seem too relevant. While debugging initially, I kept changing settings and restarting dnss but not restarting systemd-resolved. I think systemd-resolved has its own cache, so if I got things working, it could cache dns.google.
and then when I changed dnss's settings back to something that normally didn't work, it would sometimes work because of systemd-resolved's cache. Once I started running the below command between every change, everything was deterministic. (And yes, I'm aware that 4 separate DNS servers on one machine is a lot :) )
systemctl daemon-reload && service systemd-resolved restart && service dnsmasq restart && service unbound restart && service dnss restart
If you want, you can give commit 7d3ca40 a try, which is on the next branch.
That worked, thanks!
Thanks a lot again for such detailed bug report and all the help in figuring out what's going on!
Thanks for fixing it!
The fix for this issue is now in the master
branch, thanks again!
Hi,
I tried configuring dnss to use unbound on localhost as the
fallback_upstream
, but I'm gettingSERVFAIL
some of the time. Not always, just some of the time. From the logs, I think dnss might be trying to use the system resolver (systemd-resolved) instead of unbound to lookupfallback_domains
.I'm on Debian stable, using dnss
0.0~git20180721.0.2de63ab0-1+b11
, so apologies if this is something that's been fixed already. I looked at the git history and didn't see anything that looked relevant though.dnss command, from
ps
:Relevant ports, from
ss
:unbound is correctly resolving
dns.google.
:But dnss is returning
SERVFAIL
:System logs, starting when I queried dnss for
google.com
:The fact that systemd-resolved appears to be trying to look up
dns.google
makes me think that somehow dnss is using the local resolver viagetaddrinfo()
or similar, instead of querying thefallback_upstream
, but I'm just guessing. Any ideas how this could happen?