albertito / dnss

DNS over HTTPS [mirror]
https://blitiri.com.ar/git/r/dnss/
Other
126 stars 17 forks source link

dnss appears to be using the system resolver instead of the fallback flags. #9

Closed dseomn closed 3 years ago

dseomn commented 3 years ago

Hi,

I tried configuring dnss to use unbound on localhost as the fallback_upstream, but I'm getting SERVFAIL some of the time. Not always, just some of the time. From the logs, I think dnss might be trying to use the system resolver (systemd-resolved) instead of unbound to lookup fallback_domains.

I'm on Debian stable, using dnss 0.0~git20180721.0.2de63ab0-1+b11, so apologies if this is something that's been fixed already. I looked at the git history and didn't see anything that looked relevant though.

dnss command, from ps:

/usr/bin/dnss --dns_listen_addr=systemd --enable_cache=false --enable_dns_to_https --fallback_domains=dns.google. --fallback_upstream=[::1]:14653 --force_mode=DoH --https_upstream=https://dns.google/dns-query

Relevant ports, from ss:

udp   UNCONN 0      0                                 [::1]:25953                                                      [::]:*                                    users:(("dnss",pid=384,fd=5),("systemd",pid=1,fd=57))                          
tcp   LISTEN 0      128                               [::1]:25953                                                      [::]:*                                    users:(("dnss",pid=384,fd=3),("systemd",pid=1,fd=58)) 
udp   UNCONN 0      0                                 [::1]:14653                                                      [::]:*                                    users:(("unbound",pid=440,fd=3))                                               
tcp   LISTEN 0      128                               [::1]:14653                                                      [::]:*                                    users:(("unbound",pid=440,fd=4))

unbound is correctly resolving dns.google.:

# dig dns.google @::1 -p 14653

; <<>> DiG 9.11.5-P4-5.1+deb10u3-Debian <<>> dns.google @::1 -p 14653
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 1118
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;dns.google.                    IN      A

;; ANSWER SECTION:
dns.google.             744     IN      A       8.8.8.8
dns.google.             744     IN      A       8.8.4.4

;; Query time: 0 msec
;; SERVER: ::1#14653(::1)
;; WHEN: Mon Mar 01 23:31:20 EST 2021
;; MSG SIZE  rcvd: 71

But dnss is returning SERVFAIL:

# dig google.com @::1 -p 25953

; <<>> DiG 9.11.5-P4-5.1+deb10u3-Debian <<>> google.com @::1 -p 25953
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 4451
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;google.com.                    IN      A

;; Query time: 4001 msec
;; SERVER: ::1#25953(::1)
;; WHEN: Mon Mar 01 23:31:35 EST 2021
;; MSG SIZE  rcvd: 28

System logs, starting when I queried dnss for google.com:

Mar 01 23:33:51 sakaar dnss[786]: _ server.go:134      resolver query error: POST failed: Post https://dns.google/dns-query: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)                                                                                                                                                         
Mar 01 23:33:51 sakaar dnss[786]: _ server.go:134      resolver query error: POST failed: Post https://dns.google/dns-query: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)                                                                                                                                                         
Mar 01 23:33:51 sakaar systemd-resolved[843]: DNSSEC validation failed for question . IN SOA: no-signature
Mar 01 23:33:51 sakaar systemd-resolved[843]: DNSSEC validation failed for question google IN DS: no-signature
Mar 01 23:33:51 sakaar systemd-resolved[843]: DNSSEC validation failed for question google IN SOA: no-signature
Mar 01 23:33:51 sakaar systemd-resolved[843]: DNSSEC validation failed for question google IN DNSKEY: no-signature
Mar 01 23:33:51 sakaar systemd-resolved[843]: DNSSEC validation failed for question dns.google IN DS: no-signature
Mar 01 23:33:51 sakaar systemd-resolved[843]: DNSSEC validation failed for question dns.google IN DNSKEY: no-signature
Mar 01 23:33:51 sakaar systemd-resolved[843]: DNSSEC validation failed for question dns.google IN AAAA: no-signature
Mar 01 23:33:51 sakaar systemd-resolved[843]: DNSSEC validation failed for question dns.google IN A: no-signature
Mar 01 23:33:54 sakaar dnss[786]: _ server.go:134      resolver query error: POST failed: Post https://dns.google/dns-query: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)                                                                                                                                                         
Mar 01 23:33:55 sakaar dnss[786]: _ server.go:134      resolver query error: POST failed: Post https://dns.google/dns-query: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)                                                                                                                                                         
Mar 01 23:33:55 sakaar dnss[786]: _ server.go:134      resolver query error: POST failed: Post https://dns.google/dns-query: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)                                                                                                                                                         
Mar 01 23:33:56 sakaar dnss[786]: _ server.go:134      resolver query error: POST failed: Post https://dns.google/dns-query: dial tcp: lookup dns.google: Temporary failure in name resolution
Mar 01 23:33:56 sakaar dnss[786]: _ server.go:134      resolver query error: POST failed: Post https://dns.google/dns-query: dial tcp: lookup dns.google: Temporary failure in name resolution
Mar 01 23:34:00 sakaar dnss[786]: _ server.go:134      resolver query error: POST failed: Post https://dns.google/dns-query: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)                                                                                                                                                         
Mar 01 23:34:00 sakaar dnss[786]: _ server.go:134      resolver query error: POST failed: Post https://dns.google/dns-query: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)                                                                                                                                                         
Mar 01 23:34:04 sakaar dnss[786]: _ server.go:134      resolver query error: POST failed: Post https://dns.google/dns-query: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Mar 01 23:34:04 sakaar dnss[786]: _ server.go:134      resolver query error: POST failed: Post https://dns.google/dns-query: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Mar 01 23:34:05 sakaar dnss[786]: _ server.go:134      resolver query error: POST failed: Post https://dns.google/dns-query: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Mar 01 23:34:06 sakaar dnss[786]: _ server.go:134      resolver query error: POST failed: Post https://dns.google/dns-query: dial tcp: lookup dns.google: Temporary failure in name resolution
Mar 01 23:34:06 sakaar dnss[786]: _ server.go:134      resolver query error: POST failed: Post https://dns.google/dns-query: dial tcp: lookup dns.google: Temporary failure in name resolution

The fact that systemd-resolved appears to be trying to look up dns.google makes me think that somehow dnss is using the local resolver via getaddrinfo() or similar, instead of querying the fallback_upstream, but I'm just guessing. Any ideas how this could happen?

dseomn commented 3 years ago

It seems that using a fallback_upstream on localhost is irrelevant. After changing to --fallback_upstream=8.8.8.8:53, I still had the same issue. However, changing systemd-resolved to point (indirectly) at unbound instead of dnss (thereby breaking the loop if dnss was trying to use getaddrinfo()) seems to have fixed the problem. So I think dnss is completely ignoring the fallback configuration, maybe?

dseomn commented 3 years ago

Since systemd-resolved had errors for . and google, I just tried running dnss with --fallback_domains=. google. dns.google., and the error went away. Is fallback_domains based on exact matches, or does it include subdomains? If it includes subdomains, then I just sent everything to the fallback_upstream which is pointless. But if it doesn't include subdomains, that might indicate that the problem is in interaction between the system resolver and dnss.

albertito commented 3 years ago

Thank you for the detailed bug report!

It's not immediately clear to me what's going on, since dnss has no special casing for systemd-resolved, fallback_upstream is used verbatim to resolve the domains in fallback_domains (but no subdomains, just exact matches).

I will take a closer look at this this week, and try to reproduce it.

In the meantime, if you can, you can use -monitoring_listen_addr to start a monitoring http server, and navigate to it. There you should be able to see the traces of the requests, including fallback requests, and those might give us a better clue of what's going on. Running dnss with -v=3 will log more detailed information (to the logs and the monitoring traces).

Thanks again!

dseomn commented 3 years ago

Here's the table from http://localhost:9981/debug/requests?fam=dnsserver&b=0&exp=1 taken right after a SERVFAIL response. Sorry for the formatting, I couldn't think of an easy way to get that to render right on GitHub.

When Elapsed (s) .
2021/03/03 18:58:36.453939 4.001256 Handler
18:58:36.453947 . 8 ... from:[::1]:34608 id:28789
18:58:36.453960 . 13 ... Q: (google. DNSKEY IN)
18:58:36.453992 . 32 ... DoH POST https://dns.google/dns-query
18:58:40.454911 4.000920 ... POST failed: Post https://dns.google/dns-query: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2021/03/03 18:58:36.451806 0.000734 Handler
18:58:36.451813 . 7 ... from:[::1]:27757 id:11695
18:58:36.451830 . 16 ... Q: (dns.google. DS IN)
18:58:36.452393 . 564 ... used fallback upstream ([::1]:14653)
18:58:36.452407 . 13 ... ;; opcode: QUERY, status: NOERROR, id: 11695 ;; flags: qr rd ra ad cd;
18:58:36.452424 . 17 ... dns.google. 3600 IN DS 56044 8 2 1B0A7E90AA6B1AC65AA5B573EFC44ABF6CB2559444251B997103D2E40C351B08
18:58:36.452451 . 27 ... dns.google. 3600 IN RRSIG DS 8 2 3600 20210322234431 20210228234431 57286 google. Ll+Fc9J8fUZ245tfppW846OXRZCsVrET+5x4cIQHXu4UtTkDwPKPcIrEn4gdtrYqvYmUezxLFMbyOjkJpCf5VoeCmcN/Y/p5eW722vRufc4dsceT9EC5M9uqofJ5hfoIzOjAPC27dtqP04qfPu4SMTfLkzjlynQDtTlnPNQ7gxk=
2021/03/03 18:58:36.449200 0.001100 Handler
18:58:36.449210 . 10 ... from:[::1]:58097 id:5630
18:58:36.449239 . 29 ... Q: (dns.google. DNSKEY IN)
18:58:36.450031 . 793 ... used fallback upstream ([::1]:14653)
18:58:36.450051 . 19 ... ;; opcode: QUERY, status: NOERROR, id: 5630 ;; flags: qr rd ra cd;
18:58:36.450080 . 29 ... dns.google. 86400 IN DNSKEY 256 3 8 AwEAAZxHngRtF/7L/WllES2Blv7IAE5I7EUNzuYh77xkOoZNtVP9oyIl/VscbW5mWMIL78LTxVooIVyJH5jTIhsYwo8xO2UjdgkXxBak+W6Z3o0iAiFfUHhAZa+UuXeAiAoOfCDLy2bkuL5sl8ZKGmO8+xDxkOTPZXG7jo8mpmjYaq/3
18:58:36.450106 . 26 ... dns.google. 86400 IN DNSKEY 257 3 8 AwEAAbwcDvL6jj3SGrhFmpiqahBVGlEIrt5OoVmgCINPJozYpXozSsb0yVkK86vOGcaYZXFoaUisWp0s3GJ5pYD74CyHc1s0CIbGdzNOXthhH2rAhR2YgsnKJUmv8Odo1bFbTy9vzrWd3qQ/Mgp9fOFTE7Dkys2c9bpJmvw0WScdW4xIxTXHilwIhqmrxeNjyRlgb3C8r2NWxjj4sqcATN9BSgk4fLvvB7UneWqWgM8kSd29bdL907nsg7eSuzETwTAHkU2mGHdC5i3kvRDgTqPC2zwk+tF5Rau6x78XkZJm5pcyD0AYsmusJqjGa60FFe759CZQ0Y8K7cZPvom/cozR3lM=
18:58:36.450149 . 43 ... dns.google. 86400 IN RRSIG DNSKEY 8 2 86400 20210402170952 20210303170952 56044 dns.google. hCC6mAhmPe4lkfExd5uLEJvUrEPrxQa2ruBjbu+2RwsWV6Ez/E9Vq+EDnD5ULS88bU6BI5RiwJ1lrpO/X5P4bDB9motXJ7cxQQDMMsq9pPBqgkrR4zzonwFLyHhIBw6uDbKx8DbJu99ILIgaTJaAEDb8JQRTSshOosxFpSovtByyqfUsUleCN9e6ts08D8JCBUgQEeLHR9v8Kjbm1Z1MaHlLczV+Ct8DQ+Z6TjUwFRzrH6rewDNPfDzVgGcdGoPvgqvYEp1u7BE7I+12z6iW9jrZL305dBsrpy+UnxV8U/7WMgqELoJXv+LqYjZj13rbo9AnRK6ndfG21EJWo8uFlQ==
2021/03/03 18:58:35.907520 0.539393 Handler
18:58:35.907536 . 16 ... from:[::1]:30984 id:12199
18:58:35.907565 . 29 ... Q: (dns.google. AAAA IN)
18:58:36.446543 .538978 ... used fallback upstream ([::1]:14653)
18:58:36.446579 . 37 ... ;; opcode: QUERY, status: NOERROR, id: 12199 ;; flags: qr rd ra ad cd;
18:58:36.446608 . 29 ... dns.google. 900 IN AAAA 2001:4860:4860::8844
18:58:36.446640 . 32 ... dns.google. 900 IN AAAA 2001:4860:4860::8888
18:58:36.446691 . 51 ... dns.google. 900 IN RRSIG AAAA 8 2 900 20210402170952 20210303170952 1773 dns.google. MGGNXtS05I8OGA7Isg6PH+N7bvIfsLO/jCRA1qzhV8ep8l9t8sh0hZo/Y9H41S3Df1jtMVAqnmsorMiqZ9kkMfjG9BEoFNbrhy1E93QqRSi3EYW+sJ6Qi9sCi7u0jqH/jsTzYXG74Io0ezR2gTifDEZpDfJyValleT03TcMvPLQ=
2021/03/03 18:58:35.904871 0.542443 Handler
18:58:35.904883 . 11 ... from:[::1]:19787 id:17363
18:58:35.904935 . 53 ... Q: (dns.google. A IN)
18:58:36.447125 .542190 ... used fallback upstream ([::1]:14653)
18:58:36.447141 . 16 ... ;; opcode: QUERY, status: NOERROR, id: 17363 ;; flags: qr rd ra ad cd;
18:58:36.447156 . 15 ... dns.google. 900 IN A 8.8.8.8
18:58:36.447169 . 14 ... dns.google. 900 IN A 8.8.4.4
18:58:36.447198 . 28 ... dns.google. 900 IN RRSIG A 8 2 900 20210402170952 20210303170952 1773 dns.google. MbxtIILgzvZ0jOdKA4Oan9nRucHQ2KEj2hANZzMyATzxPNp7ZxEpf6xUUpxBT1e404jPcBfKfdcJPT7fT3AKzQzxzDmWPuPU8J6E3mvhNQQg0CGQknaTWXtMTwrm9qMUDTB8GsmkX/MfcwFdkCIX/dyZyMSYxBQJuHJD2RfOfPw=
2021/03/03 18:58:35.898867 4.001653 Handler
18:58:35.898912 . 44 ... from:[::1]:36245 id:14133
18:58:35.898948 . 36 ... Q: (google.com. A IN)
18:58:35.899187 . 239 ... DoH POST https://dns.google/dns-query
18:58:39.900213 4.001027 ... POST failed: Post https://dns.google/dns-query: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

I also tried adding DNSSEC=no to /etc/systemd/resolved.conf and that fixed the issue. So I think what's happening is that dnss is using the system resolver for dns.google.. systemd-resolved then queries dnss for dns.google. A and AAAA which succeeds, but it also queries dnss for DNSSEC-related records on google. and . which fails because those aren't in fallback_domains and dnss can't POST to https://dns.google/dns-query because it's still waiting for a response from systemd-resolved.

So if I'm understanding this correctly, I think that means there are 4 options:

  1. I could set --fallback_domains='. google. dns.google.' so that systemd-resolved can successfully do DNSSEC verification before dnss is able to POST to https://dns.google/dns-query.
  2. I could configure systemd-resolved with DNSSEC=no so that it only looks up A and AAAA on dns.google. without trying to query for google. or ..
  3. dnss could bypass the system resolver and talk to fallback_upstream directly.
  4. dnss could add a feature to specify the address to connect to for https_upstream separately from the hostname used for TLS SNI and the HTTP Host header. Maybe something like -https_upstream=https://dns.google/dns-query -https_upstream_address=8.8.8.8,8.8.4.4. Then the fallback flags wouldn't be necessary.

Do you think either (3) or (4) are desirable? Any idea how easy they would be? I don't really know any go, but I could potentially give it a shot if you want.

albertito commented 3 years ago

I think you're right. I spent a bit looking at the code and can confirm that dnss is using the system's default resolver to find upstream. Normally that request is either directed to dnss, and the fallback mechanism gets engaged, or it's a normal DNS lookup that returns a usable value; but in this case due to the systemd-resolved and dnssec interactions the loop isn't working properly. I don't know why it occasionally works, but I suspect that's due to some systemd-resolved behaviour, and it doesn't change the nature of the problem anyway.

As for the options, (3) is probably cleaner, reasonably safe to change, and should work in a wide range of circumstances. It's also the least surprising, I think. It's not straightforward to do cleanly due to how the code is structured, but it's doable.

For (4) I think it's also doable but much more brittle and prone to accidental issues on a sensitive area, so I rather avoid it. Note that on domains with certificates with IP fields, dnss works just fine. That's how for example https://1.1.1.1/dns-query can be used with dnss, because there is a valid certificate for it.

Until I have a change ready to try (which will take me a couple of days) as a workaround (1) should work just fine, and you won't lose dnssec resolution for most domains.

I'll let you know once there's a patch to try :)

Thanks a lot again for such detailed bug report and all the help in figuring out what's going on!

albertito commented 3 years ago

If you want, you can give commit 7d3ca40b6f a try, which is on the next branch.

It's a work in progress so it will be amended, but I did some ad-hoc tests and it seems to work fine, and in a way that should solve your issue since systemd-resolved should be totally bypassed now.

I will polish and extend it in the next few days, and will let you know once there's a more tested version; but just in case you want to give it a try early.

Thanks!

dseomn commented 3 years ago

I don't know why it occasionally works, but I suspect that's due to some systemd-resolved behaviour, and it doesn't change the nature of the problem anyway.

Oh sorry, I think I figured that out already but didn't say anything because it didn't seem too relevant. While debugging initially, I kept changing settings and restarting dnss but not restarting systemd-resolved. I think systemd-resolved has its own cache, so if I got things working, it could cache dns.google. and then when I changed dnss's settings back to something that normally didn't work, it would sometimes work because of systemd-resolved's cache. Once I started running the below command between every change, everything was deterministic. (And yes, I'm aware that 4 separate DNS servers on one machine is a lot :) )

systemctl daemon-reload && service systemd-resolved restart && service dnsmasq restart && service unbound restart && service dnss restart

If you want, you can give commit 7d3ca40 a try, which is on the next branch.

That worked, thanks!

Thanks a lot again for such detailed bug report and all the help in figuring out what's going on!

Thanks for fixing it!

albertito commented 3 years ago

The fix for this issue is now in the master branch, thanks again!