folbricht / routedns

DNS stub resolver, proxy and router with support for DoT, DoH, DoQ, and DTLS
BSD 3-Clause "New" or "Revised" License

High number of open UDP sockets when using DoQ #215

brotaxt opened this issue 2 years ago

brotaxt commented 2 years ago

Hey,

I'm seeing a huge number of the following error messages in my routedns logs.

Feb 08 18:27:11 localhost routedns[27140]: time="2022-02-08T18:27:11+01:00" level=error msg="failed to resolve" addr="[::1]:5353" client="::1" error="deadline exceeded" id=local-doh-behind-proxy protocol=udp qname=ns.eg-services.net.
Feb 08 18:27:13 localhost routedns[27140]: time="2022-02-08T18:27:13+01:00" level=error msg="failed to resolve" addr="[::1]:5353" client="::1" error="timeout: no recent network activity" id=local-doh-behind-proxy protocol=udp qname=ns.eg-services.net.
Feb 08 18:27:13 localhost routedns[27140]: time="2022-02-08T18:27:13+01:00" level=error msg="failed to resolve" addr="[::1]:5353" client="::1" error="timeout: no recent network activity" id=local-doh-behind-proxy protocol=udp qname=ns.eg-services.net.
Feb 08 19:04:53 localhost routedns[27140]: time="2022-02-08T19:04:53+01:00" level=error msg="failed to resolve" addr="[::1]:5353" client="::1" error="deadline exceeded" id=local-doh-behind-proxy protocol=udp qname=110.119.193.42.dnsbl.sorbs.net.
Feb 08 19:04:54 localhost routedns[27140]: time="2022-02-08T19:04:54+01:00" level=error msg="failed to resolve" addr="[::1]:5353" client="::1" error="deadline exceeded" id=local-doh-behind-proxy protocol=udp qname=110.119.193.42.dnsbl.sorbs.net.
Feb 08 19:05:06 localhost routedns[27140]: time="2022-02-08T19:05:06+01:00" level=error msg="failed to resolve" addr="[::1]:5353" client="::1" error="deadline exceeded" id=local-doh-behind-proxy protocol=udp qname=59.152.214.162.psbl.surriel.com.
Feb 08 19:08:52 localhost routedns[27140]: time="2022-02-08T19:08:52+01:00" level=error msg="failed to resolve" addr="[::1]:5353" client="::1" error="deadline exceeded" id=local-doh-behind-proxy protocol=udp qname=200.165.6.71.dnsbl.sorbs.net.
Feb 08 19:23:41 localhost routedns[27140]: time="2022-02-08T19:23:41+01:00" level=error msg="failed to resolve" addr="[::1]:5353" client="::1" error="deadline exceeded" id=local-doh-behind-proxy protocol=udp qname=154.29.56.193.ips.backscatterer.org.
Feb 08 19:51:00 localhost routedns[27140]: time="2022-02-08T19:51:00+01:00" level=error msg="failed to resolve" addr="[::1]:5353" client="::1" error="deadline exceeded" id=local-doh-behind-proxy protocol=udp qname=227.84.0.173.dnsbl.sorbs.net.
Feb 08 19:51:01 localhost routedns[27140]: time="2022-02-08T19:51:01+01:00" level=error msg="failed to resolve" addr="[::1]:5353" client="::1" error="deadline exceeded" id=local-doh-behind-proxy protocol=udp qname=227.84.0.173.dnsbl.sorbs.net.
Feb 08 20:06:02 localhost routedns[27140]: time="2022-02-08T20:06:02+01:00" level=error msg="failed to resolve" addr="[::1]:5353" client="::1" error="dns: overflow unpacking uint16" id=local-doh-behind-proxy protocol=udp qname=m.youtube.com.
Feb 08 20:20:33 localhost routedns[27140]: time="2022-02-08T20:20:33+01:00" level=error msg="failed to resolve" addr="[::1]:5353" client="::1" error="dns: overflow unpacking uint16" id=local-doh-behind-proxy protocol=udp qname=2.tlu.dl.delivery.mp.microsoft.com.

After some time the system shows a huge number of unused, open UDP sockets that never get closed.

root@localhost:~/routedns/routedns# netstat -pan |grep routedns| wc -l
125

udp6       0      0 :::55856                :::*                                1003       4918191    27140/routedns
udp6       0      0 :::48692                :::*                                1003       4909893    27140/routedns
udp6       0      0 :::53820                :::*                                1003       4907461    27140/routedns
udp6       0      0 :::37439                :::*                                1003       4916120    27140/routedns
udp6       0      0 :::48706                :::*                                1003       4948577    27140/routedns
udp6       0      0 :::54853                :::*                                1003       4952859    27140/routedns
udp6       0      0 :::60997                :::*                                1003       4945554    27140/routedns
udp6       0      0 :::46662                :::*                                1003       4986870    27140/routedns
udp6       0      0 :::49740                :::*                                1003       4974251    27140/routedns
udp6       0      0 :::40525                :::*                                1003       4898577    27140/routedns
udp6       0      0 :::49756                :::*                                1003       4940992    27140/routedns
udp6       0      0 :::54883                :::*                                1003       4945407    27140/routedns
udp6       0      0 :::45679                :::*                                1003       4904812    27140/routedns
udp6       0      0 :::50800                :::*                                1003       4915990    27140/routedns
udp6       0      0 :::49785                :::*                                1003       4957260    27140/routedns
udp6       0      0 :::56972                :::*                                1003       4937591    27140/routedns
udp6       0      0 :::33428                :::*                                1003       4971963    27140/routedns

Do you have any idea where this issue could come from?

These odd dnsbl.sorbs.net queries are coming from the installed mail server. Description: https://www.spamhaus.org/faq/section/DNSBL%20Usage

My configuration:

/opt/routedns/config.toml

title = "RouteDNS configuration"

[resolvers]

  [resolvers.nextdns]
  address = "a6xxxx.dns.nextdns.io:8853"
  protocol = "doq"

[groups]

  [groups.nextdns_cached]
  type = "cache"
  cache-size = 0
  resolvers = ["nextdns"]
#  gc-period = 120

[listeners]

  [listeners.local-doh-behind-proxy]
  address = "[::1]:5353"
  protocol = "udp"
  resolver = "nextdns_cached"

/etc/dnsmasq.d/01-pihole.conf

addn-hosts=/etc/pihole/local.list
addn-hosts=/etc/pihole/custom.list

localise-queries

no-resolv

cache-size=10000

log-facility=/var/log/pihole.log

log-async
domain-needed
expand-hosts
bogus-priv
interface=eth0
server=::1#5353

/etc/systemd/system/routedns.service

[Unit]
Description=RouteDNS - DNS stub resolver and router
After=network.target

[Service]
Type=simple
User=routedns
WorkingDirectory=/opt/routedns
AmbientCapabilities=CAP_NET_BIND_SERVICE
ExecStart=/opt/routedns/routedns /opt/routedns/config.toml
Restart=on-failure
Nice=-5

[Install]
WantedBy=multi-user.target
brotaxt commented 2 years ago

The debug/trace logs from routedns don't show any inconsistencies.

For what purpose does routedns create these udp6 sockets? The connection to the upstream resolver (nextdns) is established via IPv4.

folbricht commented 2 years ago

This was reported before but I wasn't able to reproduce it. Perhaps we can debug it a bit more this time.

The IPv6 sockets could come from your listener, or just from Linux using them for outbound queries (I believe those handle IPv4 as well).

Is your server under heavy load? Are there failures that could perhaps leave sockets behind? I'd like to find a way to narrow it down a bit.

brotaxt commented 2 years ago

I've changed the loopback addresses from IPv6 (::1) to IPv4 (127.0.0.1) and tested DNS over HTTPS (QUIC).

Unfortunately, neither change resolved the problem of the many open UDP sockets.

Even the log messages mentioned above don't seem to have anything to do with the socket problem: I've seen multiple open sockets while not a single error was logged.

The system load is very low the whole time.

TCP-based protocols (DoT or DoH) don't show this behavior. No IPv6 sockets are used with TCP.

folbricht commented 2 years ago

Interesting. Would you be able to try to find a minimal configuration where you still see this issue, and maybe a way to trigger it (like the number of requests per minute or so)? Once I can reproduce this I might be able to sort it out.

cbuijs commented 2 years ago

I had similar issues using DoQ with NextDNS. I switched to using DoH and that worked much better and more reliably (TCP). In the end I just gave up on DoQ; it feels immature.

Example:

# Uncomment below to prevent DNS leakage via system resolver
#[bootstrap-resolver] 
#address = "https://dns.nextdns.io:443/a6xxxx"
#protocol = "doh"
#bootstrap-address = "45.90.28.0"

[resolvers.nextdns]
address = "https://dns.nextdns.io:443/a6xxxx"
protocol = "doh"

I also had some good results (using DoQ) by upping the net.core.rmem_max value, as UDP buffering could be the culprit. Just run sysctl -w net.core.rmem_max=2500000 before starting RouteDNS; a too-small UDP receive buffer is a commonly reported issue when using DoQ.

Also see some hints and tips here: https://github.com/lucas-clemente/quic-go/issues and here: https://caddy.community/t/udp-receive-buffer-size-quic/11244

charlieporth1 commented 2 years ago

@cbuijs I am having this issue on DoQ as well, but from my own servers

cbuijs commented 2 years ago

Can you change to DoH? It just works better.

brotaxt commented 2 years ago

I had already raised rmem_max to 16777216

These were my kernel parameters when I encountered the socket problem.

## raise the udp buffers
net.ipv4.udp_rmem_min = 16384
net.ipv4.udp_wmem_min = 16384

#
## 16MB per socket - which sounds like a lot, but will virtually never consume that much.
#

net.core.rmem_max = 16777216
net.core.wmem_max = 16777216

net.core.rmem_default = 16777216
net.core.wmem_default = 16777216

net.core.optmem_max = 16777216

net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 87380 16777216
brotaxt commented 2 years ago

While using DoT, I've seen many timeout messages for the "special" DNS responses from dnsbl.sorbs.net. (The DNS replies contain different loopback IPs to indicate whether an IP has been blacklisted or not, as described here: https://www.zytrax.com/books/dns/ch9/dnsbl.html)
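
For readers unfamiliar with the mechanism: a DNSBL lookup reverses the IP's octets, prepends them to the blocklist zone, and treats an answer in 127.0.0.0/8 as "listed" and NXDOMAIN as "not listed". The Go sketch below only illustrates that convention using the system resolver; it is not related to routedns itself, and 127.0.0.2 is just the conventional always-listed test entry most lists publish.

package main

import (
    "errors"
    "fmt"
    "net"
    "strings"
)

// dnsblQuery reverses the IPv4 octets, prepends them to the blocklist
// zone and looks the name up. An answer in 127.0.0.0/8 means "listed",
// NXDOMAIN means "not listed".
func dnsblQuery(ip, zone string) (listed bool, codes []string, err error) {
    octets := strings.Split(ip, ".")
    if len(octets) != 4 {
        return false, nil, fmt.Errorf("not an IPv4 address: %s", ip)
    }
    // 1.2.3.4 checked against <zone> becomes 4.3.2.1.<zone>
    name := fmt.Sprintf("%s.%s.%s.%s.%s", octets[3], octets[2], octets[1], octets[0], zone)
    codes, err = net.LookupHost(name)
    if err != nil {
        var dnsErr *net.DNSError
        if errors.As(err, &dnsErr) && dnsErr.IsNotFound {
            return false, nil, nil // NXDOMAIN: not listed
        }
        return false, nil, err
    }
    return len(codes) > 0, codes, nil
}

func main() {
    // 127.0.0.2 is the conventional test entry that most DNSBLs list.
    listed, codes, err := dnsblQuery("127.0.0.2", "dnsbl.sorbs.net")
    if err != nil {
        fmt.Println("lookup failed:", err)
        return
    }
    fmt.Println("listed:", listed, "return codes:", codes)
}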

Feb 12 15:50:08 localhost routedns[17371]: time="2022-02-12T15:50:08+01:00" level=error msg="failed to resolve" addr="127.0.0.1:5353" client=127.0.0.1 error="query for '24.27.254.178.dnsbl.sorbs.net.' timed out" id=local-doh-behind-proxy protocol=udp qname=24.27.254.178.dnsbl.sorbs.net.
Feb 12 16:07:50 localhost routedns[17371]: time="2022-02-12T16:07:50+01:00" level=error msg="failed to resolve" addr="127.0.0.1:5353" client=127.0.0.1 error="query for '40.97.237.109.dnsbl.sorbs.net.' timed out" id=local-doh-behind-proxy protocol=udp qname=40.97.237.109.dnsbl.sorbs.net.
Feb 12 16:07:51 localhost routedns[17371]: time="2022-02-12T16:07:51+01:00" level=error msg="failed to resolve" addr="127.0.0.1:5353" client=127.0.0.1 error="query for '40.97.237.109.dnsbl.sorbs.net.' timed out" id=local-doh-behind-proxy protocol=udp qname=40.97.237.109.dnsbl.sorbs.net.

These timeout messages are completely gone after switching to DoH.

Interesting. Would you be able to try to find a minimal configuration where you still see this issue, and maybe a way to trigger it (like the number of requests per minute or so)? Once I can reproduce this I might be able to sort it out.

I'll think about a suitable configuration to further investigate that problem.

cbuijs commented 2 years ago

Could it be throttling by SORBS cutting off some of the queries? DoH is "in session", meaning one session for everything. It looks like DoT uses a session per query, or at least more sessions than DoH, which maybe hits a throttling threshold at SORBS?

brotaxt commented 2 years ago

I notice a little delay when I query SORBS to check an IP address. I guess that's the reason for the timeout messages. Unfortunately, I couldn't find any parameter to define the response timeout in routedns.

... DoH is "in session", meaning one session for everything. It looks like DoT uses a session per query, or at least more sessions than DoH...

Is this really the case? I haven't noticed this behavior. Even with DoT, my system establishes one TCP session to the upstream resolver, which is then used the whole time.

cbuijs commented 2 years ago

... DoH is "in session", meaning one session for everything. It looks like DoT uses a session per query, or at least more sessions than DoH...

Is this really the case? I haven't noticed this behavior. Even with DoT, my system establishes one TCP session to the upstream resolver, which is then used the whole time.

Not sure. I do see (much) more "new connections" when using DoT than when using DoH. It might be a timeout/timing thing that differs between the two, or the DoH implementation might have "keep-alive" better implemented (as part of the transport; I am not an expert). In my setup, DoT has lower performance than DoH (DoH resolves about 30-50% faster).

When I look at DoT and DoH implementations (NGINX as an example), keepalive is always definable for DoH as it is part of the transport. Also, most DoH libraries/modules support RFC 7828 at the DNS layer. For DoT that seems to be less the case. I also have the feeling that development efforts all lean towards DoH, and of course HTTPS is more reliable and mature as well.

This is not really solving your issues :-). I just wanted to be informative and share some experiences! ;-)

brotaxt commented 2 years ago

This is not really solving your issues :-). I just wanted to be informative and share some experiences! ;-)

But it's very interesting anyway, thank you very much! :-)

brotaxt commented 2 years ago

The issue only occurs when using DoQ. DoH3 (DoH with QUIC as the transport protocol) works as intended. I've recompiled routedns with newer versions of the dependencies (lucas-clemente/quic-go and the others found in go.mod), but that didn't change anything :(

Maybe the problem has something to do with these issues in lucas-clemente/quic-go.

folbricht commented 2 years ago

It's likely somewhat related. Both of those tickets are effectively about the same underlying problem in the library: the inability to deal with closed connections/sessions after a timeout, which is especially bad in the context of DNS. I had to work around that problem in routedns, and it's quite possible that the issue you're seeing is related to an incomplete workaround on my part. I don't have time to debug this properly at the moment, but I will get back to it.
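
For illustration only, and not routedns's actual code: the workaround pattern described here boils down to lazily re-dialing the upstream when opening a stream on the cached QUIC connection fails, e.g. after an idle timeout. The Conn, Stream and dial names in this sketch are stand-ins, not the real quic-go types; it only captures the "drop and re-dial on failure" idea that the later "attempting new connection" log message hints at.

package doqretry

import "io"

// Stream and Conn are minimal stand-ins for a QUIC stream and connection.
type Stream interface {
    io.ReadWriteCloser
}

type Conn interface {
    OpenStream() (Stream, error)
}

// connPool caches one upstream connection and re-dials it when a
// stream can no longer be opened on it.
type connPool struct {
    dial func() (Conn, error) // how to establish a fresh upstream connection
    conn Conn                 // currently cached connection, possibly stale
}

func (p *connPool) openStream() (Stream, error) {
    if p.conn != nil {
        if s, err := p.conn.OpenStream(); err == nil {
            return s, nil
        }
        // The cached connection is unusable (closed or timed out); drop it.
        p.conn = nil
    }
    c, err := p.dial()
    if err != nil {
        return nil, err
    }
    p.conn = c
    return c.OpenStream()
}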

cbuijs commented 1 year ago

Not sure if it belongs here, but...

When I use doq as a listener, I get only empty responses.

In the error log I see this message for the doq queries coming in:

msg="failed to decode query" addr=":853" client="x.x.x.x:39189" error="dns: overflow unpacking uint16" id=doq protocol=doq

The query I am doing:

q -v @quic://xxx.xxxxxxxxx.com:853 www.paypal.com a
DEBU[0000] RR types: [A]                                
DEBU[0000] Using scheme: quic host: xxx.xxxxxxxxx.com port: 853 
DEBU[0000] Using QUIC transport                         
DEBU[0000] Dialing with QUIC ALPN tokens: [doq doq-i11] 
FATA[0000] empty response from xxx.xxxxxxxxx.com:853

(I am using "q" as the lookup tool; I highly recommend it: https://github.com/natesales/q)

When querying against dns.nextdns.io it works fine.

Using doq as a resolver (outgoing queries, to for example dns.nextdns.io) works fine. Between routedns servers it works fine too. But as soon as I use any other DoQ client (server or tool) against the routedns doq listener, it gives the above errors.

folbricht commented 1 year ago

There was a change in the DoQ spec fairly late that required the length to be prepended to the packet, similar to how it works over TCP. I think that's what's happening here. I thought routedns was updated to expect that length prefix. Perhaps your query tool isn't sending the prefix? I wonder if there's a way to confirm that.
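
As a hedged sketch of that framing (not routedns's or any particular tool's actual code, assuming github.com/miekg/dns for message packing): a DoQ client prepends the 2-octet big-endian length of the packed message before writing it to the QUIC stream, the same way DNS is framed over TCP.

package main

import (
    "bytes"
    "encoding/binary"
    "fmt"
    "io"

    "github.com/miekg/dns"
)

// writeDoQMessage packs a DNS message and prepends the 2-byte big-endian
// length field required by the final DoQ spec, mirroring DNS-over-TCP framing.
func writeDoQMessage(w io.Writer, msg *dns.Msg) error {
    packed, err := msg.Pack()
    if err != nil {
        return err
    }
    buf := make([]byte, 2+len(packed))
    binary.BigEndian.PutUint16(buf[:2], uint16(len(packed)))
    copy(buf[2:], packed)
    _, err = w.Write(buf)
    return err
}

func main() {
    // In a real client the writer would be a QUIC stream; a buffer is used
    // here only to show the framing.
    q := new(dns.Msg)
    q.SetQuestion("www.paypal.com.", dns.TypeA)
    var buf bytes.Buffer
    if err := writeDoQMessage(&buf, q); err != nil {
        panic(err)
    }
    fmt.Printf("framed message: %d bytes (the first two are the length prefix)\n", buf.Len())
}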

cbuijs commented 1 year ago

It works with other QUIC-based DNS servers and public servers (like Adguard and NextDNS) using the same q tool.

It does work with this tool (from the Adguard guy): https://github.com/ameshkov/dnslookup

dnslookup www.paypal.com quic://xxx.xxxxxxxxx.com
dnslookup v1.9.1
dnslookup result (elapsed 486.546875ms):
;; opcode: QUERY, status: NOERROR, id: 47465
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version 0; flags:; udp: 4096
; PADDING: 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

;; QUESTION SECTION:
;www.paypal.com.    IN   A

;; ANSWER SECTION:
www.paypal.com. 1400    IN  A   151.101.37.21

Which does the padding and the rest, I guess. It seems that RouteDNS is maybe too strict on this, mandating padding/length? Which I think is not a requirement but a recommendation? Since it works regardless with the public QUIC servers?

folbricht commented 1 year ago

If you have a recent version of dnslookup, it should add the 2-byte prefix as per https://github.com/AdguardTeam/dnsproxy/blob/f8f22ab752e825c655c5174becbfce0a4c430fdc/proxyutil/dns.go#L69-L76, and routedns should be able to understand that. If not, that'd be a bug. I could perhaps also make it accept the old format without the length prefix so it won't fail with older tools.
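
Again just a sketch under the same assumptions, not routedns's actual parser: the server side of that framing reads the 2-byte length first and then unpacks exactly that many bytes. A client that sends a raw, unprefixed message is one plausible way to end up with decode errors such as "dns: overflow unpacking uint16", since the first two bytes of the message get misread as a length.

package doqserver

import (
    "encoding/binary"
    "io"

    "github.com/miekg/dns"
)

// readDoQMessage reads one length-prefixed DNS message from a DoQ stream:
// 2 bytes of big-endian length, then that many bytes of DNS wire format.
func readDoQMessage(r io.Reader) (*dns.Msg, error) {
    var lenBuf [2]byte
    if _, err := io.ReadFull(r, lenBuf[:]); err != nil {
        return nil, err
    }
    payload := make([]byte, binary.BigEndian.Uint16(lenBuf[:]))
    if _, err := io.ReadFull(r, payload); err != nil {
        return nil, err
    }
    msg := new(dns.Msg)
    if err := msg.Unpack(payload); err != nil {
        // If the client sent an unprefixed message, the first two bytes
        // were misread as a length, so the payload here can be truncated
        // or misaligned and unpacking fails.
        return nil, err
    }
    return msg, nil
}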

cbuijs commented 1 year ago

It does, and it works indeed. Being "backwards compatible" might make sense. DNS standards change quite a lot in short periods; adoption/implementation of those changes doesn't happen at the same pace :-).

cbuijs commented 1 year ago

Not sure if it is the same issue, but with the latest build I see these messages popping up now; I never had them before. It seems to be load related (simulated with the Browserleaks DNS test).

msg="temporary fail when trying to open stream, attempting new connection" endpoint="x.x.x.x:853" error="Application error 0x0 (remote)" protocol=doq

This is my home RouteDNS instance forwarding to another one I host on the internet. Same versions/builds (latest).

That said, I think I am not losing any queries/responses.

folbricht commented 1 year ago

@cbuijs Pretty sure this is just a log message that shows up when an old connection has "expired" and a new one needs to be opened. I had to make some small changes there recently to upgrade the quic library, so the message may look new but it probably isn't really. You should be able to repro by making a query, waiting a couple of minutes (depending on the server-side timeout), then sending another query; I suspect you'll see that message.

That said, do you see a lot of open sockets from the process? I haven't been able to repro an actual leak yet but if there is one that'd be a priority to fix

cbuijs commented 1 year ago

The number of open sockets is pretty low, I would say. It is peaky, but not in numbers I would worry about. They also don't stay open for long. Feels normal.