brotaxt opened 2 years ago
The debug/trace logs from routedns don't show any inconsistencies.
For what purpose does routedns create these udp6 sockets? The connection to the upstream resolver (NextDNS) is established via IPv4.
This was reported before but I wasn't able to reproduce it. Perhaps we can debug it a bit more this time.
The IPv6 sockets could come from your listener, or from Linux using them for outbound queries (I believe those handle IPv4 as well).
Is your server under heavy load? Are there failures that could perhaps leave sockets behind? I'd like to find a way to narrow it down a bit.
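One way to narrow it down: a minimal sketch, assuming a Linux host (the countSockets helper below is purely illustrative, not part of routedns), that counts the entries in /proc/net/udp and /proc/net/udp6. Running it periodically while reproducing should show whether the socket count actually grows over time.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
)

// countSockets returns the number of socket entries in a /proc/net table
// (one header line, then one line per socket).
func countSockets(path string) (int, error) {
	f, err := os.Open(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()
	n := -1 // skip the header line
	s := bufio.NewScanner(f)
	for s.Scan() {
		n++
	}
	return n, s.Err()
}

func main() {
	for _, p := range []string{"/proc/net/udp", "/proc/net/udp6"} {
		n, err := countSockets(p)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		fmt.Printf("%s: %d sockets\n", p, n)
	}
}
```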
I've changed the loopback addresses from IPv6 (::1) to IPv4 (127.0.0.1) and tested DNS over HTTPS (QUIC).
Unfortunately, neither change resolved the many open UDP sockets.
Even the mentioned log messages don't seem to have anything to do with the socket problem. I've seen multiple open sockets while not a single error was logged.
The system load is very low the whole time.
TCP-based protocols (DoT or DoH) don't show this behavior. No v6 sockets are used with TCP.
Interesting. Would you be able to try and find a minimal configuration where you still see this issue, and maybe a way to trigger it (like # of requests per min or so). Once I can reproduce this I might be able to sort it out.
I had similar issues using DoQ with NextDNS. I switched to DoH and that worked much better and more reliably (TCP). In the end I just gave up on DoQ; it feels immature.
Example:
# Uncomment below to prevent DNS leakage via system resolver
#[bootstrap-resolver]
#address = "https://dns.nextdns.io:443/a6xxxx"
#protocol = "doh"
#bootstrap-address = "45.90.28.0"
[resolvers.nextdns]
address = "https://dns.nextdns.io:443/a6xxxx"
protocol = "doh"
Also had some good results (using DoQ) by upping the net.core.rmem_max value, as UDP buffering could be the culprit. Just run sysctl -w net.core.rmem_max=2500000 before starting RouteDNS; the buffer size is a commonly reported issue when using DoQ.
Also see some hints and tips here: https://github.com/lucas-clemente/quic-go/issues
And here: https://caddy.community/t/udp-receive-buffer-size-quic/11244
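For what it's worth, here is a minimal sketch (assuming a Linux host; the 2500000 figure is just the value from the sysctl command above, check the quic-go wiki for current advice) that compares the running value against that recommendation before you start RouteDNS:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// Assumed threshold, taken from the sysctl command above; the current
// quic-go recommendation may differ.
const recommended = 2500000

func main() {
	data, err := os.ReadFile("/proc/sys/net/core/rmem_max")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	cur, err := strconv.Atoi(strings.TrimSpace(string(data)))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("net.core.rmem_max = %d\n", cur)
	if cur < recommended {
		fmt.Printf("below the ~%d bytes suggested for QUIC; raise it with sysctl\n", recommended)
	}
}
```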
@cbuijs I am having this issue with DoQ as well, but from my own servers.
Can you change to DoH? It just works better.
I had already raised rmem_max to 16777216
These were my kernel parameters when I encountered the socket problem.
## raise the udp buffers
net.ipv4.udp_rmem_min = 16384
net.ipv4.udp_wmem_min = 16384
#
## 16MB per socket - which sounds like a lot, but will virtually never consume that much.
#
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.core.optmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 87380 16777216
While using DoT, I've seen many timeout messages for the "special" DNS responses from dnsbl.sorbs.net. (DNS replies contain different loopback IPs to indicate whether an IP has been blacklisted, as described here: https://www.zytrax.com/books/dns/ch9/dnsbl.html)
Feb 12 15:50:08 localhost routedns[17371]: time="2022-02-12T15:50:08+01:00" level=error msg="failed to resolve" addr="127.0.0.1:5353" client=127.0.0.1 error="query for '24.27.254.178.dnsbl.sorbs.net.' timed out" id=local-doh-behind-proxy protocol=udp qname=24.27.254.178.dnsbl.sorbs.net.
Feb 12 16:07:50 localhost routedns[17371]: time="2022-02-12T16:07:50+01:00" level=error msg="failed to resolve" addr="127.0.0.1:5353" client=127.0.0.1 error="query for '40.97.237.109.dnsbl.sorbs.net.' timed out" id=local-doh-behind-proxy protocol=udp qname=40.97.237.109.dnsbl.sorbs.net.
Feb 12 16:07:51 localhost routedns[17371]: time="2022-02-12T16:07:51+01:00" level=error msg="failed to resolve" addr="127.0.0.1:5353" client=127.0.0.1 error="query for '40.97.237.109.dnsbl.sorbs.net.' timed out" id=local-doh-behind-proxy protocol=udp qname=40.97.237.109.dnsbl.sorbs.net.
These timeout messages are completely gone after switching to DoH.
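For reference, the DNSBL mechanism described above is easy to reproduce standalone. The sketch below (dnsblQuery is a hypothetical helper, not routedns code) reverses the octets and resolves against dnsbl.sorbs.net; an answer in 127.0.0.0/8 means listed, NXDOMAIN means not listed:

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// dnsblQuery builds the reversed-octet name used by DNSBLs such as
// dnsbl.sorbs.net and checks whether it resolves. An answer (typically
// 127.0.0.x) means the IP is listed; NXDOMAIN means it is not.
func dnsblQuery(ip string, zone string) (listed bool, answers []string, err error) {
	octets := strings.Split(ip, ".")
	if len(octets) != 4 {
		return false, nil, fmt.Errorf("not an IPv4 address: %s", ip)
	}
	// Reverse the octets: 178.254.27.24 -> 24.27.254.178 (matches the
	// qname seen in the timeout logs above).
	name := fmt.Sprintf("%s.%s.%s.%s.%s", octets[3], octets[2], octets[1], octets[0], zone)
	addrs, err := net.LookupHost(name)
	if err != nil {
		// NXDOMAIN simply means "not listed".
		if dnsErr, ok := err.(*net.DNSError); ok && dnsErr.IsNotFound {
			return false, nil, nil
		}
		return false, nil, err
	}
	return len(addrs) > 0, addrs, nil
}

func main() {
	listed, addrs, err := dnsblQuery("178.254.27.24", "dnsbl.sorbs.net")
	fmt.Println(listed, addrs, err)
}
```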
> Interesting. Would you be able to try and find a minimal configuration where you still see this issue, and maybe a way to trigger it (like # of requests per min or so). Once I can reproduce this I might be able to sort it out.
I'll think about a suitable configuration to further investigate that problem.
Could it be throttling by SORBS cutting off some of the queries? DoH is "in session", meaning one session for all. It looks like DoT uses a session per query, or at least more sessions than DoH, which maybe hits a throttling threshold at SORBS?
I notice a small delay when I query SORBS to check an IP address. I guess that's the reason for the timeout messages. Unfortunately I couldn't find any parameter to define the response timeout in routedns.
> ... DoH is "in session", meaning one session for all. It looks like DoT uses a session per query, or at least more sessions than DoH ...

Is this really the case? I haven't noticed this behavior. Even with DoT, my system establishes one TCP session to the upstream resolver, which is then used the whole time.
> ... DoH is "in session", meaning one session for all. It looks like DoT uses a session per query, or at least more sessions than DoH ...
> Is this really the case? I haven't noticed this behavior. Even with DoT, my system establishes one TCP session to the upstream resolver, which is then used the whole time.
Not sure. I do see (much) more "new connections" when using DoT than when using DoH. It might be a timeout/timing thing that differs between the two, or the DoH implementation might have keep-alive implemented better (as part of the transport; I am not an expert). In my setup, DoT has lower performance than DoH (DoH resolves about 30-50% faster).
When I look at DoT and DoH implementations (NGINX as an example), keepalive is always definable as part of DoH, since it is part of the transport. Also, most DoH libraries/modules support RFC 7828 as part of the DNS layer. For DoT that seems to be less the case. I also have the feeling that development efforts all lean towards DoH, and of course HTTPS is more reliable and mature as well.
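To illustrate why DoH gets keep-alive "for free": with Go's net/http, a single shared client reuses idle TLS connections across queries, so repeated lookups ride the same session. This is only a sketch of the effect, not routedns's implementation; the endpoint is Cloudflare's public DoH server and the wire-format query is hand-built:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"time"
)

// A single shared client: net/http keeps idle HTTPS connections open and
// reuses them, which is the keep-alive-as-part-of-the-transport effect.
var dohClient = &http.Client{
	Transport: &http.Transport{
		MaxIdleConnsPerHost: 2,
		IdleConnTimeout:     90 * time.Second,
	},
	Timeout: 5 * time.Second,
}

// dohQuery POSTs a raw wire-format DNS message (RFC 8484 style).
func dohQuery(url string, msg []byte) ([]byte, error) {
	resp, err := dohClient.Post(url, "application/dns-message", bytes.NewReader(msg))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}

func main() {
	// Hand-built wire-format query for "example.com. IN A" (ID 0x1234, RD set).
	query := []byte{
		0x12, 0x34, 0x01, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
		7, 'e', 'x', 'a', 'm', 'p', 'l', 'e', 3, 'c', 'o', 'm', 0,
		0x00, 0x01, 0x00, 0x01,
	}
	for i := 0; i < 3; i++ {
		resp, err := dohQuery("https://cloudflare-dns.com/dns-query", query)
		fmt.Println(len(resp), err) // repeated calls reuse the same TLS connection
	}
}
```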
This is not really solving your issues :-). Just wanted to be informative and share some experiences! ;-).
> This is not really solving your issues :-). Just wanted to be informative and share some experiences! ;-).
But it's very interesting anyway, thank you very much! :-)
The issue only occurs when using DoQ. DoH3 (DoH with QUIC as the transport protocol) works as intended. I've recompiled routedns with newer versions of the dependencies (lucas-clemente/quic-go and the other ones found in go.mod), but that didn't change anything :(
Maybe the problem has something to do with these issues in lucas-clemente/quic-go
It's likely somewhat related. Both of those tickets are effectively about the same underlying problem in the library: the inability to deal with closed connections/sessions after a timeout, which is especially bad in the context of DNS. I had to work around that problem in routedns, so it's quite possible that the issue you're seeing is related to an incomplete workaround on my part. I don't have time to debug this properly at the moment, but I will get back to it.
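For context, the workaround is essentially a reconnect-on-failure pattern. The sketch below uses hypothetical conn/stream interfaces (the real quic-go API differs) to show the shape of it: if opening a stream on the cached connection fails, dial a fresh connection once and retry rather than surfacing the stale-connection error to the caller.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical minimal interfaces standing in for a QUIC library; they only
// sketch the pattern, not the real quic-go API.
type conn interface {
	OpenStream() (stream, error)
}
type stream interface {
	Exchange(query []byte) ([]byte, error)
}

type doqClient struct {
	dial func() (conn, error) // establishes a fresh QUIC connection
	cur  conn
}

// query opens a stream on the cached connection; if that fails (e.g. the
// server idle-timed the connection out), it dials a new connection once
// and retries instead of returning the stale-connection error.
func (c *doqClient) query(q []byte) ([]byte, error) {
	if c.cur == nil {
		if err := c.reconnect(); err != nil {
			return nil, err
		}
	}
	s, err := c.cur.OpenStream()
	if err != nil {
		if err = c.reconnect(); err != nil {
			return nil, err
		}
		if s, err = c.cur.OpenStream(); err != nil {
			return nil, err
		}
	}
	return s.Exchange(q)
}

func (c *doqClient) reconnect() error {
	nc, err := c.dial()
	if err != nil {
		return fmt.Errorf("redial failed: %w", err)
	}
	c.cur = nc
	return nil
}

func main() {
	c := &doqClient{dial: func() (conn, error) { return nil, errors.New("no dialer wired up in this sketch") }}
	_, err := c.query([]byte{0x12, 0x34})
	fmt.Println(err)
}
```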
Not sure if it belongs here, but...
When I use doq as a listener, I get only empty responses. In the error log I see this message when doq queries come in:
msg="failed to decode query" addr=":853" client="x.x.x.x:39189" error="dns: overflow unpacking uint16" id=doq protocol=doq
The query I am doing:
q -v @quic://xxx.xxxxxxxxx.com:853 www.paypal.com a
DEBU[0000] RR types: [A]
DEBU[0000] Using scheme: quic host: xxx.xxxxxxxxx.com port: 853
DEBU[0000] Using QUIC transport
DEBU[0000] Dialing with QUIC ALPN tokens: [doq doq-i11]
FATA[0000] empty response from xxx.xxxxxxxxx.com:853
(I am using q as the lookup tool; I highly recommend it: https://github.com/natesales/q)
When querying against dns.nextdns.io it works fine.
Using doq as a resolver (outgoing queries, for example to dns.nextdns.io) works fine. Between routedns servers it works fine too. As soon as I use any DoQ resolver (server or tool) against it, it gives the above errors.
There was a change in the doq spec fairly late that required the length to be prepended to the packet, similar to how it works over TCP. I think that's what's happening here. I thought routedns was updated to expect that length prefix. Perhaps your query tool isn't sending the prefix? Wonder if there's a way to confirm that.
It works with other QUIC-based DNS servers or public servers (like Adguard and NextDNS) using the same q tool.
It does work with this tool (from the Adguard guy): https://github.com/ameshkov/dnslookup
dnslookup www.paypal.com quic://xxx.xxxxxxxxx.com
dnslookup v1.9.1
dnslookup result (elapsed 486.546875ms):
;; opcode: QUERY, status: NOERROR, id: 47465
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version 0; flags:; udp: 4096
; PADDING: 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
;; QUESTION SECTION:
;www.paypal.com. IN A
;; ANSWER SECTION:
www.paypal.com. 1400 IN A 151.101.37.21
Which does the padding and the rest, I guess. It seems RouteDNS is maybe too strict here and mandates the padding/length? Which I think is not a requirement but a recommendation? As it works regardless with the public QUIC servers?
If you have a recent version of dnslookup it should add the 2-byte prefix as per https://github.com/AdguardTeam/dnsproxy/blob/f8f22ab752e825c655c5174becbfce0a4c430fdc/proxyutil/dns.go#L69-L76 And routedns should be able to understand that. If not that'd be a bug. I could also make it allow the old format without the length-prefix perhaps so it won't fail with older tools
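For anyone hitting the "dns: overflow unpacking uint16" error above: the final DoQ spec (RFC 9250) frames each message with a 2-byte big-endian length, the same way DNS over TCP does. Here is a minimal sketch of that framing; addPrefix/stripPrefix are illustrative helpers, not routedns's actual functions:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// addPrefix prepends the 2-byte big-endian length required by the final
// DoQ spec (RFC 9250), the same framing DNS uses over TCP.
func addPrefix(msg []byte) []byte {
	out := make([]byte, 2+len(msg))
	binary.BigEndian.PutUint16(out, uint16(len(msg)))
	copy(out[2:], msg)
	return out
}

// stripPrefix removes the length prefix, returning an error for short or
// inconsistent frames. Decode failures like "overflow unpacking uint16"
// can mean the peer never sent this prefix at all.
func stripPrefix(frame []byte) ([]byte, error) {
	if len(frame) < 2 {
		return nil, fmt.Errorf("frame too short: %d bytes", len(frame))
	}
	n := binary.BigEndian.Uint16(frame)
	if int(n) != len(frame)-2 {
		return nil, fmt.Errorf("length prefix %d does not match payload %d", n, len(frame)-2)
	}
	return frame[2:], nil
}

func main() {
	msg := []byte{0x12, 0x34, 0x01, 0x00} // example bytes, not a full DNS message
	framed := addPrefix(msg)
	fmt.Printf("framed: % x\n", framed)
	if payload, err := stripPrefix(framed); err == nil {
		fmt.Printf("payload: % x\n", payload)
	}
}
```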
It does, and it works indeed. Being "backwards compatible" might make sense. DNS standards change quite a lot in short periods; adoption/implementation of those changes does not move at the same pace :-).
Not sure if it is the same, but with the latest build I see these messages popping up now; I never had them before. It seems to be load related (simulated with the Browserleaks DNS test).
msg="temporary fail when trying to open stream, attempting new connection" endpoint="x.x.x.x:853" error="Application error 0x0 (remote)" protocol=doq
This is my home RouteDNS instance forwarding to another one I host on internet. Same versions/builds (latest).
That said, I think I am not losing any queries/responses.
@cbuijs Pretty sure this is just a log message that shows up when an old connection has "expired" and a new one needs to be opened. I had to make some small changes there to upgrade the quic library recently, so the message may look new, but it probably isn't. You should be able to repro by making a query, waiting a couple of minutes (depending on the server-side timeout), then sending another query; I suspect you'll see that message.
That said, do you see a lot of open sockets from the process? I haven't been able to repro an actual leak yet, but if there is one, that'd be a priority to fix.
The number of open sockets is pretty low, I would say. It is peaky, but not in numbers I would worry about. They also don't stay open for long. Feels normal.
Hey,
I'm seeing a huge number of the following error messages in my routedns logs.
After some time the system shows a huge number of unused open UDP sockets that don't get closed.
Do you have any idea where these issues could come from?
These odd dnsbl.sorbs.net queries are coming from the installed mailserver. (Description: https://www.spamhaus.org/faq/section/DNSBL%20Usage)
My configuration:
/opt/routedns/config.toml
/etc/dnsmasq.d/01-pihole.conf
/etc/systemd/system/routedns.service