aarond10 / https_dns_proxy

A lightweight DNS-over-HTTPS proxy.
MIT License
775 stars 114 forks source link

https-dns-proxy will sometimes stop responding to any queries, restart is needed to fix #92

Closed Nowaker closed 3 years ago

Nowaker commented 3 years ago

https-dns-proxy will sometimes stop responding to any queries. The response will be typical for a non-functioning DNS server, e.g. from command line using host tool ;; connection timed out; no servers could be reached. I will also see a ton of queries coming to my dnsmasq and forwarded to https-dns-proxy without an answer. /etc/init.d/https-dns-proxy restart immediately fixes the problem.

My https-dns-proxy works in a multi-wan environment. I have multiple WANs configured in failover configuration with mwan3 tool. I also have VPNs, and if a tunnel is established after a physical interface goes up, routing table will change and pass traffic through the VPN tunnel. However, I could not correlate an mwan3 interface up/down event with https-dns-proxy going down as a consequence - but I'm providing the information about mwan3 as a note. All I see is https-dns-proxy will randomly stop responding to queries.

Nowaker commented 3 years ago

I've increased the verbosity level to 3. When this happens again, I'll report what I'm seeing.

Nowaker commented 3 years ago

It happened again! This is what happens internally:

Sep 23 18:19:05 nwkr-omnia dnsmasq[15687]: 56614 SNIPPED/53055 query[A] storage.googleapis.com from SNIPPED
Sep 23 18:19:05 nwkr-omnia dnsmasq[15687]: 56614 SNIPPED/53055 forwarded storage.googleapis.com to 127.0.0.1
Sep 23 18:19:05 nwkr-omnia dnsmasq[15687]: 56614 SNIPPED/53055 forwarded storage.googleapis.com to 127.0.0.1
Sep 23 23:19:05 nwkr-omnia https-dns-proxy[4166]: > POST /dns-query HTTP/1.1
Sep 23 23:19:05 nwkr-omnia https-dns-proxy[4166]: Host: 1.1.1.1
Sep 23 23:19:05 nwkr-omnia https-dns-proxy[4166]: User-Agent: dns-to-https-proxy/0.2
Sep 23 23:19:05 nwkr-omnia https-dns-proxy[4166]: Accept: application/dns-message
Sep 23 23:19:05 nwkr-omnia https-dns-proxy[4166]: Content-Type: application/dns-message
Sep 23 23:19:05 nwkr-omnia https-dns-proxy[4166]: Content-Length: 31
Sep 23 23:19:05 nwkr-omnia https-dns-proxy[4166]: 
Sep 23 23:19:05 nwkr-omnia https-dns-proxy[4166]: * Operation timed out after 206 milliseconds with 0 bytes received
Sep 23 23:19:05 nwkr-omnia https-dns-proxy[4166]: > POST /dns-query HTTP/1.1
Sep 23 23:19:05 nwkr-omnia https-dns-proxy[4166]: Host: 1.1.1.1
Sep 23 23:19:05 nwkr-omnia https-dns-proxy[4166]: User-Agent: dns-to-https-proxy/0.2
Sep 23 23:19:05 nwkr-omnia https-dns-proxy[4166]: Accept: application/dns-message
Sep 23 23:19:05 nwkr-omnia https-dns-proxy[4166]: Content-Type: application/dns-message
Sep 23 23:19:05 nwkr-omnia https-dns-proxy[4166]: Content-Length: 42
Sep 23 23:19:05 nwkr-omnia https-dns-proxy[4166]: 
Sep 23 23:19:05 nwkr-omnia https-dns-proxy[4166]: * Operation timed out after 524 milliseconds with 0 bytes received

Cloudflare doesn't block me in any way. I can query the server just fine with curl:

root@nwkr-omnia ~ # curl -H 'accept: application/dns-json' 'https://1.1.1.1/dns-query?name=www.potaroo.net&type=A'
{"Status":0,"TC":false,"RD":true,"RA":true,"AD":true,"CD":false,"Question":[{"name":"www.potaroo.net","type":1}],"Answer":[{"name":"www.potaroo.net","type":1,"TTL":4097,"data":"203.133.248.108"}]}

Definitely something on the proxy side is borked. For now, I'll have to resort to some cron automation to perform /etc/init.d/https-dns-proxy restart if it determined the DNS to be down, but having a permafix would be lovely!

Nowaker commented 3 years ago

I've come across this commit: https://github.com/aarond10/https_dns_proxy/commit/7a825c1f4e5f26915845823fb50dba1bb601a241. @aarond10 Does this ticket sound like something already resolved?

Nowaker commented 3 years ago

I've compiled and installed the latest the head version on my router. I'll report if/when the same problem happens again.

aarond10 commented 3 years ago

Thanks for the reports! I'm not sure if I've solved it with that commit but that was the intention.

I had noticed some weird failures coming out of libcurl also and it led to periods of intermittent resolution failure. It looked like part of curl was rejecting requestes because the resolved IP address pinned in the resolver cache didn't match open sessions. I didn't have enough state (or time) to figure out if I was "holding if wrong" or there was a bug in curl around reused sessions and DNS resolution.

It seems to be better (restarting sessions on IP changes - which is all the time due to DNS round robin). I'm not sure if it's fixed though. Would love more data points.

On Fri, 25 Sep 2020 at 05:43, Damian Nowak notifications@github.com wrote:

I've compiled and installed the latest the head version on my router. I'll report if/when the same problem happens again.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/aarond10/https_dns_proxy/issues/92#issuecomment-698549785, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABTOXSFV7JNONOPEVI7UGDSHOOM5ANCNFSM4RXHBG3A .

Nowaker commented 3 years ago

20 hours later and no sign of this problem. I would have encountered it by now. I'll hold off from closing the ticket but I think we're in a good place!

aarond10 commented 3 years ago

I just noticed you're host is '1.1.1.1' above so DNS shouldn't have changed anything. I'm still not quite sure why this solved your problem, but happy it did. :) There are clearly some weird things going on for some people and I suspect that it has to do with libcurl but haven't found any smoking guns yet.

Nowaker commented 3 years ago

Welp, it happened again! I had debug mode off this time so I was in the blue and the only thing I could do was to restart, which immediately fixed the problem. And you're right, I explicitly hardcoded 1.1.1.1 so the mentioned commit couldn't have fixed anything for me in the first place.

Enabling debug mode for https-dns-proxy isn't ideal for me. With this setting on, I'm running out of space pretty quick. 36 hours or so, and boom, ENOSPC. Logs are kept in memory on embedded devices to prevent excessive wear on SD cards. Even with 2 GB RAM on my beefy router, it's still not enough to keep up with the volume of logs coming from https-dns-proxy.

Here's a suggestion: can we have an option called "verbose on errors"? This is a typical approach in modern software - quiet on success, verbose on error. Do you think you could do it, so we could track down this bug?

Nowaker commented 3 years ago

@aarond10 Can you respond and reopen? Thanks!

aarond10 commented 3 years ago

Ack. I've been pretty busy over past month or so. Sorry. Will try to get to this in coming week or so.

aarond10 commented 3 years ago

I think 37511cc0 will fix this.

aarond10 commented 3 years ago

Duplicate of #95