DNSCrypt / dnscrypt-proxy

dnscrypt-proxy 2 - A flexible DNS proxy, with support for encrypted DNS protocols.
https://dnscrypt.info
ISC License
11.45k stars 1.01k forks source link

DCP hangs every few days and needs a restart #1099

Closed rootTHC closed 4 years ago

rootTHC commented 4 years ago

Platform: EdgeOS system type : UBNT_E120 machine : Unknown processor : 0 cpu model : Cavium Octeon+ V0.1 BogoMIPS : 1000.00 wait instruction : yes microsecond timers : yes tlb_entries : 64 extra interrupt vector : yes hardware watchpoint : yes, count: 2, address/irw mask: [0x0ffc, 0x0ffb] isa : mips1 mips2 mips3 mips4 mips5 mips64r2

dnscrypt-proxy.err: […] [2019-12-08 18:36:49] [NOTICE] - 344ms publicarray-au2 [2019-12-08 18:36:49] [NOTICE] - 361ms publicarray-au2-doh [2019-12-08 18:36:49] [NOTICE] Server with the lowest initial latency: v.dnscrypt.uk-ipv4 (rtt: 12ms)

Query log shows: [2019-12-08 20:13:43] 127.0.0.1 www.google.co.uk A PASS 1ms -

Dig immediately returns and shows: dig www.google.co.uk ;; QUESTION SECTION: ;www.google.co.uk. IN A

;; Query time: 53 msec ;; SERVER: 192.168.1.1#53(192.168.1.1) ;; WHEN: Sun Dec 08 20:13:43 GMT 2019 ;; MSG SIZE rcvd: 45

strace -f -p: [pid 27459] n64_recvfrom() = -1 EAGAIN (Resource temporarily unavailable) […] [pid 25079] <... n64_futex resumed> ) = -1 ETIMEDOUT (Connection timed out) […] [pid 25079] n64_connect() = 0 [pid 25079] n64_write() = 1088 [pid 25079] n64_read() = -1 EAGAIN (Resource temporarily unavailable) [pid 25079] n64_futex( <unfinished ...> […]

tcpdump port 53: 20:34:55.313912 IP 192.168.1.198.55016 > 192.168.1.1.53: 61484+ [1au] A? www.google.co.uk. (45) 20:34:55.322269 IP 192.168.1.1.53 > 192.168.1.198.55016: 61484 ServFail 0/0/1 (45)

Sending SIGHUP makes no difference. Restarting dnscrypt-proxy solves the problem.

rootTHC commented 4 years ago

./dnscrypt-proxy -version

2.0.34-beta.1

jedisct1 commented 4 years ago

Hi,

Your question is:

"What does it mean when dnscrypt-proxy returns a response with the SERVFAIL error code?"

Is that correct?

rootTHC commented 4 years ago

Hi,

Your question is:

"What does it mean when dnscrypt-proxy returns a response with the SERVFAIL error code?"

Is that correct?

No. My question is: Why does DCP stop working every few days and requires a manual restart?

jedisct1 commented 4 years ago

Assuming that "DCP" stands for dnscrypt-proxy.

Can you define "stop working", and list all the required steps to reliably reproduce the problem on a freshly installed operating system?

rootTHC commented 4 years ago

Assuming that "DCP" stands for dnscrypt-proxy.

Can you define "stop working", and list all the required steps to reliably reproduce the problem on a freshly installed operating system?

DCP = dnscrypt-proxy

Reliabel steps: install as per instructions on EdgeOS. Wait a few hours or days. DNS will stop working. Logs files and tcpdump output as above.

Restart fixes the problem.

rootTHC commented 4 years ago

Assuming that "DCP" stands for dnscrypt-proxy.

Can you define "stop working", and list all the required steps to reliably reproduce the problem on a freshly installed operating system?

definition of 'stop working': dig www.google.co.uk used to return the IP of www.google.co.uk. It does no longer. It 'stopped working'.

jedisct1 commented 4 years ago

What is the minimum set of environmental changes required to trigger the transition from "dig www.google.co.uk returns an IP address" to "dig www.google.co.uk returns (something else?)"? (examples: "the external IP address of my router changes", "a server list is reloaded", "my ISP has a connectivity issue that prevents IPv6 from being used")

jedisct1 commented 4 years ago

If it may help, dnscrypt-proxy has the ability to log individual requests, as well as how they were processed.

In order to enable that feature, look for the [query_log] section in the configuration file (this is disabled by default)..

Once a query log file has been defined, the file configured in this section will contain one line per received query, along with information such as:

SERVFAIL means that the actual resolver wasn't able to resolve a name. This can be due to a misconfiguration of the zone, invalid DNSSEC signatures or a connectivity issue. If you believe the zone is correctly configured, using a different resolver may help.

NETWORK_ERROR means that packets couldn't be sent to the server. It can be due to a local configuration issue, such as local firewall rules preventing the proxy from establishing a connection to some resolvers.

SERVER_TIMEOUT means that a resolver failed to respond within a preconfigured time. The example timeouts from the default configuration should be large enough for most networks, but if it persists, choosing servers closer to your ISP may help. SERVER_TIMEOUT may also mean that the resolver is having a temporary outage, and doesn't respond any more.

Other response return codes are not errors, but may help understand queries that are processed in a specific way, such as being forwarded to a dedicated resolver.

rootTHC commented 4 years ago

Hi,

that is clear and understood.

The change that triggers the fault: Waiting. No other change happens. Internet itself is working fine.

DCP goes through an automated re-selection process every 4 hours to find the 'best' DNS server to be used (the one with the lowest latency). Every once in a while and always after such a re-selection process, DCP stops working (e.g. dig www.google.co.uk returns SERVFAIL in the query log).

Restarting DCP solves the problem.

Is there any way how I can log which DNS server DCP is trying to use? (I can see from the log file that it is using v.dnscrypt.uk-ipv4 but that's not a valid domain name and neither is v.dnscrypt.uk - neither of them resolved into an IP address).

jedisct1 commented 4 years ago

Can you confirm that www.google.co.uk returns an IP address, then the list of servers is refreshed, and right after www.google.co.uk returns a SERVFAIL error code?

If refreshing has been configured to happen every 4 hour, do query log entries with a SERVFAIL error code happen exactly 4 hour after the server was started?

The name of the server is v.dnscrypt.uk-ipv4. If you suspect this server cannot resolve www.google.co.uk, try confirming it by using it exclusively. You can then exclude it from your resolvers set.

jedisct1 commented 4 years ago

v.dnscrypt.uk-ipv4 currently returns an IP address for www.google.co.uk:

$ dnslookup www.google.co.uk sdns://AQcAAAAAAAAAEzEwNC4yMzguMTg2LjE5Mjo0NDMg7Uk9jOrXkGZPBjxHt5WaI2ktfJA2PJ5DzLWRe-W0HuUdMi5kbnNjcnlwdC1jZXJ0LnYuZG5zY3J5cHQudWs
www.google.co.uk.   0   IN  A   216.58.204.227

But given that this name's TTL is low, temporary connectivity issues for v.dnscrypt.uk-ipv4 can easily return an error later.

rootTHC commented 4 years ago

Sorry, i was not clear with my error reporting.

My dnscrypt-proxy is running on 192.168.1.1. I hope by using IP addresses and domain name as illustration makes it easier to understand what I am reporting (the bug):

dig @192.168.1.1 www.google.co.uk is working fine for days. No problem here.

It is a default dnscrypt-proxy install. I have not changed the configuration that dnscrypt-proxy is going through a re-selection process every 4h. It's the default config.

However, to the bug: Every few days DCP stops working (e.g. dig @192.168.1.1 www.google.co.uk returns SERVFAIL whereas dig @8.8.8.8 www.google.co.uk is working fine).

The problem does not go automatically. Even after hours dig @192.168.1.1 www.google.co.uk wont work.

I use www.google.co.uk as an illustration/example. It's the same with ANY OTHER domain name.

A manual restart of DCP fixes the problem.

rootTHC commented 4 years ago

ohh, and v.dnscrypt.uk-ipv4 is used as an illustration as well. The bug is the same even when I limit the server list to 1.1.1.1 (cloudflare) only.

jedisct1 commented 4 years ago

If it may help, dnscrypt-proxy has the ability to log individual requests, as well as how they were processed.

In order to enable that feature, look for the [query_log] section in the configuration file (this is disabled by default)..

Once a query log file has been defined, the file configured in this section will contain one line per received query. Each line contains some information about received queries and how they were processed.

After the server list is refreshed and something else changes, some queries return an error code or no response at all. The query log will tell you if a response was sent, and give you an indication of the error that happened.

What changes need to be made before the list is refreshed, so that the responses for a given query stop being returned, or return a different error code? Try to find anything that differs between two different times the list is refreshed.

jedisct1 commented 4 years ago

Examples include: "queries are logged with the NETWORK_ERROR error code when the host uses more than 1024 descriptors" and "queries are not logged any more when the host is running out of memory".

jedisct1 commented 4 years ago

If a previous version of dnscrypt-proxy didn't exhibit that behavior, try to find what version introduced the change. And, from here, if possible, what exact commit.

rootTHC commented 4 years ago

All versions of dnscrypt-proxy i tried exhibit the same problem. I have already enabled query-log and posted my query log output together with my initial bug report.

I'm not sure you are helping as you suggest solutions which have already been tested or you suggest that the fault is somewhere else but not in dnscrypt-proxy. This is not the case as other DNS services are operating fine and and it is dnscrypt-proxy which does not. By this i mean: I can get dnscrypt-proxy into a state (see bug report) where dnscrypt-proxy shows PASS in the query-log (see bug report) but no IP address is returned. Any other name service I am running still works fine and returns the correct result for the queried domain name.

I used www.google.co.uk as an example for a domain name. This bug is not related to Google Corp.

I do not mind you closing the issue but the bug will remain and wont get fixed.

rootTHC commented 4 years ago

hmm. I may have found the problem:

I followed the documentation at https://github.com/DNSCrypt/dnscrypt-proxy/wiki/Installation-on-EdgeOS.

I noticed that when I query DCP directly (dig -p 5353 @192.168.1.1 www.google.co.uk) then DCP still works fine but not when I query it through dnsmasq. Again, please note that most domains still resolved correctly through dnsmasq but around 10% of them did not. In this example www.google.co.uk did not resolve when queried through dnsmasq but resolved fine when DCP was queried directly (-p 5353).

I got it working for all domains by removing 'dnssec' option from /etc/dnsmasq.conf.

It is unclear why www.google.co.uk would not resolve correctly when dnssec was set in the config (I have not investigated this any further).