Closed rootTHC closed 4 years ago
2.0.34-beta.1
Hi,
Your question is:
"What does it mean when dnscrypt-proxy
returns a response with the SERVFAIL
error code?"
Is that correct?
Hi,
Your question is:
"What does it mean when
dnscrypt-proxy
returns a response with theSERVFAIL
error code?"Is that correct?
No. My question is: Why does DCP stop working every few days and requires a manual restart?
Assuming that "DCP" stands for dnscrypt-proxy
.
Can you define "stop working", and list all the required steps to reliably reproduce the problem on a freshly installed operating system?
Assuming that "DCP" stands for
dnscrypt-proxy
.Can you define "stop working", and list all the required steps to reliably reproduce the problem on a freshly installed operating system?
DCP = dnscrypt-proxy
Reliabel steps: install as per instructions on EdgeOS. Wait a few hours or days. DNS will stop working. Logs files and tcpdump output as above.
Restart fixes the problem.
Assuming that "DCP" stands for
dnscrypt-proxy
.Can you define "stop working", and list all the required steps to reliably reproduce the problem on a freshly installed operating system?
definition of 'stop working': dig www.google.co.uk used to return the IP of www.google.co.uk. It does no longer. It 'stopped working'.
What is the minimum set of environmental changes required to trigger the transition from "dig www.google.co.uk
returns an IP address" to "dig www.google.co.uk
returns (something else?)"? (examples: "the external IP address of my router changes", "a server list is reloaded", "my ISP has a connectivity issue that prevents IPv6 from being used")
If it may help, dnscrypt-proxy
has the ability to log individual requests, as well as how they were processed.
In order to enable that feature, look for the [query_log]
section in the configuration file (this is disabled by default)..
Once a query log file has been defined, the file configured in this section will contain one line per received query, along with information such as:
PASS
FORWARD
DROP
REJECT
SYNTH
PARSE_ERROR
NXDOMAIN
RESPONSE_ERROR
SERVFAIL
NETWORK_ERROR
CLOAK
SERVER_TIMEOUT
SERVFAIL
means that the actual resolver wasn't able to resolve a name. This can be due to a misconfiguration of the zone, invalid DNSSEC signatures or a connectivity issue. If you believe the zone is correctly configured, using a different resolver may help.
NETWORK_ERROR
means that packets couldn't be sent to the server. It can be due to a local configuration issue, such as local firewall rules preventing the proxy from establishing a connection to some resolvers.
SERVER_TIMEOUT
means that a resolver failed to respond within a preconfigured time. The example timeouts from the default configuration should be large enough for most networks, but if it persists, choosing servers closer to your ISP may help. SERVER_TIMEOUT
may also mean that the resolver is having a temporary outage, and doesn't respond any more.
Other response return codes are not errors, but may help understand queries that are processed in a specific way, such as being forwarded to a dedicated resolver.
Hi,
that is clear and understood.
The change that triggers the fault: Waiting. No other change happens. Internet itself is working fine.
DCP goes through an automated re-selection process every 4 hours to find the 'best' DNS server to be used (the one with the lowest latency). Every once in a while and always after such a re-selection process, DCP stops working (e.g. dig www.google.co.uk returns SERVFAIL in the query log).
Restarting DCP solves the problem.
Is there any way how I can log which DNS server DCP is trying to use? (I can see from the log file that it is using v.dnscrypt.uk-ipv4 but that's not a valid domain name and neither is v.dnscrypt.uk - neither of them resolved into an IP address).
Can you confirm that www.google.co.uk
returns an IP address, then the list of servers is refreshed, and right after www.google.co.uk
returns a SERVFAIL
error code?
If refreshing has been configured to happen every 4 hour, do query log entries with a SERVFAIL
error code happen exactly 4 hour after the server was started?
The name of the server is v.dnscrypt.uk-ipv4
. If you suspect this server cannot resolve www.google.co.uk
, try confirming it by using it exclusively. You can then exclude it from your resolvers set.
v.dnscrypt.uk-ipv4
currently returns an IP address for www.google.co.uk
:
$ dnslookup www.google.co.uk sdns://AQcAAAAAAAAAEzEwNC4yMzguMTg2LjE5Mjo0NDMg7Uk9jOrXkGZPBjxHt5WaI2ktfJA2PJ5DzLWRe-W0HuUdMi5kbnNjcnlwdC1jZXJ0LnYuZG5zY3J5cHQudWs
www.google.co.uk. 0 IN A 216.58.204.227
But given that this name's TTL is low, temporary connectivity issues for v.dnscrypt.uk-ipv4
can easily return an error later.
Sorry, i was not clear with my error reporting.
My dnscrypt-proxy is running on 192.168.1.1. I hope by using IP addresses and domain name as illustration makes it easier to understand what I am reporting (the bug):
dig @192.168.1.1 www.google.co.uk is working fine for days. No problem here.
It is a default dnscrypt-proxy install. I have not changed the configuration that dnscrypt-proxy is going through a re-selection process every 4h. It's the default config.
However, to the bug: Every few days DCP stops working (e.g. dig @192.168.1.1 www.google.co.uk returns SERVFAIL whereas dig @8.8.8.8 www.google.co.uk is working fine).
The problem does not go automatically. Even after hours dig @192.168.1.1 www.google.co.uk wont work.
I use www.google.co.uk as an illustration/example. It's the same with ANY OTHER domain name.
A manual restart of DCP fixes the problem.
ohh, and v.dnscrypt.uk-ipv4 is used as an illustration as well. The bug is the same even when I limit the server list to 1.1.1.1 (cloudflare) only.
If it may help, dnscrypt-proxy
has the ability to log individual requests, as well as how they were processed.
In order to enable that feature, look for the [query_log] section in the configuration file (this is disabled by default)..
Once a query log file has been defined, the file configured in this section will contain one line per received query. Each line contains some information about received queries and how they were processed.
After the server list is refreshed and something else changes, some queries return an error code or no response at all. The query log will tell you if a response was sent, and give you an indication of the error that happened.
What changes need to be made before the list is refreshed, so that the responses for a given query stop being returned, or return a different error code? Try to find anything that differs between two different times the list is refreshed.
Examples include: "queries are logged with the NETWORK_ERROR
error code when the host uses more than 1024 descriptors" and "queries are not logged any more when the host is running out of memory".
If a previous version of dnscrypt-proxy
didn't exhibit that behavior, try to find what version introduced the change. And, from here, if possible, what exact commit.
All versions of dnscrypt-proxy i tried exhibit the same problem. I have already enabled query-log and posted my query log output together with my initial bug report.
I'm not sure you are helping as you suggest solutions which have already been tested or you suggest that the fault is somewhere else but not in dnscrypt-proxy. This is not the case as other DNS services are operating fine and and it is dnscrypt-proxy which does not. By this i mean: I can get dnscrypt-proxy into a state (see bug report) where dnscrypt-proxy shows PASS in the query-log (see bug report) but no IP address is returned. Any other name service I am running still works fine and returns the correct result for the queried domain name.
I used www.google.co.uk as an example for a domain name. This bug is not related to Google Corp.
I do not mind you closing the issue but the bug will remain and wont get fixed.
hmm. I may have found the problem:
I followed the documentation at https://github.com/DNSCrypt/dnscrypt-proxy/wiki/Installation-on-EdgeOS.
I noticed that when I query DCP directly (dig -p 5353 @192.168.1.1 www.google.co.uk) then DCP still works fine but not when I query it through dnsmasq. Again, please note that most domains still resolved correctly through dnsmasq but around 10% of them did not. In this example www.google.co.uk did not resolve when queried through dnsmasq but resolved fine when DCP was queried directly (-p 5353).
I got it working for all domains by removing 'dnssec' option from /etc/dnsmasq.conf.
It is unclear why www.google.co.uk would not resolve correctly when dnssec was set in the config (I have not investigated this any further).
Platform: EdgeOS system type : UBNT_E120 machine : Unknown processor : 0 cpu model : Cavium Octeon+ V0.1 BogoMIPS : 1000.00 wait instruction : yes microsecond timers : yes tlb_entries : 64 extra interrupt vector : yes hardware watchpoint : yes, count: 2, address/irw mask: [0x0ffc, 0x0ffb] isa : mips1 mips2 mips3 mips4 mips5 mips64r2
dnscrypt-proxy.err: […] [2019-12-08 18:36:49] [NOTICE] - 344ms publicarray-au2 [2019-12-08 18:36:49] [NOTICE] - 361ms publicarray-au2-doh [2019-12-08 18:36:49] [NOTICE] Server with the lowest initial latency: v.dnscrypt.uk-ipv4 (rtt: 12ms)
Query log shows: [2019-12-08 20:13:43] 127.0.0.1 www.google.co.uk A PASS 1ms -
Dig immediately returns and shows: dig www.google.co.uk ;; QUESTION SECTION: ;www.google.co.uk. IN A
;; Query time: 53 msec ;; SERVER: 192.168.1.1#53(192.168.1.1) ;; WHEN: Sun Dec 08 20:13:43 GMT 2019 ;; MSG SIZE rcvd: 45
strace -f -p: [pid 27459] n64_recvfrom() = -1 EAGAIN (Resource temporarily unavailable) […] [pid 25079] <... n64_futex resumed> ) = -1 ETIMEDOUT (Connection timed out) […] [pid 25079] n64_connect() = 0 [pid 25079] n64_write() = 1088 [pid 25079] n64_read() = -1 EAGAIN (Resource temporarily unavailable) [pid 25079] n64_futex( <unfinished ...> […]
tcpdump port 53: 20:34:55.313912 IP 192.168.1.198.55016 > 192.168.1.1.53: 61484+ [1au] A? www.google.co.uk. (45) 20:34:55.322269 IP 192.168.1.1.53 > 192.168.1.198.55016: 61484 ServFail 0/0/1 (45)
Sending SIGHUP makes no difference. Restarting dnscrypt-proxy solves the problem.