DNSCrypt / dnscrypt-proxy

dnscrypt-proxy 2 - A flexible DNS proxy, with support for encrypted DNS protocols.
https://dnscrypt.info
ISC License
11.29k stars 1.01k forks source link

High CPU usage and unresponsive on raspberry pi #310

Closed eroglyph closed 6 years ago

eroglyph commented 6 years ago

Running version 2.0.8

After about an hour or so of running perfectly on my raspberry pi with pihole, the dnscrypt-proxy process will spike to 100% cpu and refuse to resolve addresses until I restart the service. I ran an strace and see these messages scrolling rapidly:

clock_gettime(CLOCK_MONOTONIC, {tv_sec=167746, tv_nsec=153292986}) = 0 futex(0x1112d4d0, FUTEX_WAKE, 1) = 1 clock_gettime(CLOCK_MONOTONIC, {tv_sec=167746, tv_nsec=156834847}) = 0 futex(0x60bf14, FUTEX_WAIT, 0, {tv_sec=1, tv_nsec=680652141}) = -1 ETIMEDOUT (Connection timed out) clock_gettime(CLOCK_MONOTONIC, {tv_sec=167747, tv_nsec=839275881}) = 0 futex(0x609ab0, FUTEX_WAKE, 1) = 1 futex(0x609a28, FUTEX_WAKE, 1) = 1 clock_gettime(CLOCK_MONOTONIC, {tv_sec=167747, tv_nsec=843012735}) = 0 futex(0x1112d4d0, FUTEX_WAKE, 1) = 1 clock_gettime(CLOCK_MONOTONIC, {tv_sec=167747, tv_nsec=845718629}) = 0 futex(0x60bf14, FUTEX_WAIT, 0, {tv_sec=3, tv_nsec=312564161}) = 0 futex(0x609ab0, FUTEX_WAKE, 1) = 1 futex(0x609a28, FUTEX_WAKE, 1) = 1 clock_gettime(CLOCK_MONOTONIC, {tv_sec=167748, tv_nsec=462146459}) = 0 futex(0x1112d4d0, FUTEX_WAKE, 1) = 1 clock_gettime(CLOCK_MONOTONIC, {tv_sec=167748, tv_nsec=466667282}) = 0 futex(0x60bf14, FUTEX_WAIT, 0, {tv_sec=2, tv_nsec=482891670}) = -1 ETIMEDOUT (Connection timed out) clock_gettime(CLOCK_MONOTONIC, {tv_sec=167750, tv_nsec=953857759}) = 0 futex(0x609ab0, FUTEX_WAKE, 1) = 1 clock_gettime(CLOCK_MONOTONIC, {tv_sec=167750, tv_nsec=957305624}) = 0 futex(0x1112d4d0, FUTEX_WAKE, 1) = 1 clock_gettime(CLOCK_MONOTONIC, {tv_sec=167750, tv_nsec=960209510}) = 0 futex(0x60bf14, FUTEX_WAIT, 0, {tv_sec=0, tv_nsec=198271272}) = -1 ETIMEDOUT (Connection timed out) clock_gettime(CLOCK_MONOTONIC, {tv_sec=167751, tv_nsec=159614691}) = 0 futex(0x609ab0, FUTEX_WAKE, 1) = 1 clock_gettime(CLOCK_MONOTONIC, {tv_sec=167751, tv_nsec=161604613}) = 0 futex(0x1112d4d0, FUTEX_WAKE, 1) = 1 clock_gettime(CLOCK_MONOTONIC, {tv_sec=167751, tv_nsec=163297547}) = 0 futex(0x60bf14, FUTEX_WAIT, 0, {tv_sec=1, tv_nsec=683545038}) = -1 ETIMEDOUT (Connection timed out) clock_gettime(CLOCK_MONOTONIC, {tv_sec=167752, tv_nsec=848639462}) = 0 futex(0x609ab0, FUTEX_WAKE, 1) = 1 clock_gettime(CLOCK_MONOTONIC, {tv_sec=167752, tv_nsec=851593346}) = 0 futex(0x1112d4d0, FUTEX_WAKE, 1) = 1 clock_gettime(CLOCK_MONOTONIC, {tv_sec=167752, tv_nsec=854350238}) = 0

Any idea why this might be happening?

jedisct1 commented 6 years ago

No :)

ginkel commented 6 years ago

I am also running dnscrypt-proxy on a Raspi 3 (in a Docker container, though) and have not observed this issue.

What I observed, though, is a really hefty latency penalty of ~ 300 ms for the initial TLS handshake for DoH. Will open a separate issue for this.

jedisct1 commented 6 years ago

@eroglyph Unfortunately, this is something I've never observed, and the strace dump doesn't convey any useful information.

Can you try to find a reliable way to reproduce this and then reopen this ticket? Or find if disabling a specific feature works around this?

eroglyph commented 6 years ago

I'll see what I can do. I coudn't find anything in the logs to indicate what was going on, but I restarted dnsmasq last night and haven been stable so far.

eroglyph commented 6 years ago

@jedisct1 I've got more log info and I'm starting to understand more of what's going on. My wife and I have work VPNs that we're required to use when we work from home. I can get dnscrypt-proxy to consistently fail when we connect to the VPN. I'm thinking that these connections are making more dns requests than dnscrypt-proxy can handle. But even after disconnecting those machines, the process remains locked-out. I've got the dnscrypt-proxy.log below (log level 0)

[2018-04-04 02:38:45] [NOTICE] Source [public-resolvers.md] loaded [2018-04-04 02:38:45] [NOTICE] dnscrypt-proxy 2.0.8 [2018-04-04 02:38:46] [NOTICE] Now listening to 127.0.0.1:40000 [UDP] [2018-04-04 02:38:46] [NOTICE] Now listening to 127.0.0.1:40000 [TCP] [2018-04-04 02:38:46] [NOTICE] Now listening to [::1]:40000 [UDP] [2018-04-04 02:38:46] [NOTICE] Now listening to [::1]:40000 [TCP] [2018-04-04 02:38:51] [ERROR] Get https://dns.cloudflare.com/dns-query?body=yv4BAAABAAAAAAABAAACAAEAACkQAAAAgAAAAA&ct=&dns=yv4BAAABAAAAAAABAAACAAEAACkQAAAAgAAAAA&random_padding=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers) [2018-04-04 02:38:51] [NOTICE] dnscrypt-proxy is waiting for at least one server to be reachable [2018-04-04 02:39:02] [NOTICE] [cloudflare] OK (DoH) - rtt: 23ms [2018-04-04 02:39:02] [NOTICE] Server with the lowest initial latency: cloudflare (rtt: 23ms) [2018-04-04 06:39:03] [INFO] [cloudflare] OK (DoH) - rtt: 26ms [2018-04-04 06:39:03] [INFO] [cloudflare] OK (DoH) - rtt: 26ms [2018-04-04 06:39:03] [NOTICE] Server with the lowest initial latency: cloudflare (rtt: 26ms) [2018-04-04 10:39:03] [INFO] [cloudflare] OK (DoH) - rtt: 25ms [2018-04-04 10:39:04] [INFO] [cloudflare] OK (DoH) - rtt: 19ms [2018-04-04 10:39:04] [NOTICE] Server with the lowest initial latency: cloudflare (rtt: 19ms) [2018-04-04 13:40:40] [INFO] Server [cloudflare] returned error code [2] -- It might be experiencing upstream connectivity issues [2018-04-04 13:40:40] [INFO] Server [cloudflare] returned error code [2] -- It might be experiencing upstream connectivity issues [2018-04-04 13:40:40] [INFO] Server [cloudflare] returned error code [2] -- It might be experiencing upstream connectivity issues [2018-04-04 13:40:40] [INFO] Server [cloudflare] returned error code [2] -- It might be experiencing upstream connectivity issues [2018-04-04 13:44:16] [WARNING] Too many connections (max=250) [2018-04-04 13:44:16] [WARNING] Too many connections (max=250) [2018-04-04 13:44:16] [WARNING] Too many connections (max=250) [2018-04-04 13:44:16] [WARNING] Too many connections (max=250) [2018-04-04 13:47:27] [NOTICE] Stopped.

eroglyph commented 6 years ago

@jedisct1 It doesn't look like I can reopen this issue. Should I create a new thread instead?

jedisct1 commented 6 years ago
[2018-04-04 13:40:40] [INFO] Server [cloudflare] returned error code [2] -- It might be experiencing upstream connectivity issues

This may be the root cause.

We got proper responses from the server, but 2 stands for SERVFAIL, i.e. the server responded but with an error.

I can reliably reproduce this just by sending more than 3 queries in a row on the same connection.

The number of maximum connections configured in the proxy is unrelated.

I suspect Cloudflare to be heavily rate limiting, or to enforce a cap of 3 in-flight queries on a single connection.

Just reported this to a friend working on Cloudflare DNS.

jedisct1 commented 6 years ago

dnscrypt-proxy closes the connection and temporarily switches to other servers (if there are more configured) when it starts receiving this kind of error.

It should probably wait a little bit before overreacting to that kind of response code, especially since this doesn't necessarily implies that the server is overloaded.

Gonna try to improve this for the next version.

Meanwhile, in proxy.go, you can replace

        if rcode := Rcode(response); rcode == 2 || rcode == 5 { // SERVFAIL // REFUSED

with

        if rcode := Rcode(response); rcode == 5 {  // REFUSED
eroglyph commented 6 years ago

Modifying proxy.go with your fix did help a little. It no longer spikes on my wife's VPN, but I can still get it to spike using mine. It will go back down to 0% usage a few minutes after I disconnect. I'm still unable to resolve queries during the CPU spike.

I wonder if it is related to #315 in that the HTTPS overhead is too high for the Pi. I don't have any CPU or resolving issues if I forgo using dnscrypt-proxy

eroglyph commented 6 years ago

@ginkel I just tested with the ev-us2 server which is connecting with crypto v2 and saw no CPU/latency spike so I do suspect something going on with either Cloudflare or DoH.

jedisct1 commented 6 years ago

ev-us2 uses the DNSCrypt protocol, which is always faster and requires less CPU than DoH.

With Cloudflare, it will be mitigated once Go merges support for TLS 1.3, that supports 0RTT reconnections.

jedisct1 commented 6 years ago

There is something new for you in the configuration file:

## DoH: Use a specific cipher suite
## 49199 = TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
## 49195 = TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
## 52392 = TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305
## 52393 = TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305

### On non-Intel systems such as Raspberry Pi and other ARM systems, the following is recommended
### Everybody else should not set this parameter, and just use the default suite.

# tls_cipher_suite = [52392, 49199]
mibere commented 6 years ago

Same here with dnscrypt-proxy 2.0.9b2

[INFO] Server [cloudflare] returned error code [2] -- It might be experiencing upstream connectivity issues [INFO] Server [cloudflare-ipv6] returned error code [2] -- It might be experiencing upstream connectivity issues

or

[INFO] Server [d0wn-fr-ns1] returned error code [2] -- It might be experiencing upstream connectivity issues