DNSCrypt / dnscrypt-proxy

dnscrypt-proxy 2 - A flexible DNS proxy, with support for encrypted DNS protocols.
https://dnscrypt.info
ISC License
11.44k stars 1.01k forks source link

ODoH resolution stops working after upstream key rotation #1797

Closed mschirrmeister closed 3 years ago

mschirrmeister commented 3 years ago

Output of the following commands:

./dnscrypt-proxy -version

2.0.46-beta3

./dnscrypt-proxy -check

[2021-07-30 16:14:03] [NOTICE] dnscrypt-proxy 2.0.46-beta3
[2021-07-30 16:14:03] [NOTICE] Anonymized DNS: routing [odohtarget-ms] via [odohrelay-ms]
[2021-07-30 16:14:03] [NOTICE] Configuration successfully checked

./dnscrypt-proxy -resolve example.com

Resolving [example.com] using 127.0.0.1 port 53

Unable to resolve: [Timeout]

What is affected by this bug?

The ODoH resolution is affected.

When does this occur?

I happens after dnscrypt-proxy is running for a while. I have no exact times yet, but since I think it is related to the key rotation of the ODoH target, it will happen after whatever time the ODoH target rotates the key.

Where does it happen?

It happens for every query that is not cached where a ODoH route will be used.

How do we replicate the issue?

Expected behavior (i.e. solution)

dnscrypt-proxy should be able refresh odoh keys to continue resolving queries via ODoH after servers rotated their keys.

Other Comments

I am testing ODoH and was building a POC environment. For this I am using the doh-server as an ODoH target and dnscrypt-proxy as a local client. The dnscrypt-proxy config has only an ODoH configuration. After some time name resolution stopped working. After debugging I think the problem is related to the key rotation of the ODoH target. In the doh-server code I noticed ODOH_KEY_ROTATION_SECS = 86400.

This 24 hours matches the time after it stopped working today. While name resolution via dnscrypt-proxy was not working, I tested ODoH via a separate ODoH client which pulls for every query the config from the target server first and this name resolution worked just fine.

I did a packet capture on the odoh target and for a request via dnscrypt-proxy I can see the following HTTP response.

Hypertext Transfer Protocol
    HTTP/1.1 401 Unauthorized\r\n
        [Expert Info (Chat/Sequence): HTTP/1.1 401 Unauthorized\r\n]
            [HTTP/1.1 401 Unauthorized\r\n]
            [Severity level: Chat]
            [Group: Sequence]
        Response Version: HTTP/1.1
        Status Code: 401
        [Status Code Description: Unauthorized]
        Response Phrase: Unauthorized
    content-length: 0\r\n
        [Content length: 0]
    date: Fri, 30 Jul 2021 13:56:19 GMT\r\n
    \r\n
    [HTTP response 1/1]
    [Time since request: 0.001175000 seconds]
    [Request in frame: 6]
    [Request URI: http://doh.marco.cx/dns-query]

In the dnscrypt-proxy log I can see the following.

[2021-07-30 16:14:48] [WARNING] Failed to receive successful response from [odohtarget-ms]

Locally where dnscrypt-proxy runs, I never see a connection attempt to the ODoH target. It always keeps only creating new connections to the ODoH relay. Thats why I think it is related to a no longer valid ODoH key.

My dnscrypt-proxy configuration looks like this.

server_names = ['odohtarget-ms', 'odohrelay-ms']
listen_addresses = ['0.0.0.0:53']
bootstrap_resolvers = ['1.1.1.1:53']
ignore_system_dns = true
keepalive = 30
ipv4_servers = true
ipv6_servers = false
dnscrypt_servers = false
doh_servers = false
odoh_servers = true
[static]
  [static.'odohtarget-ms']
  stamp = 'sdns://BQcAAAAAAAAADGRvaC5tYXJjby5jeAovZG5zLXF1ZXJ5'
  [static.'odohrelay-ms']
  stamp = 'sdns://hQcAAAAAAAAAAAANb2RvaC5tYXJjby5jeAYvcHJveHk'
[anonymized_dns]
  routes = [
    { server_name='odohtarget-ms', via=['odohrelay-ms'] }
  ]
jedisct1 commented 3 years ago

New keys should be fetched when a 401 response is received: https://github.com/DNSCrypt/dnscrypt-proxy/blob/796a7f6d31737adf352a4711b9983acc90c72daa/dnscrypt-proxy/proxy.go#L705 but maybe this doesn't work as expected.

Did you get a Forcing key update log entry?

/cc @chris-wood

mschirrmeister commented 3 years ago

I never get anything else than [WARNING] Failed to receive successful response from [odohtarget-ms]. This is with loglevel 0.

You can reproduce it faster by restarting the target, instead of waiting for the key rotate. As soon as the target is restarted, dnscrypt-proxy fails and the message above is logged.

jedisct1 commented 3 years ago

Just a thought: does the relay forward the 401 response? What relay are you using?

jedisct1 commented 3 years ago

I was able to reproduce this, thanks a lot for describe a quick way to do it!

When the key is wrong, responses received via a relay have a 200 return code and an empty content.

Something like this:

diff --git a/dnscrypt-proxy/proxy.go b/dnscrypt-proxy/proxy.go
index cd2e76a0..9a55da2a 100644
--- a/dnscrypt-proxy/proxy.go
+++ b/dnscrypt-proxy/proxy.go
@@ -702,7 +702,7 @@ func (proxy *Proxy) processIncomingQuery(clientProto string, serverProto string,
                                                dlog.Warnf("Failed to decrypt response from [%v]", serverName)
                                                response = nil
                                        }
-                               } else if responseCode == 401 {
+                               } else if responseCode == 401 || (len(responseBody) == 0 && responseCode == 200) {
                                        dlog.Infof("Forcing key update for [%v]", serverInfo.Name)
                                        for _, registeredServer := range proxy.serversInfo.registeredServers {
                                                if registeredServer.name == serverInfo.Name {

works around the issue, effectively updating the key. But we're suppose get a 401 no matter what.

jedisct1 commented 3 years ago

When using odoh-relay-crypto-sx as a relay responses get a proper 401 code. When using odoh-relay-surf and odohrelay-ms, the code is 200.

So my guess is that the service running on these relays doesn't properly forward the HTTP status code.

jedisct1 commented 3 years ago

dnscrypt-proxy also didn't properly store the HTTP status code on error. This is now fixed by cedd4f3b5462195930a52c574d8cab62499fd926

With that fix, along with a relay that properly forwards response codes, everything works as expected.

jedisct1 commented 3 years ago

026c42424fac29f22fcc993e0f65f43aab525511 has a workaround for relays not correctly forwarding the status code.

Hope this is temporary.

I'll release the code of my ODoH relay soon (this is currently running as a WebAssembly module, a proper web server needs to be added in order to have it work as as standalone app).

mschirrmeister commented 3 years ago

@jedisct1 I am using Cloudflares odoh-server-go as a relay. It was the only relay code I found, that I can run myself for testing.

I did not check what status code I received from the relay. But looks like you have been able to test this now already. I have everything running in a container at the moment and capturing packets was not that easy. If needed again I should be able to do it, since I have everything behind a Traefik proxy, where it is plain text on the hosts itself.

Does dnscrypt-proxy has an option to dump TLS session keys for debugging? Similar to some browsers which support the variable SSLKEYLOGFILE.

jedisct1 commented 3 years ago

@chris-wood is this a known issue in doh-server-go? I can look at it if you don't have the time to.

There are no options to dump TLS keys. But looks like this is a very simple thing to do in Go. So maybe that can be added if that proves to be useful more often than once.

Thank you for reporting this, as well as explaining how to reproduce it. That was very useful! Thanks!

mschirrmeister commented 3 years ago

Hello @jedisct1,

first of all thank you very much for fixing it so quickly. Amazing. :-)

I did a test and it works fine. Now I can see a key update message.

[2021-08-01 18:57:56] [INFO] Forcing key update for [odohtarget-ms]

With what relay did you the test that you saw the 200 status code and not a 401?

I do not see the message from commit 026c424. Which means the Cloudflare odoh-server-go forwards the status code correctly. I also wanted to see what it returns and captured traffic on my Traefik container to see what it gets back. The response from the odoh-server-go (relay) was a 401.

To rule out that Traefik does not do anything with the packets, I configured a little debug environment with HTTP in between to capture the traffic that leaves Traefik. I created the following locally. dnscrypt-proxy --443--> traefik1 --80--> traefik2 --4567(http)--> odoh-server-go-relay.

Captured traffic on traefik1 and traefik2 shows the correct status code 401, which I guess rules out that the odoh-server-go has a problem with relaying a 401.

Here is the captured data result.

traefik1

marco@dust /tmp> tshark -r t1.pcap -Y http
  193  24.641678   172.17.0.3 → 172.17.0.4   HTTP 204 2   POST /proxy?body_hash=e93bf4bbdad086e927fcb3cc12f08c9b6dbb93c7bc7dc153047389755b4fe975&targethost=doh.marco.cx&targetpath=%2Fdns-query HTTP/1.1  (application/oblivious-dns-message)
  199  24.678825   172.17.0.4 → 172.17.0.3   HTTP 239 2   HTTP/1.1 401 Unauthorized  (text/plain)
  207  28.044518   172.17.0.3 → 172.17.0.4   HTTP 203 2   POST /proxy?body_hash=9c9253fd605554000afc8214546d636013c9c982ad3aa941ed63aadc189198bd&targethost=doh.marco.cx&targetpath=%2Fdns-query HTTP/1.1  (application/oblivious-dns-message)
  211  28.062317   172.17.0.4 → 172.17.0.3   HTTP 749 2   HTTP/1.1 200 OK  (application/oblivious-dns-message)
  219  28.065993   172.17.0.3 → 172.17.0.4   HTTP 234 2   POST /proxy?body_hash=847f9a632f690c10c4e757826aec0706037c31d3164339bf00be97989be530ca&targethost=doh.marco.cx&targetpath=%2Fdns-query HTTP/1.1  (application/oblivious-dns-message)
  222  28.077862   172.17.0.4 → 172.17.0.3   HTTP 429 2   HTTP/1.1 200 OK  (application/oblivious-dns-message)

traefik2

marco@dust /tmp> tshark -r t2.pcap -Y http
  174  20.833539   172.17.0.3 → 172.17.0.4   HTTP 204 1   POST /proxy?body_hash=e93bf4bbdad086e927fcb3cc12f08c9b6dbb93c7bc7dc153047389755b4fe975&targethost=doh.marco.cx&targetpath=%2Fdns-query HTTP/1.1  (application/oblivious-dns-message)
  178  20.834559   172.17.0.4 → 172.17.0.5   HTTP 204 0   POST /proxy?body_hash=e93bf4bbdad086e927fcb3cc12f08c9b6dbb93c7bc7dc153047389755b4fe975&targethost=doh.marco.cx&targetpath=%2Fdns-query HTTP/1.1  (application/oblivious-dns-message)
  180  20.868819   172.17.0.5 → 172.17.0.4   HTTP 239 0   HTTP/1.1 401 Unauthorized  (text/plain)
  182  20.870549   172.17.0.4 → 172.17.0.3   HTTP 239 1   HTTP/1.1 401 Unauthorized  (text/plain)

  187  24.236501   172.17.0.3 → 172.17.0.4   HTTP 203 1   POST /proxy?body_hash=9c9253fd605554000afc8214546d636013c9c982ad3aa941ed63aadc189198bd&targethost=doh.marco.cx&targetpath=%2Fdns-query HTTP/1.1  (application/oblivious-dns-message)
  190  24.237832   172.17.0.4 → 172.17.0.5   HTTP 203 0   POST /proxy?body_hash=9c9253fd605554000afc8214546d636013c9c982ad3aa941ed63aadc189198bd&targethost=doh.marco.cx&targetpath=%2Fdns-query HTTP/1.1  (application/oblivious-dns-message)
  192  24.253582   172.17.0.5 → 172.17.0.4   HTTP 749 0   HTTP/1.1 200 OK  (application/oblivious-dns-message)
  194  24.254149   172.17.0.4 → 172.17.0.3   HTTP 749 1   HTTP/1.1 200 OK  (application/oblivious-dns-message)
  197  24.257860   172.17.0.3 → 172.17.0.4   HTTP 234 1   POST /proxy?body_hash=847f9a632f690c10c4e757826aec0706037c31d3164339bf00be97989be530ca&targethost=doh.marco.cx&targetpath=%2Fdns-query HTTP/1.1  (application/oblivious-dns-message)
  200  24.258563   172.17.0.4 → 172.17.0.5   HTTP 234 0   POST /proxy?body_hash=847f9a632f690c10c4e757826aec0706037c31d3164339bf00be97989be530ca&targethost=doh.marco.cx&targetpath=%2Fdns-query HTTP/1.1  (application/oblivious-dns-message)
  202  24.269319   172.17.0.5 → 172.17.0.4   HTTP 429 0   HTTP/1.1 200 OK  (application/oblivious-dns-message)
  204  24.269699   172.17.0.4 → 172.17.0.3   HTTP 429 1   HTTP/1.1 200 OK  (application/oblivious-dns-message)

It is best visible on traefik2 capture. tcp.stream 1 is from t1 to t2 (frame 174 and 182) and tcp stream 0 (frame 178 and 180) is from t2 to odoh-server-go. Frames 187+ is from when dnscrypt-proxy retries it after they key update, I guess. (Sent only one single dns request via dig)

jedisct1 commented 3 years ago

I got the 200 response code with odohrelay-surf and your relay.

mschirrmeister commented 3 years ago

Maybe there are specific scenarios or requests when it returns a 200. Good question why I see only the 401.

jedisct1 commented 3 years ago

Maybe I'm just confused and your relay was fine. The surf one was returning 200 for sure.

There was another issue, addressed in that code change, that may have affected responses received from your relay.

Anyway, everything is pretty clear now, and I'm going to leave the workaround until it has been confirmed that all relays properly return 401.

mschirrmeister commented 3 years ago

Thank you very much again. Ones your relay code is available I am happy to test and play around with it too.