Control-D-Inc / ctrld

A highly configurable, multi-protocol DNS forwarding proxy
MIT License
400 stars 19 forks source link

ctrld is much slower than manually configuring in the router menu #166

Open VintageEngineer opened 1 month ago

VintageEngineer commented 1 month ago

Interestingly, using the ctrld daemon is 10 to 20ms slower than configuring the router dns in the router menu! (Synology RT2600AC).
Testing was done with Gibson's DNS benchmark. I'm on gigabit fiber in San Diego hitting lax-h03, with status page showing latency of about 7ms.

--------------- ctrld--------manual DoH configuration cached-------- .022--------.008 uncached------.058--------.041 dotcom--------.024--------.011

Actually, the manual configuration is only a bit slower than directly hitting 76.76.2.0 (.006 .040 .010). That's acceptable.

But why is the ctrld daemon so much slower? Is that a bug or something that could be fixed? Maybe the daemon needs to be run at bit higher priority or something?

yegors commented 1 month ago

Legacy DNS is unencrypted and a single packet. Secure DNS is encrypted and has multiple roundtrips + encryption overhead.

This difference is effectively irrelevant, as DNS queries are made occasionally. Every single https:// connection will be 10x slower than DNS resolution, and it happens more frequently.

VintageEngineer commented 1 month ago

Thanks for your quick reply! But... Both entered in the menu and running the daemon use DoH. So the latency difference is not caused by encryption... In fact my numbers above show the encryption barely slows things down. But the daemon adds 10 - 20 ms, doubling the time of some cached and dotcom lookups. My questions are: Why is the ctrld daemon so much slower? Is that a bug or something that could be fixed? Maybe the daemon needs to be run at bit higher priority or something?

yegors commented 1 month ago

Hi,

Please provide detailed debugging data that illustrates the issue. What you have above is ambiguous and does not provide any actionable data.

You can start ctrld in debug mode (check the docs to see how) and provide a log out of the entire resolution. Which would look like this

{"level":"info","time":"2024-07-26T20:28:25-04:00.202","message":"[a956d1] QUERY: 10.0.10.238:61731 (Office-Box) -> listener.0: A github.com"}
{"level":"debug","time":"2024-07-26T20:28:25-04:00.202","message":"[a956d1] no explicit policy matched, using default routing -> [upstream.0]"}
{"level":"debug","time":"2024-07-26T20:28:25-04:00.202","message":"[a956d1] including client info with the request"}
{"level":"debug","time":"2024-07-26T20:28:25-04:00.202","message":"[a956d1] sending query to upstream.0: "}
{"level":"debug","time":"2024-07-26T20:28:25-04:00.202","message":"[a956d1] sending request header: map[X-Cd-Host:[Office-Box] X-Cd-Ip:[10.0.10.238] X-Cd-Mac:[74:56:3c:44:eb:5e] X-Cd-Os:[1-1-6.2]]"}
{"level":"info","time":"2024-07-26T20:28:25-04:00.204","message":"[a956d1] REPLY: upstream.0 -> 10.0.10.238:61731 (Office-Box): NOERROR"}
{"level":"debug","time":"2024-07-26T20:28:25-04:00.204","message":"[a956d1] received response of 67 bytes in 2.5087ms"}

In my case it, the whole process too 2.5ms.

VintageEngineer commented 1 month ago

I'm not sure how to capture logs in a way that allows me to compare the response time from ctrld vs the DoH setup in the router menus... I mean, I can capture ctrld's log, but that doesn't compare the latency with the menu option.

Do you have a router that you could try comparing the latency of ctrld vs the router menu setup yourself? If ctrld is just shim added on the router SW, I could imaging it would add some latency. Is 10 - 20ms expected? Seems like too much to me.

The beauty of Gibson's DNS Benchmark is that it does a LOT of queries and averages them. And it easy pulls the cached vs uncached vs dotcom... not sure how it does that.

It's beer time here in California, I'll consider if I can find a way to get you some more data / logs tomorrow.

VintageEngineer commented 1 month ago

Yegors, thanks for looking at this! I was unable, due to my limited skills, to setup a test where I could get comparison data from the logs. However, I did some additional benchmarking that should be easy for someone else to replicate. I'd really appreciate it if someone else could try this and report back!

This first chart was run from my PC through the Synology RT2600AC router running the ctrld daemon In this chart you can see the local router (boxed) is significantly slower than the controld open free servers. ctrld test The 192.168.77.1 response times are: cached .020 uncached .058 dotcom .023

And this chart was run the same way, but the daemon was stopped and the router configured through it's menus to use DoH: https://dns.controld.com/-resolver- In this chart you can see the performance of the router (boxed) is very similar to the controld open free servers.

menu DoH test The 192.168.77.1 response times are: cached .008 uncached .040 dotcom .009

In this test the ctrld daemon was 12-18ms slower than just configuring the router. Is this just what it takes for the daemon to pass through the requests, on top of the router's latency? Or is there something that can be done to improve this? Maybe the daemon needs to run at higher priority or something?

Notes: The .100 server is my local pi-hole configured to 8.8.8.8 and .254 is the AT&T router provided by my ISP (AT&T fiber).

yegors commented 1 month ago

Honestly, I'm not sure how this ancient piece of software works (I recall briefly using it 12 years ago) or what those names and colors mean, but we would need to see what I asked here: https://github.com/Control-D-Inc/ctrld/issues/166#issuecomment-2253675608

As that provides actionable data from the perspective of the ctrld process.

VintageEngineer commented 1 month ago

I can snatch the logs, but the DNS test software does a zillion requests and I couldn't match the up the request to the log. Is there a way I can use the windows (or mac) command line or powershell to do single cached, uncached and dotcom lookups? I can't see an easy way to do that so we can compare all three.

I guess I could just do some standard lookups. and capture the log... but that won't show us the difference in response time for the daemon vs the router with menu settings. I mean, I haven't proven anything, but the issue seems likely to be in the handoff between the router sw and the daemon and back. We could get an 8ms response in the daemon logs, but the response back through the system could still be 20ms.

I could run the dig command on a mac client, and then pull the log from the daemon for that request. And then I turn off the daemon and setup the the same server in the router menu and run the same dig. But you won't have any logs from the second dig...

VintageEngineer commented 1 month ago

Ok... first I ran dig 5 times with the router configured through it's setup pages to https://dns.controld.com/-resolver- dig query 5 times to controld.com: 9 ms 11ms 14ms 18ms 18ms

Then I deleted the resolver from the router menu turned on the ctrld daemon with logging. dig query 5 times to controld.com: 32ms 39ms 32ms 21ms 30ms

With the daemon, latency averages 16.8ms worse on this dig comparison test.

Here is the log for those 5 using ctrld. They are way faster than what the dig on the mac reports back.

{"level":"info","time":"2024-07-29T15:47:15-07:00.397","message":"[8f2aa2] QUERY: 192.168.77.102:55644 (Carls-Air) -> listener.0: A controld.com"}
{"level":"debug","time":"2024-07-29T15:47:15-07:00.397","message":"[8f2aa2] no explicit policy matched, using default routing -> [upstream.0]"}
{"level":"debug","time":"2024-07-29T15:47:15-07:00.397","message":"[8f2aa2] including client info with the request"}
{"level":"debug","time":"2024-07-29T15:47:15-07:00.397","message":"[8f2aa2] sending query to upstream.0: "}
{"level":"debug","time":"2024-07-29T15:47:15-07:00.398","message":"[8f2aa2] sending request header: map[X-Cd-Host:[Carls-Air] X-Cd-Ip:[192.168.77.102] X-Cd-Mac:[1c:91:80:d9:62:7b]]"}
{"level":"info","time":"2024-07-29T15:47:15-07:00.408","message":"[8f2aa2] REPLY: upstream.0 -> 192.168.77.102:55644 (Carls-Air): NOERROR"}
{"level":"debug","time":"2024-07-29T15:47:15-07:00.408","message":"[8f2aa2] received response of 69 bytes in 11.55942ms"}

{"level":"info","time":"2024-07-29T15:47:16-07:00.489","message":"[4c14fb] QUERY: 192.168.77.102:32922 (Carls-Air) -> listener.0: A controld.com"}
{"level":"debug","time":"2024-07-29T15:47:16-07:00.489","message":"[4c14fb] no explicit policy matched, using default routing -> [upstream.0]"}
{"level":"debug","time":"2024-07-29T15:47:16-07:00.490","message":"[4c14fb] including client info with the request"}
{"level":"debug","time":"2024-07-29T15:47:16-07:00.490","message":"[4c14fb] sending query to upstream.0: "}
{"level":"debug","time":"2024-07-29T15:47:16-07:00.490","message":"[4c14fb] sending request header: map[X-Cd-Host:[Carls-Air] X-Cd-Ip:[192.168.77.102] X-Cd-Mac:[1c:91:80:d9:62:7b]]"}
{"level":"info","time":"2024-07-29T15:47:16-07:00.500","message":"[4c14fb] REPLY: upstream.0 -> 192.168.77.102:32922 (Carls-Air): NOERROR"}
{"level":"debug","time":"2024-07-29T15:47:16-07:00.500","message":"[4c14fb] received response of 69 bytes in 11.12471ms"}

{"level":"info","time":"2024-07-29T15:47:17-07:00.563","message":"[18b241] QUERY: 192.168.77.102:47904 (Carls-Air) -> listener.0: A controld.com"}
{"level":"debug","time":"2024-07-29T15:47:17-07:00.564","message":"[18b241] no explicit policy matched, using default routing -> [upstream.0]"}
{"level":"debug","time":"2024-07-29T15:47:17-07:00.564","message":"[18b241] including client info with the request"}
{"level":"debug","time":"2024-07-29T15:47:17-07:00.564","message":"[18b241] sending query to upstream.0: "}
{"level":"debug","time":"2024-07-29T15:47:17-07:00.565","message":"[18b241] sending request header: map[X-Cd-Host:[Carls-Air] X-Cd-Ip:[192.168.77.102] X-Cd-Mac:[1c:91:80:d9:62:7b]]"}
{"level":"info","time":"2024-07-29T15:47:17-07:00.573","message":"[18b241] REPLY: upstream.0 -> 192.168.77.102:47904 (Carls-Air): NOERROR"}
{"level":"debug","time":"2024-07-29T15:47:17-07:00.574","message":"[18b241] received response of 69 bytes in 10.452885ms"}

{"level":"info","time":"2024-07-29T15:47:18-07:00.683","message":"[06f55f] QUERY: 192.168.77.102:36128 (Carls-Air) -> listener.0: A controld.com"}
{"level":"debug","time":"2024-07-29T15:47:18-07:00.683","message":"[06f55f] no explicit policy matched, using default routing -> [upstream.0]"}
{"level":"debug","time":"2024-07-29T15:47:18-07:00.683","message":"[06f55f] including client info with the request"}
{"level":"debug","time":"2024-07-29T15:47:18-07:00.683","message":"[06f55f] sending query to upstream.0: "}
{"level":"debug","time":"2024-07-29T15:47:18-07:00.684","message":"[06f55f] sending request header: map[X-Cd-Host:[Carls-Air] X-Cd-Ip:[192.168.77.102] X-Cd-Mac:[1c:91:80:d9:62:7b]]"}
{"level":"info","time":"2024-07-29T15:47:18-07:00.693","message":"[06f55f] REPLY: upstream.0 -> 192.168.77.102:36128 (Carls-Air): NOERROR"}
{"level":"debug","time":"2024-07-29T15:47:18-07:00.693","message":"[06f55f] received response of 69 bytes in 10.098813ms"}

{"level":"info","time":"2024-07-29T15:47:19-07:00.745","message":"[93d021] QUERY: 192.168.77.102:32850 (Carls-Air) -> listener.0: A controld.com"}
{"level":"debug","time":"2024-07-29T15:47:19-07:00.745","message":"[93d021] no explicit policy matched, using default routing -> [upstream.0]"}
{"level":"debug","time":"2024-07-29T15:47:19-07:00.745","message":"[93d021] including client info with the request"}
{"level":"debug","time":"2024-07-29T15:47:19-07:00.745","message":"[93d021] sending query to upstream.0: "}
{"level":"debug","time":"2024-07-29T15:47:19-07:00.746","message":"[93d021] sending request header: map[X-Cd-Host:[Carls-Air] X-Cd-Ip:[192.168.77.102] X-Cd-Mac:[1c:91:80:d9:62:7b]]"}
{"level":"info","time":"2024-07-29T15:47:19-07:00.754","message":"[93d021] REPLY: upstream.0 -> 192.168.77.102:32850 (Carls-Air): NOERROR"}
{"level":"debug","time":"2024-07-29T15:47:19-07:00.754","message":"[93d021] received response of 69 bytes in 9.645062ms"}

So this implies to me that the problem is not how the ctrld daemon does the lookup, it is the communication between the daemon and the router, or maybe the priority of the daemon?

(Yegors, I can provide the full logs, but I do not want to post them publicly.) PS: with my limited software skills, this took me a lot of work! I had to learn "dig", and how to export the logs out of the router.

dsohunter commented 1 month ago

I noticed faster speed with the router configured vs the CLI on my WRX560. I changed my config in the CLI to 'v4' under IP Stack (I don't use IPv6 on my router currently). That sped things up for me noticeably.

VintageEngineer commented 1 month ago

How did you confirm whether you are running DoH or DoH3? My config shows https://dns.controld.com/, which I assume means it is running "DoH3", but the status page only says DNS-over-HTTPS. Barry tells me the status page will not tell me if I am specifically running "DoH3".

On Tue, Jul 30, 2024 at 3:33 PM dsohunter @.***> wrote:

I am running a Synology WRX560. I found similar results as you, slower performance with the CLI. I was able to close the gap by using DoH instead of DOH3 in my config.

— Reply to this email directly, view it on GitHub https://github.com/Control-D-Inc/ctrld/issues/166#issuecomment-2259316046, or unsubscribe https://github.com/notifications/unsubscribe-auth/AJFBQHFIJZIAXBVZDISXYUTZPAICHAVCNFSM6AAAAABLRKTMJGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDENJZGMYTMMBUGY . You are receiving this because you authored the thread.Message ID: @.***>

dsohunter commented 1 month ago

The status page will say Protocol DNS-over-HTTPS/3.

https://controld.com/status

But I'm thinking the v4 IP stack setting was perhaps more responsible for my improvement (DOH3 is supposed to be faster).

VintageEngineer commented 1 month ago

Guess Barry got that one wrong.

VintageEngineer commented 1 month ago

but we would need to see what I asked here: #166 (comment)

As that provides actionable data from the perspective of the ctrld process.

Yegors, did you get a chance to look at my comment with the logs above? https://github.com/Control-D-Inc/ctrld/issues/166#issuecomment-2257164617

yegors commented 1 month ago

From the perspective of ctrld, queries are taking ~10ms. On your router you're also using dnsmasq, which hands off requests to ctrld, that could have latency associated with it, but shouldn't be much.

You can try running ctrld as a direct listener on port 53, instead of dnsmasq and run those numbers with the Gibson's benchmark.

VintageEngineer commented 1 month ago

I'll see if I can try disabling dnsmasq soon. On a note... do you think ctrld does this on all routers... I mean really slows down the dns requests as it does on my Synology? If true, it might be a tradeoff that is worth documenting: ctrld allows more control, but is/may be significantly slower? The slowdown might not be significant to some people, but for a long time speed has been the way many people pick their DNS.

VintageEngineer commented 1 month ago

I think I'm at my skill limit. I was unable to easily stop dnsmasq on my synology router.

Also, I'd need to change the ctrld config, to move to port 53. And I am unable to change the config from the web panel: I copied my default TOML config file fron the router and made some small changes. Pasted the new file into the web browser "ctrld Configuration" section. Saved. Then I went into the router and did a ctrld restart. Reply was NTC Service restarted. BUT: web panel still shows. "! Not Deployed." So I cannot easily change the config file to move to port 53.

VintageEngineer commented 1 month ago

I was able to try dig from a client and compared dig @routeraddress -p 5354 controld.com and dig @routeraddress -p 53 controld.com

Hitting port 5354 wasn't consistently faster...

VintageEngineer commented 1 month ago

I decided to try the nextdns daemon to compare it's performance to the ctrld daemon. (a side note, it does caching! So that response is obviously faster.)

But, unlike ctrld the nextdns daemon does perform as I would expect. The responses through the nextdns daemon are not much slower than directly hitting their servers.

From Gibson's DNS Benchmark tool:

--------------- ctrld daemon--------nextdns daemon cached-------- .022-----------------.000 (obviously a locally cached response is much faster) uncached------.058-----------------.050 (similar) dotcom--------.024-----------------.010 (nextdns is way faster)

The same is true with tests using dig.

All my testing hints at some kind of issue with how the ctrld daemon passes stuff to the router, and the nextdns daemon apparently does not have this problem.

edit: I think there is a good chance dnsmasq might be part of the issue for ctrld. It looks like nextdns disables dnsmasq by default?

VintageEngineer commented 1 month ago

Posted on reddit last week, and a couple people have replicated my testing, showing the slow response of the ctrld daemon. It's definitely not just me.