AdguardTeam / AdGuardHome

Network-wide ads & trackers blocking DNS server
https://adguard.com/adguard-home.html
GNU General Public License v3.0
25.51k stars 1.83k forks source link

Very slow response from upstream servers with a lot of `context deadline exceeded` #7357

Open SukkaW opened 3 weeks ago

SukkaW commented 3 weeks ago

Prerequisites

Platform (OS and CPU architecture)

Linux, ARM64

Raspberry Pi 4

sukka@sukka-pi:/tmp $ uname -a
Linux sukka-pi 6.6.51+rpt-rpi-v8 #1 SMP PREEMPT Debian 1:6.6.51-1+rpt3 (2024-10-08) aarch64 GNU/Linux

Installation

GitHub releases or script from README

Setup

On one machine

AdGuard Home version

v0.107.53

Action

Replace the following command with the one you're calling or a description of the failing action:

nslookup -debug -type=a 'www.example.com' '127.0.0.1'
dig www.example.com @127.0.0.1

Expected result

The result returned immdiately.

Actual result

The result returned in about 20 seconds.

Additional information and/or screenshots

I have set following upstream servers in the DNS settings:

https://120.53.53.53/dns-query
https://1.12.12.12/dns-query
tls://120.53.53.53
tls://1.12.12.12

But when I use the Test upstreams button, AdGuardHome says all those servers could not be used:

image

I have tested those servers locally, they return the result very fast, so definitely no internet connectivity issue, and also there is no problem with the upstream servers:

sukka@sukka-pi:/tmp $ time dig www.example.com @120.53.53.53 +https +short
93.184.215.14

real    0m0.342s
user    0m0.060s
sys     0m0.014s
sukka@sukka-pi:/tmp $ time dig www.example.com @120.53.53.53 +tls +short
93.184.215.14

real    0m0.250s
user    0m0.054s
sys     0m0.025s
sukka@sukka-pi:/tmp $ time dig www.example.com @1.12.12.12 +https +short
93.184.215.14

real    0m0.128s
user    0m0.056s
sys     0m0.024s
sukka@sukka-pi:/tmp $ time dig www.example.com @1.12.12.12 +tls +short
93.184.215.14

real    0m0.185s
user    0m0.059s
sys     0m0.020s

I have also tried with curl:

sukka@sukka-pi:/tmp $ time curl https://120.53.53.53/dns-query

real    0m0.212s
user    0m0.147s
sys     0m0.025s

sukka@sukka-pi:/tmp $ time curl https://1.12.12.12/dns-query

real    0m0.214s
user    0m0.149s
sys     0m0.016s

I then enabled log.verbose in the config file (AdGuardHome.yaml), and I got those logs:

2024/10/18 22:29:36.482822 1090#350 [error] dnsproxy: exchange failed upstream=tls://1.12.12.12:853 question=";www.example.com.\tIN\t A" duration=10.011039607s err="getting conn to tls://1.12.12.12:853: connecting to 1.12.12.12: read tcp 10.10.1.200:60174->1.12.12.12:853: i/o timeout"
2024/10/18 22:29:38.492967 1090#485 [debug] started GET 10.10.1.200:11451 /control/status
2024/10/18 22:29:38.494299 1090#485 [debug] finished GET 10.10.1.200:11451 /control/status in 1.363265ms
2024/10/18 22:29:46.495655 1090#350 [error] dnsproxy: exchange failed upstream=tls://120.53.53.53:853 question=";www.example.com.\tIN\t A" duration=10.012209931s err="getting conn to tls://120.53.53.53:853: connecting to 120.53.53.53: read tcp 10.10.1.200:40744->120.53.53.53:853: i/o timeout"
2024/10/18 22:29:46.496177 1090#350 [debug] dnsproxy: resolving err src=upstream err="all upstreams failed to exchange request: requesting https://120.53.53.53:443/dns-query: Get \"https://120.53.53.53:443/dns-query?dns=AAABAAABAAAAAAABA3d3dwdleGFtcGxlA2NvbQAAAQABAAApCAAAAIAAAAA\": context deadline exceeded\nrequesting https://1.12.12.12:443/dns-query: Get \"https://1.12.12.12:443/dns-query?dns=AAABAAABAAAAAAABA3d3dwdleGFtcGxlA2NvbQAAAQABAAApCAAAAIAAAAA\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)\ngetting conn to tls://1.12.12.12:853: connecting to 1.12.12.12: read tcp 10.10.1.200:60174->1.12.12.12:853: i/o timeout\ngetting conn to tls://120.53.53.53:853: connecting to 120.53.53.53: read tcp 10.10.1.200:40744->120.53.53.53:853: i/o timeout"
2024/10/18 22:29:46.496360 1090#350 [debug] dnsforward: finished processing upstream
2024/10/18 22:29:46.496644 1090#350 [debug] dnsproxy: out line_num=1 line=";; opcode: QUERY, status: SERVFAIL, id: 32875"
2024/10/18 22:29:46.496905 1090#350 [debug] dnsproxy: out line_num=2 line=";; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0"
2024/10/18 22:29:46.497117 1090#350 [debug] dnsproxy: out line_num=3 line=""
2024/10/18 22:29:46.497331 1090#350 [debug] dnsproxy: out line_num=4 line=";; QUESTION SECTION:"
2024/10/18 22:29:46.497545 1090#350 [debug] dnsproxy: out line_num=5 line=";www.example.com.\tIN\t A"
2024/10/18 22:29:46.497752 1090#350 [debug] dnsproxy: out line_num=6 line=""
2024/10/18 22:29:46.498229 1090#350 [debug] dnsproxy: handling dns request proto=udp err="using request handler: all upstreams failed to exchange request: requesting https://120.53.53.53:443/dns-query: Get \"https://120.53.53.53:443/dns-query?dns=AAABAAABAAAAAAABA3d3dwdleGFtcGxlA2NvbQAAAQABAAApCAAAAIAAAAA\": context deadline exceeded\nrequesting https://1.12.12.12:443/dns-query: Get \"https://1.12.12.12:443/dns-query?dns=AAABAAABAAAAAAABA3d3dwdleGFtcGxlA2NvbQAAAQABAAApCAAAAIAAAAA\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)\ngetting conn to tls://1.12.12.12:853: connecting to 1.12.12.12: read tcp 10.10.1.200:60174->1.12.12.12:853: i/o timeout\ngetting conn to tls://120.53.53.53:853: connecting to 120.53.53.53: read tcp 10.10.1.200:40744->120.53.53.53:853: i/o timeout"
2024/10/18 22:32:17.276231 1090#486 [debug] started GET 10.10.1.200:11451 /control/status
2024/10/18 22:32:17.277575 1090#486 [debug] finished GET 10.10.1.200:11451 /control/status in 1.398173ms
2024/10/18 22:32:19.917784 1090#486 [debug] started POST 10.10.1.200:11451 /control/test_upstream_dns
2024/10/18 22:32:29.919010 ERROR response received addr=https://1.12.12.12:443/dns-query proto=tcp status="requesting https://1.12.12.12:443/dns-query: Get \"https://1.12.12.12:443/dns-query?dns=AAABAAABAAAAAAAABHRlc3QAAAEAAQ\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
2024/10/18 22:32:29.919133 ERROR response received addr=https://120.53.53.53:443/dns-query proto=tcp status="requesting https://120.53.53.53:443/dns-query: Get \"https://120.53.53.53:443/dns-query?dns=AAABAAABAAAAAAAABHRlc3QAAAEAAQ\": context deadline exceeded"
2024/10/18 22:32:29.933972 1090#486 [debug] finished POST 10.10.1.200:11451 /control/test_upstream_dns in 10.016169079s
2024/10/18 22:32:48.354208 1090#486 [debug] started GET 10.10.1.200:11451 /control/status
2024/10/18 22:32:48.355457 1090#486 [debug] finished GET 10.10.1.200:11451 /control/status in 1.294076ms
2024/10/18 22:34:47.290447 1090#554 [debug] started GET 10.10.1.200:11451 /control/status
2024/10/18 22:34:47.291737 1090#554 [debug] finished GET 10.10.1.200:11451 /control/status in 1.338358ms
2024/10/18 22:34:48.408058 1090#554 [debug] started POST 10.10.1.200:11451 /control/test_upstream_dns
2024/10/18 22:34:58.409273 ERROR response received addr=https://120.53.53.53:443/dns-query proto=tcp status="requesting https://120.53.53.53:443/dns-query: Get \"https://120.53.53.53:443/dns-query?dns=AAABAAABAAAAAAAABHRlc3QAAAEAAQ\": context deadline exceeded"
2024/10/18 22:34:58.409278 ERROR response received addr=https://1.12.12.12:443/dns-query proto=tcp status="requesting https://1.12.12.12:443/dns-query: Get \"https://1.12.12.12:443/dns-query?dns=AAABAAABAAAAAAAABHRlc3QAAAEAAQ\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
2024/10/18 22:34:58.445246 1090#554 [debug] finished POST 10.10.1.200:11451 /control/test_upstream_dns in 10.037199239s

Also average upstream servers response time:

image

So somehow AdGuardHome failed to connect to upstream servers while dig works just fine.

xWTF commented 3 weeks ago

Same issue, v0.107.53 takes ~20s to resolve through DoH and throws several generic timeout messages. The problem is gone after reverting back to v0.107.52. Possibly related: https://github.com/AdguardTeam/AdGuardHome/issues/7346 (similar issue from v0.108.0-b.58, which is released between these two versions).

Environment: Linux 4.1.52 aarch64 ASUSWRT-Merlin 3004.388.7

I'm stunned by the huge amount of changes between these two versions, not sure how to investigate what's going on from these 29,523 additions and 27,647 deletions.

SukkaW commented 3 weeks ago

@xWTF Even w/ verbose logging I can't see any useful information. So I assume the most likely cause would be the bumping of go from 1.22 to 1.23.

xWTF commented 3 weeks ago

@xWTF Even w/ verbose logging I can't see any useful information. So I assume the most likely cause would be the bumping of go from 1.22 to 1.23.

Yep, on my device the log simply says "timeout" without any additional information, appears to be a regular connection problem but curl and old version works fine. Nothing suspicious from the changelog / milestone, so the version bump could be the cause. Currently I don't have evidence tho, the changes are just too much for me to review.