AdguardTeam / AdGuardHome

Network-wide ads & trackers blocking DNS server
https://adguard.com/adguard-home.html
GNU General Public License v3.0
24.68k stars 1.79k forks source link

v0.107.31 stops responding to DNS requests, does not terminate #5896

Closed timkgh closed 1 year ago

timkgh commented 1 year ago

Prerequisites

Operating system type

Other (please mention in the description)

CPU architecture

64-bit MIPS

Installation

Other (please mention in the description)

Setup

On a router, DHCP is handled by the router

AdGuard Home version

v0.107.31

Description

v0.107.31 running on EdgeRouter 12, mips64_softfloat

Sometimes it just hangs, it doesn't respond to DNS requests anymore, 1 CPU is at 100% and it also does not gracefully restart via systemctl restart, it needs to be force killed.

I know this is not a lot to go on, these random hangs are hard to troubleshoot. It was stable for a while, I believe this started with the updates to .30 and .31

Jun 11 21:46:01 router AdGuardHome[15089]: 2023/06/11 21:46:01.705395 [info] Received signal "terminated"
Jun 11 21:46:01 router AdGuardHome[15089]: 2023/06/11 21:46:01.705503 [info] stopping AdGuard Home
Jun 11 21:46:01 router AdGuardHome[15089]: 2023/06/11 21:46:01.705544 [info] stopping http server...
Jun 11 21:46:01 router AdGuardHome[15089]: 2023/06/11 21:46:01.705984 [info] stopped http server
Jun 11 21:47:31 router systemd[1]: AdGuardHome.service: State 'stop-sigterm' timed out. Killing.
Jun 11 21:47:31 router systemd[1]: AdGuardHome.service: Killing process 15089 (AdGuardHome) with signal SIGKILL.
Jun 11 21:47:31 router systemd[1]: AdGuardHome.service: Main process exited, code=killed, status=9/KILL
Jun 11 21:47:31 router systemd[1]: AdGuardHome.service: Unit entered failed state.
Jun 11 21:47:31 router systemd[1]: AdGuardHome.service: Failed with result 'timeout'.
ainar-g commented 1 year ago

Could you enable the verbose logging and see if there are any clues there? Enabling debug profiling (debug_pprof: true in the configuration file), getting the profile from http://127.0.0.1:6060/debug/pprof/goroutine?debug=1, and sending them to devteam@adguard.com could also help us see if there are any connection leaks.

timkgh commented 1 year ago

Unfortunately I don't have enough space on my router for verbose logging and I'm also worried about flash wear.

poolski commented 1 year ago

Have you checked that it's not a memory issue? If you're on a ER12, that only has 1GB of RAM. It's conceivable that once RAM is used up, it goes to swap.

Would you be able to show the load values from uptime to see whether there's something going on there? When the system starts swapping heavily, it will slow right to a crawl, and services become unresponsive.

As another aside, is the router still behaving as expected? Or is the CPU bottleneck causing router issues too?

timkgh commented 1 year ago

It is not a memory problem, the memory use is normally < 300MB (right now 185MB). The router is fine when this happens, except 1 CPU core is at 100% taken by AGH. Under normal use the CPU cores are at < 5%. The AGH web interface works, AGH just stops responding to DNS requests. I don't need to reboot the router to recover, I just need to kill AGH.

# free -h
              total        used        free      shared  buff/cache   available
Mem:           973M        173M        549M         11M        250M        773M
Swap:            0B          0B          0B

uptime: 07:01:30 up 142 days, 19:26, 1 user, load average: 0.25, 0.25, 0.16

I've reverted to .29 for now because that's been stable for me. Problems started with the recent .30

copacetic82 commented 1 year ago

I am having a simiular random issue and am running the latest v0.107.32 on centos-release-7-9.2009.1.el7.centos.x86_64. AGH stops responding to DNS requests, and still able to access the GUI. Restarting the AdGuardHome service does not work i.e. have to force a reboot. Was not able to check CPU/Memory usage yet, but never had issues before the last 1-2 releases. I did observe the logs below, and have increaed the UDP receiver buffer as described here (https://github.com/quic-go/quic-go/wiki/UDP-Receive-Buffer-Size). Am waiting to see if the issue stops...

[root@dns1 ~]# systemctl status AdGuardHome ● AdGuardHome.service - AdGuard Home: Network-level blocker Loaded: loaded (/etc/systemd/system/AdGuardHome.service; enabled; vendor preset: disabled) Active: active (running) since Fri 2023-06-16 06:40:36 CEST; 13h ago Main PID: 734 (AdGuardHome) CGroup: /system.slice/AdGuardHome.service └─734 /root/AdGuardHome/AdGuardHome -s run

Jun 16 17:18:11 dns1 AdGuardHome[734]: 2023/06/16 17:18:11.470190 [error] dnsproxy: unpacking udp packet: dns: buffer size too small Jun 16 17:18:14 dns1 AdGuardHome[734]: 2023/06/16 17:18:14.472600 [error] dnsproxy: unpacking udp packet: dns: buffer size too small Jun 16 17:21:11 dns1 AdGuardHome[734]: 2023/06/16 17:21:11.454448 [error] dnsproxy: unpacking udp packet: dns: buffer size too small Jun 16 17:21:14 dns1 AdGuardHome[734]: 2023/06/16 17:21:14.464525 [error] dnsproxy: unpacking udp packet: dns: buffer size too small Jun 16 17:21:40 dns1 AdGuardHome[734]: 2023/06/16 17:21:40.538860 [error] dnsproxy: unpacking udp packet: dns: buffer size too small Jun 16 17:21:43 dns1 AdGuardHome[734]: 2023/06/16 17:21:43.802327 [error] dnsproxy: unpacking udp packet: dns: buffer size too small Jun 16 17:40:34 dns1 AdGuardHome[734]: 2023/06/16 17:40:34.338152 [error] dnsproxy: unpacking udp packet: dns: buffer size too small Jun 16 17:40:37 dns1 AdGuardHome[734]: 2023/06/16 17:40:37.342884 [error] dnsproxy: unpacking udp packet: dns: buffer size too small Jun 16 18:00:19 dns1 AdGuardHome[734]: 2023/06/16 18:00:19.746670 [error] dnsproxy: unpacking udp packet: dns: buffer size too small Jun 16 18:00:22 dns1 AdGuardHome[734]: 2023/06/16 18:00:22.771022 [error] dnsproxy: unpacking udp packet: dns: buffer size too small [root@dns1 ~]#

copacetic82 commented 1 year ago

I am having a simiular random issue and am running the latest v0.107.32 on centos-release-7-9.2009.1.el7.centos.x86_64. AGH stops responding to DNS requests, and still able to access the GUI. Restarting the AdGuardHome service does not work i.e. have to force a reboot. Was not able to check CPU/Memory usage yet, but never had issues before the last 1-2 releases. I did observe the logs below, and have increaed the UDP receiver buffer as described here (https://github.com/quic-go/quic-go/wiki/UDP-Receive-Buffer-Size). Am waiting to see if the issue stops...

[root@dns1 ~]# systemctl status AdGuardHome ● AdGuardHome.service - AdGuard Home: Network-level blocker Loaded: loaded (/etc/systemd/system/AdGuardHome.service; enabled; vendor preset: disabled) Active: active (running) since Fri 2023-06-16 06:40:36 CEST; 13h ago Main PID: 734 (AdGuardHome) CGroup: /system.slice/AdGuardHome.service └─734 /root/AdGuardHome/AdGuardHome -s run

Jun 16 17:18:11 dns1 AdGuardHome[734]: 2023/06/16 17:18:11.470190 [error] dnsproxy: unpacking udp packet: dns: buffer size too small Jun 16 17:18:14 dns1 AdGuardHome[734]: 2023/06/16 17:18:14.472600 [error] dnsproxy: unpacking udp packet: dns: buffer size too small Jun 16 17:21:11 dns1 AdGuardHome[734]: 2023/06/16 17:21:11.454448 [error] dnsproxy: unpacking udp packet: dns: buffer size too small Jun 16 17:21:14 dns1 AdGuardHome[734]: 2023/06/16 17:21:14.464525 [error] dnsproxy: unpacking udp packet: dns: buffer size too small Jun 16 17:21:40 dns1 AdGuardHome[734]: 2023/06/16 17:21:40.538860 [error] dnsproxy: unpacking udp packet: dns: buffer size too small Jun 16 17:21:43 dns1 AdGuardHome[734]: 2023/06/16 17:21:43.802327 [error] dnsproxy: unpacking udp packet: dns: buffer size too small Jun 16 17:40:34 dns1 AdGuardHome[734]: 2023/06/16 17:40:34.338152 [error] dnsproxy: unpacking udp packet: dns: buffer size too small Jun 16 17:40:37 dns1 AdGuardHome[734]: 2023/06/16 17:40:37.342884 [error] dnsproxy: unpacking udp packet: dns: buffer size too small Jun 16 18:00:19 dns1 AdGuardHome[734]: 2023/06/16 18:00:19.746670 [error] dnsproxy: unpacking udp packet: dns: buffer size too small Jun 16 18:00:22 dns1 AdGuardHome[734]: 2023/06/16 18:00:22.771022 [error] dnsproxy: unpacking udp packet: dns: buffer size too small [root@dns1 ~]#

Did not solve the issue...CPU is at 100%, enough free memory. Reverted back to v0.107.29.

ainar-g commented 1 year ago

@copacetic82, those errors are likely signalling that some plain-UDP clients are sending invalid DNS packets. We'd appreciate if you could reproduce the situation with verbose logging and profile collecting, as described in https://github.com/AdguardTeam/AdGuardHome/issues/5896#issuecomment-1588896375, if you decide to give it another go.

copacetic82 commented 1 year ago

@copacetic82, those errors are likely signalling that some plain-UDP clients are sending invalid DNS packets. We'd appreciate if you could reproduce the situation with verbose logging and profile collecting, as described in https://github.com/AdguardTeam/AdGuardHome/issues/5896#issuecomment-1588896375, if you decide to give it another go.

Hi...thanks for replying...unfortunately i cant reproduce it at the moment as I am away on vacation. But a revert to v0.107.29 did solve the issue for me...no problems since, and using the same clients. Maybe I have time to debug once I get back.

Porygon31 commented 1 year ago

It's happening to me since the last update. I will try the verbose logging

TheWoodsy commented 1 year ago

Same issue. AGH on Ubuntu 22.04 snap v0.107.32 Headless 8700k 32gb Pegged a core at 100% Nothing in logs around time of death in syslog/journal/AdGuardHome.err WebUI works, DNS doesn't. Yikes...

simonregn commented 1 year ago

I face the same problem. Never had issues with AdguardHome until a few weeks. I agree this issue must have started with the last releases.

ainar-g commented 1 year ago

We are still not able to reproduce this properly, and the data we've received so far unfortunately doesn't shed any light on what is happening and why. Anyone still having issues, please help us collect the data by doing the following:

  1. Update to v0.108.0-b.38, which has more debug logging, or a later Edge release.

  2. Enable the verbose logging.

  3. Enable debug profiling (debug_pprof: true in the configuration file).

  4. Wait for the issue to occur.

  5. Save the profile from http://127.0.0.1:6060/debug/pprof/goroutine?debug=1, the verbose log, and send both devteam@adguard.com with “AdGuard Home Issue 5896” in the subject line.

bartounet16 commented 1 year ago

Hello same problem this morning . whith my adguard docker version: v0.108.0-a.586+2069eddf Stop to listen and respond DNS request.

Web UI is working

This problem is during since 1 month !!!

bigwookie commented 1 year ago

5962 Had the same problem here on 0.107.33 have reverted back to 0.107.29 as suggested.

bigwookie commented 1 year ago

We are still not able to reproduce this properly, and the data we've received so far unfortunately doesn't shed any light on what is happening and why. Anyone still having issues, please help us collect the data by doing the following:

  1. Update to v0.108.0-b.38, which has more debug logging, or a later Edge release.
  2. Enable the verbose logging.
  3. Enable debug profiling (debug_pprof: true in the configuration file).
  4. Wait for the issue to occur.
  5. Save the profile from http://127.0.0.1:6060/debug/pprof/goroutine?debug=1, the verbose log, and send both devteam@adguard.com with “AdGuard Home Issue 5896” in the subject line.

Have set this up and left it running.

ainar-g commented 1 year ago

Thanks to the logs sent by @bigwookie in #5962, we were able to pinpoint a bug in the code which could cause such issues. We've fixed it in version v0.108.0-a.600+c4651647 on the Edge channel. Could you please check if the fix improves the situation?

bigwookie commented 1 year ago

Thanks to the logs sent by @bigwookie in #5962, we were able to pinpoint a bug in the code which could cause such issues. We've fixed it in version v0.108.0-a.600+c4651647 on the Edge channel. Could you please check if the fix improves the situation?

@ainar-g I have downloaded this and left it running.

bigwookie commented 1 year ago

@ainar-g I can confirm it's running better.

erdmanbr commented 1 year ago

I've been having the same issue I believe and managed to sort out how to update to the latest Edge release over the weekend. It stalled again last night (running v0.108.0-a.600+c4651647) and I was able to grab and send a log of the event. Will follow up if it happens again since I neglected to enable the pprof option this time.

bigwookie commented 1 year ago

@ainar-g Just failed again will get the logs and send on.

up2you1 commented 1 year ago

just to confirm, i am also getting the same on v0.107.33

  1. LXC Container in Proxmox - the CPU randomly spikes to 100% which causes it to malfunction and not pass any dns through
  2. Docker Container in Unraid - CPU also spikes here too and stops it from working

A reboot of both fixes is temporarily

LXC ADH set to Primary DNS and Docker is Secondary in DHCP

Ive had to switch the LXC Container off as ive no idea how to downgrade it or even use the Edge channel, but ive set the Docker to use Edge and its pulled v0.108.0-a.600+c4651647.

il monitor it

ainar-g commented 1 year ago

@timkgh, were you able to try the Edge fix?

ricardonadao commented 1 year ago

Was having a similar problem where 3 adguard instances stopped randomly working. And kept rolling back to 108.30 beta, and with the latest edge version seems to have solved the issue

timkgh commented 1 year ago

@timkgh, were you able to try the Edge fix?

No, but it doesn't look like it's fixed based on other reports.

ricardonadao commented 1 year ago

Was having a similar problem where 3 adguard instances stopped randomly working. And kept rolling back to 108.30 beta, and with the latest edge version seems to have solved the issue

Close to 2 days without issues, before wouldn't survive 2h sometimes

bartounet16 commented 1 year ago

I dont understand rollback to which version résolve temporrary the issue??? A fact is my first problem was saturday 10 june Adguard is automatically update whith docker Which version has been updated 10 june

timkgh commented 1 year ago

I dont understand rollback to which version résolve temporrary the issue??? A fact is my first problem was saturday 10 june Adguard is automatically update whith docker Which version has been updated 10 june

.29 is stable

ricardonadao commented 1 year ago

.30 also works fine for me.

ricardonadao commented 1 year ago

I dont understand rollback to which version résolve temporrary the issue??? A fact is my first problem was saturday 10 june Adguard is automatically update whith docker Which version has been updated 10 june

Auto updated? Hmm I was expecting to only update when a new image was pulled only and then recreate the container?

You can pull a specific docker image version, that is what I've done to use the .30

docker pull adguard/adguardhome:v0.108.0-b.30

Pulls the specific version and then if you add the label when you create your container should use the image

up2you1 commented 1 year ago

v0.108.0-a.600+c4651647.

So far so good with this version. (Edge)

erdmanbr commented 1 year ago

@ainar-g was kind enough to review my logs from the other day and determined that I had a different issue. After disabling these two reverse DNS options I have been able to get ~36 hours of issue-free uptime. I'm still on the Edge branch but will move back to the master branch with the same configuration this weekend to confirm.

image

He recommended that I follow up here in case any of you have a similar issue.

bartounet16 commented 1 year ago

v0.108.0-a.600+c4651647.

So far so good with this version. (Edge)

I dont understand i thought good version is before v0.107.29 and not 108 ??

timkgh commented 1 year ago

@ainar-g I can try a v0.107.x build if you can make one with the patch. I don't want to move to 108 yet.

up2you1 commented 1 year ago

v0.108.0-a.600+c4651647.

So far so good with this version. (Edge)

I dont understand i thought good version is before v0.107.29 and not 108 ??

Both are good. 107.29 is the version before the issue and 108 edge is the patched version.

the edge version is working fine for me so far.

hopefully latest stable release will also work

up2you1 commented 1 year ago

my docker which was using edge tag has now updated to v0.108.0-a.602+c21f958e from v0.108.0-a.600+c4651647. il keep an eye on it.

bartounet16 commented 1 year ago

how using edge in docker. I have image: adguard/adguardhome:latest and my adguard version is version: v0.107.33 docker pull adguard/adguardhome:edge i think

Ok i have now version: v0.108.0-a.604+7b92d53b If the issue was solved ; what was the problem ?

ricardonadao commented 1 year ago

do we know when the "fix" will be merged to stable and beta branches?

bigwookie commented 1 year ago

You need v0.108.0-a.600+c4651647 on the edge tag.

I have had it stop once, so it is better but still not as stable as 0.107.29

bigwookie commented 1 year ago

OK newest edge v0.108.0-a.604+7b92d53b has stopped and showing the same issue seen here.

I have full logs and sending to you @ainar-g

ricardonadao commented 1 year ago

I have a v0.108.0-a.604+7b92d53b and a a600+c4651647 and they are still ok, but I'm load balancing across 3 instances so I could be not pushing it too much. But I am monitoring with an active healthcheck resolving google.com every 10sec or so

up2you1 commented 1 year ago

All good here on

v0.108.0-a.600+c4651647 v0.108.0-a.602+c21f958e

Running together. Top one is primary dns and Bottom is secondary dns.

Both running in docker on separate hosts.

bigwookie commented 1 year ago

I have a v0.108.0-a.604+7b92d53b and a a600+c4651647 and they are still ok, but I'm load balancing across 3 instances so I could be not pushing it too much. But I am monitoring with an active healthcheck resolving google.com every 10sec or so

I had the edge running for 3 days without issue and then failed, sometimes 5hrs.

I also run two load balanced, an edge release and a 0.107.29 so I don't lose DNS.

ricardonadao commented 1 year ago

I have a v0.108.0-a.604+7b92d53b and a a600+c4651647 and they are still ok, but I'm load balancing across 3 instances so I could be not pushing it too much. But I am monitoring with an active healthcheck resolving google.com every 10sec or so

I had the edge running for 3 days without issue and then failed, sometimes 5hrs.

I also run two load balanced, an edge release and a 0.107.29 so I don't lose DNS.

Same here 3 docker instances in dif VMs:

bigwookie commented 1 year ago

Failed again v0.108.0-a.604+7b92d53b sent full logs again to support.

bartounet16 commented 1 year ago

if i understand well the problem seems to be not resolve ?

ainar-g commented 1 year ago

@bigwookie, thanks again for the logs! We have found another potential issue and we're working on a fix.

timkgh commented 1 year ago

@bigwookie, thanks again for the logs! We have found another potential issue and we're working on a fix.

Are these issues related to the browsing security and parental control services, so disabling them would prevent the issue from occurring?

ainar-g commented 1 year ago

Yes and yes.

ainar-g commented 1 year ago

Version v0.108.0-a.605+c02a1411 on the Edge channel contains the new fix. @bigwookie, could you please try that one out as well?

bigwookie commented 1 year ago

Version v0.108.0-a.605+c02a1411 on the Edge channel contains the new fix. @bigwookie, could you please try that one out as well?

I am now running Version: v0.108.0-a.605+c02a1411