AdguardTeam / AdGuardHome

Network-wide ads & trackers blocking DNS server
https://adguard.com/adguard-home.html
GNU General Public License v3.0
24.68k stars, 1.79k forks

Program memory and connection number problems #4505

Closed Potterli20 closed 2 years ago

Potterli20 commented 2 years ago

dnsproxy and AGH share a common problem: the program leaks memory. Most request connections end up stuck in CLOSE_WAIT, so the program keeps opening connections; within an hour the connection count exceeded 30,000, slowing the network down.

Now it reaches 30,000+ in just 30 minutes.

ainar-g commented 2 years ago

We cannot reproduce this. Please fill the whole issue template and also provide information about which kinds of upstreams you're using. Also, how do you determine the amount of CLOSE_WAIT sockets? Can you show the command and its output? Thanks.

Potterli20 commented 2 years ago

We cannot reproduce this. Please fill the whole issue template and also provide information about which kinds of upstreams you're using. Also, how do you determine the amount of CLOSE_WAIT sockets? Can you show the command and its output? Thanks.

I use a DNS split-routing file, and the program runs under systemctl with: /root/dnsproxy/./dnsproxy -u /root/domain_full.txt -l 0.0.0.0 -p 53 -p 58 -p 57 -b 8.8.8.8 --all-servers --edns --cache --cache-optimistic

But when multiple people use it, there are too many CLOSE_WAIT requests. You can check with this command: netstat -n | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
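For anyone reproducing the check, the awk pipeline above can be exercised on fabricated netstat-style lines (the addresses below are made up) to confirm it tallies sockets by their state, i.e. by the last field of each `tcp` line:

```shell
# Same counting logic as the netstat command above, but fed
# three fake "netstat -n"-style lines instead of live output.
printf '%s\n' \
  'tcp        0      0 10.0.0.2:53  8.8.8.8:853  CLOSE_WAIT' \
  'tcp        0      0 10.0.0.2:53  1.1.1.1:853  CLOSE_WAIT' \
  'tcp        0      0 10.0.0.2:53  9.9.9.9:443  ESTABLISHED' |
awk '/^tcp/ {++S[$NF]} END {for (a in S) print a, S[a]}'
```

This prints each state with its count (here CLOSE_WAIT 2 and ESTABLISHED 1, in no particular order, since awk's for-in iteration order is unspecified).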

Note: a single user doesn't notice it; it only shows up with multiple users.

DNS split-routing file: https://trli.coding.net/p/file/d/dns-hosts/git/lfs/master/dns-adguardhome/whitelist_full.txt

I have also already adjusted sysctl.

ainar-g commented 2 years ago

Thanks for the info, we'll inspect the code and see if we leak any conns.

Potterli20 commented 2 years ago

Thanks for the info, we'll inspect the code and see if we leak any conns.

This is my own /etc/sysctl.conf configuration:

net.ipv4.tcp_retries2 = 8
net.ipv4.tcp_slow_start_after_idle = 0
fs.file-max = 1000000
fs.inotify.max_user_instances = 8192
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_fin_timeout = 10
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
net.ipv4.ip_local_port_range = 1024 65000
net.ipv4.tcp_max_syn_backlog = 16384
net.ipv4.tcp_max_tw_buckets = 6000
net.ipv4.route.gc_timeout = 15
net.ipv4.tcp_syn_retries = 1
net.ipv4.tcp_synack_retries = 1
net.core.somaxconn = 32768
net.core.netdev_max_backlog = 32768
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_max_orphans = 32768

# forward ipv4
net.ipv4.ip_forward = 1
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbrplus
net.ipv4.ip_conntrack_max = 20000
net.ipv4.tcp_tw_recycle = 0
net.ipv4.tcp_keepalive_time = 15
net.ipv4.tcp_keepalive_probes = 5
kern.ipc.maxsockbuf = 3014656
net.core.rmem_max = 3014656

The important one is net.core.rmem_max = 3014656.

I've tuned sysctl.conf and it helped a little, but my service still has to restart the program every 15 minutes.

Potterli20 commented 2 years ago

(image)

Potterli20 commented 2 years ago

(image)

Potterli20 commented 2 years ago

(image)

Potterli20 commented 2 years ago

There's still a problem.

I pushed and compiled it myself. Screenshot_2022-04-22-01-02-15-413_com.termux.jpg

ainar-g commented 2 years ago

Did you update the binary before restarting? Because if so, your screenshot shows that the problem is fixed. TIME_WAIT is just the other side not closing the connection from their side, and the newer logs don't seem to show any CLOSE_WAIT sockets.

Potterli20 commented 2 years ago

Did you update the binary before restarting? Because if so, your screenshot shows that the problem is fixed. TIME_WAIT is just the other side not closing the connection from their side, and the newer logs don't seem to show any CLOSE_WAIT sockets.

Actually, I compiled the build above and ran it for an hour before restarting the program. At first the program was fine, but over time it started tying up ports.

ainar-g commented 2 years ago

@Potterli20, @fernvenue, this is unrelated to this issue, but I've noticed you two downvoting each other and, sometimes, other posters as well. I don't know why you two do that, but could you please stop? That confuses newcomers, like in #4503, and just generally doesn't improve the quality of conversations in issues. Thanks.

ainar-g commented 2 years ago

Actually, I compiled the build above and ran it for an hour before restarting the program. At first the program was fine, but over time it started tying up ports.

Could you look through the netstat output to see what the remote addresses, and especially ports, are? Perhaps this is caused by a particular misbehaving or weirdly behaving upstream. Also, are there any errors in the verbose logs?

fernvenue commented 2 years ago

@ainar-g I have no idea, but I did upvote you. If that caused confusion, I'm sorry, and I will stop using emoji in this project.

(screenshot)

Edited: I have checked and removed all emojis as much as possible; my apologies for that.

ainar-g commented 2 years ago

Upvotes and other reactions are fine, but the downvote is regarded as a fairly negative thing, and it's better not to use it unless you also provide a comment explaining the reason. Again, thanks for understanding.

Potterli20 commented 2 years ago

Could you look through the netstat output to see what the remote addresses, and especially ports, are? Perhaps this is caused by a particular misbehaving or weirdly behaving upstream. Also, are there any errors in the verbose logs?

This feels like a long-standing problem. Right now I restart AGH every 6 hours and restart the dnsproxy upstream every 15 minutes. I can't provide much data; I only know that CLOSE_WAIT has a big impact on Linux. My upstream configuration can be found here: https://github.com/trli-dns/file-scripts/blame/a8698ed8998277737232de716482bd68907f9a21/dns.sh#L141 Screenshot_2022-04-22-20-03-56-866_com.termux.jpg Screenshot_2022-04-22-20-04-00-344_com.termux.jpg Screenshot_2022-04-22-20-06-30-370_com.termux.jpg

Potterli20 commented 2 years ago

@Potterli20, @fernvenue, this is unrelated to this issue, but I've noticed you two downvoting each other and, sometimes, other posters as well. I don't know why you two do that, but could you please stop? That confuses newcomers, like in #4503, and just generally doesn't improve the quality of conversations in issues. Thanks.

There was already a negative atmosphere; after #4316 I no longer wanted to say anything.

Potterli20 commented 2 years ago

Actually, I compiled the build above and ran it for an hour before restarting the program. At first the program was fine, but over time it started tying up ports.

Could you look through the netstat output to see what the remote addresses, and especially ports, are? Perhaps this is caused by a particular misbehaving or weirdly behaving upstream. Also, are there any errors in the verbose logs?

Oh, right: if a DNS upstream is written with the udp:// scheme, it prefers TCP.

Potterli20 commented 2 years ago

https://github.com/AdguardTeam/dnsproxy/issues/230 https://github.com/AdguardTeam/dnsproxy/issues/165 https://github.com/AdguardTeam/AdGuardHome/issues/4214 https://github.com/AdguardTeam/AdGuardHome/issues/4174

The issues above are ones I've raised before, and it's been the same problem over and over. The firewall is fully open, and both the configuration file and sysctl have been adjusted, yet the problem remains. I don't know why; maybe ordinary users just don't notice it. I've kept using your products, and I've wanted to move away from AGH and dnsproxy many times, but your features are still excellent.

EugeneOne1 commented 2 years ago

@Potterli20, hello again. What exact setup are you testing? We've only pushed the fix to dnsproxy's master branch, so AGH's behavior hasn't changed yet. Have you also built AGH from source with the dnsproxy module replaced?

Also, have you tried the dnsproxy as a single resolver? Thanks.

ainar-g commented 2 years ago

It is entirely possible that different network environments uncover different bugs in our implementations. We'll keep looking for them. Thanks for all the info you're providing so far.

Potterli20 commented 2 years ago

@Potterli20, hello again. What exact setup are you testing? We've only pushed the fix to dnsproxy's master branch, so AGH's behavior hasn't changed yet. Have you also built AGH from source with the dnsproxy module replaced?

Also, have you tried the dnsproxy as a single resolver? Thanks.

I have always used dnsproxy as a pure DNS resolver, not for ad blocking. The CLOSE_WAIT problem has been going on for a few weeks; dnsproxy handles the split-routing file with plain DNS. I only changed dnsproxy; AGH was not changed. AGH may be loading my 150 MB rule file, which leads to performance leaks and occasional CLOSE_WAIT issues.

Potterli20 commented 2 years ago

It is entirely possible that different network environments uncover different bugs in our implementations. We'll keep looking for them. Thanks for all the info you're providing so far.

But I see the same problem on both the Chinese network and the international network. CLOSE_WAIT occurs whenever there are too many requests.

Potterli20 commented 2 years ago

@Potterli20, hello again. What exact setup are you testing? We've only pushed the fix to dnsproxy's master branch, so AGH's behavior hasn't changed yet. Have you also built AGH from source with the dnsproxy module replaced?

Also, have you tried the dnsproxy as a single resolver? Thanks.

I'm actually checking for CLOSE_WAIT. The CLOSE_WAIT problem is a problem for Linux.

EugeneOne1 commented 2 years ago

@Potterli20, to what value is the max_goroutines property set in AGH's configuration file?

Potterli20 commented 2 years ago

@Potterli20, to what value is the max_goroutines property set in AGH's configuration file?

AGH doesn't have max_goroutines set, only the cache.

EugeneOne1 commented 2 years ago

@Potterli20, in AGH's configuration file there is a field called max_goroutines; please see the wiki page. In dnsproxy, the same parameter may be configured via the --max-go-routines=<value> flag.

Could you please try setting both of them to 0 first, and then to 1000, and see whether the issue is affected? Thanks.
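For reference, a sketch of what the AGH side of this suggestion looks like (per the wiki, the field sits in the dns section of AdGuardHome.yaml; the value shown is just one of the suggested test values, not a recommendation):

```yaml
# AdGuardHome.yaml — fragment, not a full config
dns:
  max_goroutines: 1000   # try 0 (no limit) first, then 1000
```

On the dnsproxy side, the equivalent would be passing --max-go-routines=0 or --max-go-routines=1000 on the command line.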

Potterli20 commented 2 years ago

@Potterli20, in AGH's configuration file there is a field called max_goroutines; please see the wiki page. In dnsproxy, the same parameter may be configured via the --max-go-routines=<value> flag.

Could you please try setting both of them to 0 first, and then to 1000, and see whether the issue is affected? Thanks.

This is affected; users also feel lag.

Potterli20 commented 2 years ago

@Potterli20, in AGH's configuration file there is a field called max_goroutines; please see the wiki page. In dnsproxy, the same parameter may be configured via the --max-go-routines=<value> flag.

Could you please try setting both of them to 0 first, and then to 1000, and see whether the issue is affected? Thanks.

This is my local upstream DNS; everything is processed on another machine. -6275913982690832320_121.jpg