PowerDNS / pdns

PowerDNS Authoritative, PowerDNS Recursor, dnsdist
https://www.powerdns.com/
GNU General Public License v2.0
3.7k stars 908 forks source link

dnsdist 1.1.0-beta1 doesn't release file descriptors #4605

Closed dseira closed 8 years ago

dseira commented 8 years ago

Hi,

I'm trying the last dnsdist 1.1.0-beta1 in CentOS 7 and I can see that the file descriptors are not released. The dnsdist is installed using the repo.

This is the fd usage and the queries graphs for a week:

seleccion_021

seleccion_020

As you can see, the first days, when the file descriptors increased up to 4096 (default max open files configured in the systemd service) the dnsdist produced a segmentation fault and systemd restarted again. Then I modified the max fd up to 65000 just to check if we had too many connections, but as you can see the fd increased without releasing them.

The configuration for the dnsdist is:

newServer({address="1.2.3.4:53", name="pdns-server1", pool="pool1"})
newServer({address="5.6.7.8:53", name="pdns-server2", pool="pool1"})
newServer({address="1.2.3.4:54", name="pdns-recursor1", pool="pool2"})
newServer({address="5.6.7.8:54", name="pdns-recursor2", pool="pool2"})
addPoolRule({"192.168.1.0/24"}, "pool1")`
addPoolRule({"0.0.0.0/0"}, "pool2")
setServerPolicy(roundrobin)
controlSocket("127.0.0.1")
addLocal("0.0.0.0:53", true, true)
setMaxTCPClientThreads(256)
setMaxUDPOutstanding(65535)
pc = newPacketCache(10000, 86400, 0, 60, 60)
getPool("pool1"):setCache(pc)
getPool("pool2"):setCache(pc)
addACL("0.0.0.0/0")
addAction(NotRule(RecordsCountRule(DNSSection.Question, 1, 1)), RCodeAction(dnsdist.REFUSED))
webserver("1.2.3.4:8083", "xxxxxxxxx", "xxxxxxxxxx")
carbonServer("2.2.2.2","dnsdist1","30")

It seems a bug with this new version.

rgacogne commented 8 years ago

Hi David,

Would you mind posting the result of a lsof -p <dnsdist pid>, so we can see what kind of file descriptor dnsdist is leaking? I would also be interested in the result of thread apply all backtrace in gdb if by any chance you have the core dump corresponding to your crash.

rgacogne commented 8 years ago

Oh and if you are using our packages, please install the debug package as well before getting the backtrace with gdb. That way it'll be a lot more useful to us!

dseira commented 8 years ago

The dnsdist lsof out is:

dnsdist.txt

The stacktrace for the dnsdist with debuginfo is:

dnsdist-gdb.txt

rgacogne commented 8 years ago

Thank you! The gdb trace doesn't look like a core dump trace, and the process is running with the same pid that the one in your lsof, so I assume it's a trace of the same dnsdist process as the lsof'd one? If so it looks like we have a dangling FD problem, since several TCP threads should be reading from a pipe but are actually reading from a TCP socket with no connection attached to it. Could you post the result of dumpStats(), showTCPStats(), showBinds() and showServers() from the dnsdist console? Also, if you wouldn't you mind doing a strace -f of the dnsdist process for a few minutes at most, it would help me a lot understanding what's going on.

dseira commented 8 years ago

You're right. I did the gdb in the running dnsdist process.

This is the output you requested:

dumpStats()

acl-drops 0 latency0-1 113297764 block-filter 0 latency1-10 1692190 cache-hits 90038215 latency10-50 3057952 cache-misses 30980505 latency100-1000 2063720 cpu-sys-msec 5158451 latency50-100 670253 cpu-user-msec 2467467 no-policy 0 downstream-send-errors 0 noncompliant-queries 0 downstream-timeouts 32027 noncompliant-responses 0 dyn-block-nmg-size 0 queries 121018725 dyn-blocked 0 rdqueries 121016506 empty-queries 0 real-memory-usage 52449280 fd-usage 23071 responses 30896109 latency-avg100 88763.8 rule-drop 0 latency-avg1000 26155.0 rule-nxdomain 0 latency-avg10000 14141.6 self-answered 0 latency-avg1000000 12071.9 servfail-responses 5818941 latency-slow 152445 trunc-failures 0 uptime 240113

showTCPStats()

Clients MaxClients Queued MaxQueued 256 256 0 0

showBinds()

Address Protocol Queries

0 0.0.0.0:53 UDP 120966356 1 0.0.0.0:53 TCP 52369

showServers()

Name Address State Qps Qlim Ord Wt Queries Drops Drate Lat Outstanding Pools

0 pdns1 1.2.3.4:53 up 0.0 0 1 1 14568 1 0.0 6.5 0 pool1 1 pdns2 5.6.7.8:53 up 0.0 0 1 1 14714 1 0.0 5.5 0 pool1 2 pdns1 1.2.3.4:54 up 0.0 0 1 1 15476707 16831 0.0 136.0 0 pool2 3 pdns2 5.6.7.8:54 up 0.0 0 1 1 15474004 15194 0.0 71.2 0 pool2 All

The strace output:

dnsdist-strace.txt

rgacogne commented 8 years ago

Thank you again! Everything looks normal in the statistics you provided, and unfortunately there is not a single TCP connection in the strace dump. Could you try letting the strace run a little bit longer? I also forgot to ask earlier, but did you check if you have anything interesting in your logs?

dseira commented 8 years ago

Attached new strace:

dnsdist-strace.txt

In the logs I cannot see anything strange. Just a messages regarding the limit of the FD in systemd:

Oct 21 12:52:03 dnsdist1 kernel: dnsdist[9047]: segfault at 0 ip 00007f03eff7276b sp 00007f0265f392c0 error 4 in libc-2.17.so[7f03eff07000+1b7000] Oct 21 12:52:03 dnsdist1 systemd: dnsdist.service: main process exited, code=killed, status=11/SEGV Oct 21 12:51:38 dnsdist1 journal: Suppressed 849805 messages from /system.slice/dnsdist.service

Let me know if the strace is not useful.

rgacogne commented 8 years ago

Unfortunately the new trace has no TCP query either. Based on your graphs you should be leaking one FD every ten seconds or so, perhaps you could let the strace run a bit longer? If GH complains about the file size you can upload it on another site, or even e-mail me at remi.gacogne at powerdns dot com directly since I really would like to understand the issue. I can't remember whether CentOS has moved to coredumpctl, but if it does I would be interested in the backtrace corresponding to the segfault, dnsdist shouldn't crash, even we are out of FDs.

dseira commented 8 years ago

Sure, no problem, I will run the strace more time. But keep in mind that this server is not in production (I had to move around because of this bug), I can do some manual queries.

I can reduce the limit FD for this machine just to try to catch the segfault with the strace. Is this ok?

rgacogne commented 8 years ago

Yes, it would be great! Since you export statistics to a carbon server, do you see anything suspicious there, like the memory or CPU usage increasing, for example?

dseira commented 8 years ago

Tomorrow morning I will do the test. The graphs look good (only the FD usage is growing).

If it can help this is the sysctl of those servers:

net.ipv4.tcp_synack_retries = 2 net.ipv4.conf.all.log_martians = 1 net.ipv4.conf.default.accept_redirects = 0 net.ipv6.conf.all.disable_ipv6 = 1 net.ipv4.ip_forward = 0 net.ipv4.conf.default.secure_redirects = 0 net.ipv4.conf.default.accept_source_route = 0 net.ipv4.conf.default.rp_filter = 1 net.ipv4.tcp_syncookies = 1 net.ipv4.conf.all.send_redirects = 0 net.ipv4.conf.all.rp_filter = 1 net.ipv4.conf.all.accept_source_route = 0 net.ipv4.icmp_echo_ignore_broadcasts = 1 vm.swappiness = 0 net.ipv4.conf.all.secure_redirects = 0 kernel.sysrq = 0 net.ipv4.conf.default.send_redirects = 0 net.ipv4.conf.all.accept_redirects = 0 net.ipv6.conf.default.disable_ipv6 = 1 net.ipv6.conf.lo.disable_ipv6 = 1 net.netfilter.nf_conntrack_max = 131072 net.ipv4.ip_nonlocal_bind = 1 net.ipv4.ip_local_port_range = 1024 65534 fs.file-max = 1048576

FYI, I'm having the same problem with the 1.0.0 version (installed also throught the repository in a centos 7 too):

seleccion_024

It seems the same behaviour.

dseira commented 8 years ago

While trying the test this morning I found the explanation for the increase of the FD. The problem was related to a firewall I have between the dnsdist and the dns-backend; it was applying some rate-limit to the tcp requests and this produced to keep the sockets open. I will keep on eye this week but it seems the problem was gone modifying the firewall rules.

Thanks for all your support ;-)

rgacogne commented 8 years ago

Hi David! Thank you for keeping us updated! I'm still wondering why dnsdist kept the FDs open for so long, but I guess that might happen with some kind of heavy rate-limiting if the TCP queries are served very slowly and new ones keep coming. In that case #4596 might help keeping the number of used FDs under control.

dseira commented 8 years ago

Sure. Thanks!