PowerDNS / pdns

PowerDNS Authoritative, PowerDNS Recursor, dnsdist
https://www.powerdns.com/
GNU General Public License v2.0

auth: Memory leak on slave with stale zones #4992

Closed by peterthomassen 7 years ago

peterthomassen commented 7 years ago

Short description

On a slave with about 1000 stale zones (removed from the master), I observed a memory leak. The slave was configured to send AXFR requests to the master once a minute. The master correctly declares itself not authoritative for these zones.

As long as the stale zones are present on the slave, pdns memory consumption on the slave increased by about 600 MB per day. Other slaves to which the database was replicated do not show this behavior; they have identical zone information, but obviously don't attempt AXFR.

Steps to reproduce

I do not have explicit steps to reproduce at this point. The configuration of the slave showing the memory leak can be found here: https://github.com/desec-io/desec-stack/blob/master/nsmaster/conf/pdns.conf.var

Expected behaviour

Memory usage should not increase

Actual behaviour

Linear increase of memory usage, about 600 MB / day for 1000 stale zones
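To put the observed rate in perspective, here is a rough back-of-envelope calculation of how much memory would be lost per failed AXFR attempt. It is only a sketch: it assumes that every stale zone is retried once per minute and that 1 MB means 10**6 bytes, neither of which is confirmed above.

```python
# Rough estimate of memory leaked per failed AXFR attempt.
# Assumptions (not confirmed in the report): each stale zone is retried
# once per minute, and 1 MB = 10**6 bytes.
STALE_ZONES = 1000
ATTEMPTS_PER_ZONE_PER_DAY = 24 * 60      # one attempt per minute
LEAK_BYTES_PER_DAY = 600 * 10**6         # about 600 MB per day


def leak_per_attempt(leak_bytes_per_day, zones, attempts_per_zone_per_day):
    """Average number of bytes leaked per AXFR attempt."""
    return leak_bytes_per_day / (zones * attempts_per_zone_per_day)


bytes_per_attempt = leak_per_attempt(
    LEAK_BYTES_PER_DAY, STALE_ZONES, ATTEMPTS_PER_ZONE_PER_DAY
)
print(f"{bytes_per_attempt:.0f} bytes per failed AXFR attempt")
```

Under these assumptions the leak works out to roughly 417 bytes per attempt, which would be consistent with a small per-request allocation not being released.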

Other information

Discussed on IRC; we agreed that this issue should be opened.

mind04 commented 7 years ago

I also noticed leaking with a 4.0.3 server:

Feb 11 14:53:12 ns pdns_server: Feb 11 14:53:12 AXFR of domain 'example.nl' initiated by 1.2.3.4
Feb 11 14:53:12 ns pdns_server: Feb 11 14:53:12 AXFR of domain 'example.nl' allowed: client IP 1.2.3.4 is in allow-axfr-ips
Feb 11 14:53:12 ns pdns_server: Feb 11 14:53:12 Signing thread died because of std::exception: Found . in wrong position in DNSName .example.nl
Feb 11 14:53:12 ns pdns_server: Feb 11 14:53:12 Signing thread died because of std::exception: failed in writen2: Broken pipe
Feb 11 14:53:12 ns pdns_server: Feb 11 14:53:12 Signing thread died because of std::exception: Found . in wrong position in DNSName .example.nl
Feb 11 14:53:12 ns pdns_server: Feb 11 14:53:12 TCP Connection Thread died because of STL error: Reading from socket in Signing Pipe loop: Connection reset by peer
Feb 11 14:53:32 ns pdns_server: Feb 11 14:53:32 AXFR of domain 'example.nl' initiated by 2.3.4.5
Feb 11 14:53:32 ns pdns_server: Feb 11 14:53:32 AXFR of domain 'example.nl' allowed: client IP 2.3.4.5 is in allow-axfr-ips
Feb 11 14:53:32 ns pdns_server: Feb 11 14:53:32 Signing thread died because of std::exception: Found . in wrong position in DNSName .example.nl
Feb 11 14:53:32 ns pdns_server: Feb 11 14:53:32 Signing thread died because of std::exception: failed in writen2: Broken pipe
Feb 11 14:53:32 ns pdns_server: Feb 11 14:53:32 Signing thread died because of std::exception: failed in writen2: Broken pipe
Feb 11 14:53:32 ns pdns_server: Feb 11 14:53:32 TCP Connection Thread died because of STL error: Reading from socket in Signing Pipe loop: Connection reset by peer

PowerDNS is leaking memory (middle graph) and file handles (bottom graph) during outgoing AXFR when there is bad data in one of the zones.

[Image: screenshot from 2017-02-14 09-21-41]

The leaking stopped after removal of the bad DNS name.
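For anyone hunting similar bad data, the following is a minimal sketch of a check for names with an empty label, such as the `.example.nl` from the log above. The helper `has_empty_label` is hypothetical and is not PowerDNS's own `DNSName` validation; it treats an empty string or `.` as the root name and ignores escaped dots.

```python
def has_empty_label(name: str) -> bool:
    """Return True if a DNS name contains an empty label other than the root.

    Hypothetical helper for illustration only; escaped dots ("\\.") are
    not handled, and "" or "." is assumed to mean the root name.
    """
    if name in ("", "."):
        return False  # root name, nothing to check
    # Drop a single trailing dot (fully qualified form) before splitting.
    stripped = name[:-1] if name.endswith(".") else name
    return any(label == "" for label in stripped.split("."))


# Example input: record names as they might be exported from a backend.
names = ["example.nl", ".example.nl", "www.example.nl.", "a..example.nl"]
bad_names = [n for n in names if has_empty_label(n)]
print(bad_names)
```

Run over an export of the record names from the backend, this flags entries with a leading dot or consecutive dots.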

rgacogne commented 7 years ago

c23d888db57b340c23476550a7f3689035150cd1 might help with the memory leak. No idea about the FD leak though.

rgacogne commented 7 years ago

Part of the memory leak, or possibly the entire leak, has been fixed in #5177. The FD leak has not been fixed, however.

peterthomassen commented 7 years ago

Just to make sure there is no confusion: PR #5177 deals with a leak on signing errors during outgoing AXFR. However, the memory leak for which I opened this issue occurs on the slave that receives the AXFR, and there is no signing involved on the slave.

Is PR #5177 entirely unrelated?

rgacogne commented 7 years ago

Hi Peter! You are right, I think #5177 is related to the issue reported by @mind04 but not to yours. Do you have anything in your slave logs when the AXFR fails?

peterthomassen commented 7 years ago

@rgacogne There is no unusual failure in the logs -- just the usual information that the zone was not retrieved (which is correct as the master is not authoritative in the case at hand). If you think it helps, I can reproduce the situation and paste the exact log message here.

Habbie commented 7 years ago

@peterthomassen I cannot reproduce your leak on 4.0.3 or master.

Ubuntu 16.04, both versions compiled from git, using mysql-server. Over several slave cycles (interval 0, 1000 domains of which about 500 have a SOA), I see zero memory increase.
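A measurement like this can be scripted. The sketch below is not the method used for the test above; it assumes a Linux host with `/proc` mounted, and the helper names (`parse_vmrss_kib`, `sample`, `growth`) are made up for illustration. It reads the resident set size and the number of open file descriptors of a process so that the two values can be compared across slave cycles.

```python
import os


def parse_vmrss_kib(status_text):
    """Extract the VmRSS value (in kB) from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None  # e.g. kernel threads have no VmRSS line


def sample(pid):
    """Return (rss_kib, open_fd_count) for a process (Linux only)."""
    with open(f"/proc/{pid}/status") as fh:
        rss_kib = parse_vmrss_kib(fh.read())
    fd_count = len(os.listdir(f"/proc/{pid}/fd"))
    return rss_kib, fd_count


def growth(samples):
    """Difference between the last and first (rss_kib, fd_count) sample."""
    first, last = samples[0], samples[-1]
    return last[0] - first[0], last[1] - first[1]
```

Calling `sample(pid)` with the process ID of `pdns_server` once per slave cycle and passing the collected tuples to `growth` gives the change in memory and file descriptors over the run. Reading another user's `/proc/<pid>/fd` usually requires root.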

Habbie commented 7 years ago

@mind04 want to file a separate issue about your master-side FD leak so we don't forget?

Habbie commented 7 years ago

@mind04 the FD leak is fixed in master, but I haven't found which commit did that. I think we still leak memory a bit on master.

Habbie commented 7 years ago

@mind04 leaking memory on master happens with mysql on both valid and invalid AXFRs. No leak with sqlite3.

Habbie commented 7 years ago

@mind04's 4.0.x leak as a master has two parts. One part of the memory leak would be fixed by backporting #5177. For the second part, which leaks FDs as well as more memory, a bisect shows that 90ba52e0e6dcc3efc10cf7738b169a400552e739 fixes it, but that commit is too big to simply backport.

peterthomassen commented 7 years ago

@Habbie, in response to https://github.com/PowerDNS/pdns/issues/4992#issuecomment-291170936: I'm closing the issue then. I'll ask to reopen it if the issue occurs again.

Habbie commented 7 years ago

Thanks!