Closed peterthomassen closed 7 years ago
I also noticed the leak with a 4.0.3 server:
Feb 11 14:53:12 ns pdns_server: Feb 11 14:53:12 AXFR of domain 'example.nl' initiated by 1.2.3.4
Feb 11 14:53:12 ns pdns_server: Feb 11 14:53:12 AXFR of domain 'example.nl' allowed: client IP 1.2.3.4 is in allow-axfr-ips
Feb 11 14:53:12 ns pdns_server: Feb 11 14:53:12 Signing thread died because of std::exception: Found . in wrong position in DNSName .example.nl
Feb 11 14:53:12 ns pdns_server: Feb 11 14:53:12 Signing thread died because of std::exception: failed in writen2: Broken pipe
Feb 11 14:53:12 ns pdns_server: Feb 11 14:53:12 Signing thread died because of std::exception: Found . in wrong position in DNSName .example.nl
Feb 11 14:53:12 ns pdns_server: Feb 11 14:53:12 TCP Connection Thread died because of STL error: Reading from socket in Signing Pipe loop: Connection reset by peer
Feb 11 14:53:32 ns pdns_server: Feb 11 14:53:32 AXFR of domain 'example.nl' initiated by 2.3.4.5
Feb 11 14:53:32 ns pdns_server: Feb 11 14:53:32 AXFR of domain 'example.nl' allowed: client IP 2.3.4.5 is in allow-axfr-ips
Feb 11 14:53:32 ns pdns_server: Feb 11 14:53:32 Signing thread died because of std::exception: Found . in wrong position in DNSName .example.nl
Feb 11 14:53:32 ns pdns_server: Feb 11 14:53:32 Signing thread died because of std::exception: failed in writen2: Broken pipe
Feb 11 14:53:32 ns pdns_server: Feb 11 14:53:32 Signing thread died because of std::exception: failed in writen2: Broken pipe
Feb 11 14:53:32 ns pdns_server: Feb 11 14:53:32 TCP Connection Thread died because of STL error: Reading from socket in Signing Pipe loop: Connection reset by peer
PowerDNS is leaking memory (middle graph) and file handles (bottom graph) during outgoing AXFR when there is bad data in one of the zones.
The leaking stopped after removal of the bad dnsname.
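To find the kind of bad name that shows up in the log above (`.example.nl`, with a leading dot), something like the following could be run over the record names exported from the backend. This is only a rough sketch: `find_malformed_names` is a hypothetical helper, not PowerDNS's own DNSName parser, and it checks for empty labels only.

```python
def find_malformed_names(names):
    """Return the names that contain an empty label (e.g. '.example.nl').

    Hypothetical helper for illustration; it only looks for the pattern
    reported in the log and does not validate label length or characters.
    """
    bad = []
    for name in names:
        # A single dot denotes the root and is a valid name.
        if name == ".":
            continue
        # Drop one trailing dot (fully qualified form) before checking labels.
        stripped = name[:-1] if name.endswith(".") else name
        # An empty label means a leading dot or two consecutive dots.
        if stripped == "" or any(label == "" for label in stripped.split(".")):
            bad.append(name)
    return bad


if __name__ == "__main__":
    sample = ["example.nl", ".example.nl", "www..example.nl", "example.nl."]
    for name in find_malformed_names(sample):
        print("malformed:", name)
```

How the names are exported depends on the backend in use; the list passed in here is just sample data.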
c23d888db57b340c23476550a7f3689035150cd1 might help with the memory leak. No idea about the FD leak though.
Part of the memory leak, or the entire leak, has been fixed in #5177. The FD leak hasn't however.
Just to make sure there is no confusion: PR #5177 deals with a leak on signing errors during outgoing AXFR. However, the memory leak because of which I opened this issue occurs on the slave that receives the AXFR, and there is no signing involved on the slave.
Is PR #5177 entirely unrelated?
Hi Peter! You are right, I think #5177 is related to the issue reported by @mind04 but not to yours. Do you have anything in your slave logs when the AXFR fails?
@rgacogne There is no unusual failure in the logs -- just the usual information that the zone was not retrieved (which is correct as the master is not authoritative in the case at hand). If you think it helps, I can reproduce the situation and paste the exact log message here.
@peterthomassen I cannot reproduce your leak on 4.0.3 or master.
Ubuntu 16.04, both versions compiled from git, using mysql-server. Over several slave cycles (interval 0, 1000 domains of which about 500 have a SOA), I see zero memory increase.
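For anyone who wants to repeat this kind of measurement, here is a minimal sketch of how memory growth over slave cycles could be quantified on Linux. It assumes one sample of `/proc/<pid>/status` is taken per cycle; the helper names `parse_vm_rss_kib` and `slope_per_sample` are made up for this example and this is not necessarily how the numbers above were obtained.

```python
import re


def parse_vm_rss_kib(status_text):
    """Extract the resident set size in KiB from /proc/<pid>/status text."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def slope_per_sample(values):
    """Least-squares slope of values against their sample index."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


if __name__ == "__main__":
    # Sample data only: RSS in KiB after four consecutive slave cycles.
    rss_samples = [20480, 20480, 20484, 20480]
    print("KiB growth per cycle:", slope_per_sample(rss_samples))
```

A slope that stays close to zero over many cycles matches "zero memory increase"; a steadily positive slope would point to a leak. Open file descriptors can be tracked the same way by counting the entries in `/proc/<pid>/fd` at each sample.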
@mind04 want to file a separate issue about your master-side FD leak so we don't forget?
@mind04 the FD leak is fixed in master, but I haven't found which commit did that. I think we still leak memory a bit on master.
@mind04 leaking memory on master happens with mysql on both valid and invalid AXFRs. No leak with sqlite3.
@mind04's 4.0.x leak as a master has two parts. One part of the memory leak would be fixed by backporting #5177. For the second part, which leaks FDs and also more memory, a bisect shows that 90ba52e0e6dcc3efc10cf7738b169a400552e739 fixes it, but we can't just backport that because it is very big.
@Habbie, in response to https://github.com/PowerDNS/pdns/issues/4992#issuecomment-291170936: I'm closing the issue then. I'll ask to reopen it if the issue occurs again.
Thanks!
Short description
On a slave with about 1000 stale zones (removed from the master), I observed a memory leak. Things were configured such that the slave would send AXFR requests to the master once a minute. The master correctly declares itself not authoritative.
As long as the stale zones were present on the slave, pdns memory consumption on the slave increased by about 600 MB per day. Other slaves to which the database was replicated do not show this behavior; they have identical zone information, but obviously don't attempt AXFR.
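As a rough back-of-the-envelope estimate, the figures above can be turned into a per-attempt leak size. This assumes that each of the 1000 stale zones is retried once per minute and that "MB" means MiB; neither is confirmed, so the result is only an order of magnitude.

```python
def bytes_per_failed_attempt(leak_bytes_per_day, zones, attempts_per_zone_per_day):
    """Average leaked bytes per failed AXFR attempt under the stated assumptions."""
    return leak_bytes_per_day / (zones * attempts_per_zone_per_day)


MIB = 1024 * 1024
# Assumption: one attempt per zone per minute, i.e. 24 * 60 attempts per zone per day.
estimate = bytes_per_failed_attempt(600 * MIB, 1000, 24 * 60)

if __name__ == "__main__":
    print(round(estimate))  # prints 437
```

A few hundred bytes per failed attempt would be consistent with a small per-request object not being freed, but that is speculation until the leak is located.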
Steps to reproduce
I do not have explicit steps to reproduce at this point. The configuration of the slave showing the memory leak can be found here: https://github.com/desec-io/desec-stack/blob/master/nsmaster/conf/pdns.conf.var
Expected behaviour
Memory usage should not increase
Actual behaviour
Linear increase of memory usage, about 600 MB / day for 1000 stale zones
Other information
Discussed on IRC; we agreed that this issue should be opened.