auth: Memory leak on slave with stale zones

peterthomassen commented 7 years ago

Program: Authoritative
Issue type: Bug report

Short description

On a slave with about 1000 stale zones (removed from the master), I observed a memory leak. The slave was configured to send AXFR requests to the master once a minute. The master correctly declares itself not authoritative.

As long as the stale zones are present on the slave, pdns memory consumption on the slave increased by about 600 MB per day. Other slaves to which the database was replicated do not show this behavior; they have identical zone information, but obviously don't attempt AXFR.

Operating system: Debian Jessie
Software version: 4.0.3-1pdns.jessie, mysql backend
Software source: PowerDNS repository

Steps to reproduce

I do not have explicit steps to reproduce at this point. The configuration of the slave showing the memory leak can be found here: https://github.com/desec-io/desec-stack/blob/master/nsmaster/conf/pdns.conf.var

Expected behaviour

Memory usage should not increase

Actual behaviour

Linear increase of memory usage, about 600 MB / day for 1000 stale zones
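
For a sense of scale, the reported growth can be converted into a rough per-attempt figure. This is only a back-of-envelope sketch: it assumes that every stale zone is retried once per minute and that "MB" means 10^6 bytes, neither of which is confirmed by the report.

```python
# Back-of-envelope estimate of memory leaked per failed AXFR attempt.
# Assumptions (not confirmed in the report): each stale zone is retried
# once per minute, and "MB" means 10**6 bytes.
STALE_ZONES = 1000
ATTEMPTS_PER_ZONE_PER_DAY = 24 * 60
LEAK_BYTES_PER_DAY = 600 * 10**6


def leak_per_attempt(leak_bytes_per_day, zones, attempts_per_zone_per_day):
    """Average number of bytes leaked per failed transfer attempt."""
    return leak_bytes_per_day / (zones * attempts_per_zone_per_day)


estimate = leak_per_attempt(
    LEAK_BYTES_PER_DAY, STALE_ZONES, ATTEMPTS_PER_ZONE_PER_DAY
)
print(f"{estimate:.0f} bytes per failed attempt")
```

Under these assumptions the leak works out to roughly 400 bytes per failed attempt, which would be consistent with a small per-request allocation that is not freed on the failure path.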

Other information

Discussed on IRC; it was agreed that this issue should be opened.

mind04 commented 7 years ago

I also noticed a leak with a 4.0.3 server:

Feb 11 14:53:12 ns pdns_server: Feb 11 14:53:12 AXFR of domain 'example.nl' initiated by 1.2.3.4
Feb 11 14:53:12 ns pdns_server: Feb 11 14:53:12 AXFR of domain 'example.nl' allowed: client IP 1.2.3.4 is in allow-axfr-ips
Feb 11 14:53:12 ns pdns_server: Feb 11 14:53:12 Signing thread died because of std::exception: Found . in wrong position in DNSName .example.nl
Feb 11 14:53:12 ns pdns_server: Feb 11 14:53:12 Signing thread died because of std::exception: failed in writen2: Broken pipe
Feb 11 14:53:12 ns pdns_server: Feb 11 14:53:12 Signing thread died because of std::exception: Found . in wrong position in DNSName .example.nl
Feb 11 14:53:12 ns pdns_server: Feb 11 14:53:12 TCP Connection Thread died because of STL error: Reading from socket in Signing Pipe loop: Connection reset by peer
Feb 11 14:53:32 ns pdns_server: Feb 11 14:53:32 AXFR of domain 'example.nl' initiated by 2.3.4.5
Feb 11 14:53:32 ns pdns_server: Feb 11 14:53:32 AXFR of domain 'example.nl' allowed: client IP 2.3.4.5 is in allow-axfr-ips
Feb 11 14:53:32 ns pdns_server: Feb 11 14:53:32 Signing thread died because of std::exception: Found . in wrong position in DNSName .example.nl
Feb 11 14:53:32 ns pdns_server: Feb 11 14:53:32 Signing thread died because of std::exception: failed in writen2: Broken pipe
Feb 11 14:53:32 ns pdns_server: Feb 11 14:53:32 Signing thread died because of std::exception: failed in writen2: Broken pipe
Feb 11 14:53:32 ns pdns_server: Feb 11 14:53:32 TCP Connection Thread died because of STL error: Reading from socket in Signing Pipe loop: Connection reset by peer

PowerDNS is leaking memory (middle graph) and file handles (bottom graph) during outgoing AXFR when there is bad data in one of the zones.

[Screenshot from 2017-02-14 09-21-41: graphs including memory usage (middle) and open file handles (bottom)]

The leaking stopped after removal of the bad dnsname.
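
To find names like the `.example.nl` in the log above before they trigger the error, a list of names exported from the backend could be scanned for empty labels. The helper below is a hypothetical sketch, not PowerDNS's own `DNSName` validation, and it does not handle escaped dots (`\.`).

```python
def has_empty_label(name):
    """Return True if a DNS name in presentation format has an empty label.

    Hypothetical helper for illustration; it only looks for a leading dot
    or consecutive dots and ignores escaped dots.
    """
    if name in ("", "."):
        return False  # the root name has no labels to check
    # A single trailing dot only marks the name as fully qualified.
    stripped = name[:-1] if name.endswith(".") else name
    return any(label == "" for label in stripped.split("."))


# Example names; only the first comes from the log above.
names = [".example.nl", "example.nl", "www..example.nl", "example.nl."]
bad_names = [n for n in names if has_empty_label(n)]
print(bad_names)
```

Here `bad_names` contains `.example.nl` and `www..example.nl`, the two entries with an empty label.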

rgacogne commented 7 years ago

c23d888db57b340c23476550a7f3689035150cd1 might help with the memory leak. No idea about the FD leak though.

rgacogne commented 7 years ago

Part of the memory leak, or the entire leak, has been fixed in #5177. The FD leak has not been fixed, however.

peterthomassen commented 7 years ago

Just to make sure there is no confusion: PR #5177 deals with a leak on signing errors during outgoing AXFR. However, the memory leak for which I opened this issue occurs on the slave that receives the AXFR, and there is no signing involved on the slave.

Is PR #5177 entirely unrelated?

rgacogne commented 7 years ago

Hi Peter! You are right, I think #5177 is related to the issue reported by @mind04 but not to yours. Do you have anything in your slave logs when the AXFR fails?

peterthomassen commented 7 years ago

@rgacogne There is no unusual failure in the logs -- just the usual information that the zone was not retrieved (which is correct as the master is not authoritative in the case at hand). If you think it helps, I can reproduce the situation and paste the exact log message here.
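
If the situation is reproduced, memory and file descriptor usage could be sampled alongside the logs, so that growth can be matched against individual failed transfers. The following is a minimal sketch that assumes a Linux `/proc` filesystem; `sample_process` is a hypothetical helper, and reading another process's `fd` directory requires running as the same user or as root.

```python
import os


def sample_process(pid):
    """Return (rss_kib, open_fds) for a process, read from Linux /proc."""
    rss_kib = None
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            # The line looks like "VmRSS:     12345 kB".
            if line.startswith("VmRSS:"):
                rss_kib = int(line.split()[1])
                break
    # Each entry in /proc/<pid>/fd is one open file descriptor.
    open_fds = len(os.listdir(f"/proc/{pid}/fd"))
    return rss_kib, open_fds


# Demonstration on the current process; for the server, pass its PID instead.
print(sample_process(os.getpid()))
```

Calling this once a minute for the `pdns_server` PID and logging the result with a timestamp would show whether resident memory and the descriptor count grow together or independently.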