george2asenov opened 5 years ago
Even a BIND setup with only 88 zones acting as master for a PDNS slave triggers this: when BIND is restarted, PDNS starts looping, uses all of the CPU, and repeats this in the log every second:

```
No new unfresh slave domains, 0 queued for AXFR already, 0 in progress
Zone 'domain1.com' is on the list of failed SOA checks. Skipping SOA checks until 1552663085
Zone 'domain2.com' is on the list of failed SOA checks. Skipping SOA checks until 1552663085
Zone 'domain3.com' is on the list of failed SOA checks. Skipping SOA checks until 1552663085
Zone 'domain4.com' is on the list of failed SOA checks. Skipping SOA checks until 1552663085
...
```

It never stops until PDNS is restarted.
Please enable query logging in your backend and show us the result.
Since this is a test environment, please verify whether this issue is still present in auth-4.2.0 beta1. A lot has changed in this area between 4.1.6 and the 4.2.0 beta. Don't forget to add supermaster=yes to your config when you try 4.2.0.
Packages are available from https://downloads.powerdns.com/autobuilt_browser/#/auth/4.2.0-beta1
Hello,
Here is the log: pdns_server.log. This log was made with no access to port 53 from the outside except for one BIND master (so no DNS queries and no other notifications). You can see that after the notify and the transfer complete, the MySQL queries continue at 250 queries per second! I didn't have time to test with 4.2.0 yet, but I will, and I will let you know the results.
I have tested with the 4.2.0 beta release and it behaves the same. If it receives a mass notify, it starts repeating this almost every second:
```
Mar 26 03:24:05 cent7 pdns_server: Zone 'domain68818.com' is on the list of failed SOA checks. Skipping SOA checks until 1553588703
Mar 26 03:24:05 cent7 pdns_server: Zone 'domain41903.com' is on the list of failed SOA checks. Skipping SOA checks until 1553588703
Mar 26 03:24:05 cent7 pdns_server: Zone 'domain92984.com' is on the list of failed SOA checks. Skipping SOA checks until 1553588703
Mar 26 03:24:05 cent7 pdns_server: Zone 'domain96516.com' is on the list of failed SOA checks. Skipping SOA checks until 1553588703
Mar 26 03:24:05 cent7 pdns_server: Zone 'domain60143.com' is on the list of failed SOA checks. Skipping SOA checks until 1553588703
Mar 26 03:24:05 cent7 pdns_server: Zone 'domain81484.com' is on the list of failed SOA checks. Skipping SOA checks until 1553588703
...
Mar 26 03:24:05 cent7 pdns_server: Zone 'domain85207.com' is on the list of failed SOA checks. Skipping SOA checks until 1553588703
...
```

If I activate query logging, once the messages above stop, it starts repeating this query, always with the same ID "139677571072384":

```
Mar 26 04:57:19 Query 139677571072384: SELECT content,ttl,prio,type,domain_id,disabled,name,auth FROM records WHERE disabled=0 and type=? and name=?
Mar 26 04:57:19 Query 139677571072384: 97 usec to execute
Mar 26 04:57:19 Query 139677571072384: 135 total usec to last row
```

then the first one again, repeating.
So far I have been unable to reproduce this using a pdns master and slave. Will set up a bind master later.
BTW, the XXX ip addresses in your example config might hide a problem. It could be that self-notification or a notification loop plays a role. We ask people to supply all the details with good reason. See https://blog.powerdns.com/2016/01/18/open-source-support-out-in-the-open/
With bind as a master and 1000 domains I still cannot reproduce. That is, I see a peak in CPU usage with maybe a bit excessive logging, but that lasts only a short time.
The only thing I can advise is to try to find what is different in your setup, to zoom in on the actual cause.
The comment about sharing a complete unredacted config still applies.
As it appears to be a bit difficult to arrive at a setup that reproduces this, I created a docker-compose setup that reliably causes this: https://github.com/peterthomassen/pdns-cluster (It's just a standard master-slave setup, so one can also use it for a general pdns docker project.)
Steps to reproduce:
```shell
git clone https://github.com/peterthomassen/pdns-cluster && cd pdns-cluster/
docker-compose build
docker-compose up
for i in $(seq 1 1000); do echo '{"name": "'$i'.foobar.test.", "kind": "MASTER", "nameservers": ["ns1.dns.test.", "ns2.dns.test."]}' | curl -sS -v -X POST -d@- http://localhost:8081/api/v1/servers/localhost/zones -H "X-API-Key: nsmasterapikey"; done
```
to create a thousand domains. Then watch the query log (in the dbslave container):

```shell
docker-compose exec dbslave tail -f /var/log/mysql/log
```

You will see that a large number of `Reset stmt` entries are logged by the database. Once things have calmed down, /var/log/mysql/log will be about 1.5M lines long, most of which will be the `Reset stmt` queries.
You can trigger this again any time by sending a notify:
```shell
for i in $(seq 1 1000); do echo $i; curl -X PUT http://localhost:8081/api/v1/servers/localhost/zones/$i.foobar.test./notify -H "X-API-Key: nsmasterapikey"; done
```
The number of queries is not linear in the number of zones. My experiments show 220 statements for 10 zones, about 10,000 statements for 100 zones, and about 1,000,000 statements for 1,000 zones. I stopped the measurement for 10,000 zones at some point (although this is my actual use case).
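As a quick sanity check on those numbers, the measurements fit a quadratic model well (a rough back-of-the-envelope calculation, not a claim about PowerDNS internals):

```python
# Measured statement counts (zones -> statements) quoted above
measured = {10: 220, 100: 10_000, 1000: 1_000_000}

for zones, statements in measured.items():
    # If each of N notifications triggers a freshness check of all N
    # zones, the total should grow roughly like N^2; fixed per-zone
    # overhead would explain the deviation at small N.
    ratio = statements / zones ** 2
    print(f"{zones} zones: {statements} statements ({ratio:.1f} * N^2)")
```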
Thanks. I'll investigate further, hopefully soon.
The problem is in CommunicatorClass::mainloop(). Each time a notification is received, d_any_sem is posted (incremented). Each increment then triggers an out-of-date check of all slave zones in the (tight) outer loop.
Looking into the best way to fix this.
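The mechanism can be sketched as a simplified Python model of the C++ loop (function names here are illustrative, not the actual PowerDNS code):

```python
def mainloop_model(notifications: int, zone_count: int) -> int:
    """Observed behaviour: every d_any_sem post wakes the loop and
    triggers a freshness scan over *all* slave zones."""
    queries = 0
    for _ in range(notifications):      # one wakeup per semaphore post
        queries += zone_count           # each scan touches every zone
    return queries

def mainloop_coalesced(notifications: int, zone_count: int) -> int:
    """One possible fix: drain all pending posts, then scan once."""
    queries = 0
    pending = notifications
    while pending:
        pending = 0                     # swallow every queued wakeup at once
        queries += zone_count           # a single scan covers them all
    return queries

print(mainloop_model(1000, 1000))       # quadratic: 1,000,000 scans' worth
print(mainloop_coalesced(1000, 1000))   # linear: 1,000
```

With 1000 zones all notified at once, the first model does a million zone checks while the coalesced one does a thousand, matching the measured statement counts above.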
At the risk of having misunderstood: That sounds like the observed database requests should be SELECT and not Reset, and also their number should be proportional to the number of notifications, right? My investigations further up show different behavior.
I cannot explain the resets yet. I'm using sqlite, but I will install mysql shortly.
The hypothesis is that N notifications lead to N checks of all slave data, hence the quadratic behaviour (number of notifications × number of zones). If you notify all zones, that's N².

The stack trace showed the thread to be in getUnfreshSlaveInfos() almost all the time.
With mariadb I'm seeing the selects, but each select is paired with a single reset. I'm not seeing a load of resets at the end as you do....
After some thinking and experimenting, it turned out that the mariadb query cache causes SELECTs to not appear in the log. If I enable the query cache, I see the same patterns as you do.
Fix for master branch upcoming.
Not sure if this is the same issue, but on a pdns slave I am getting 100% CPU usage for ca. 10 s after each notify, even for a single zone (pdns 4.4.1, gsqlite backend, ca. 192K domains and 1.4M records).
Is there any real reason to check the freshness of all slave zones when a notify affects only one (or a few)?
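One direction for that (a sketch only; these names are illustrative and not PowerDNS's actual API) would be to track which zones actually received a NOTIFY and restrict the SOA check to those, leaving the full scan to the periodic slave-cycle-interval pass:

```python
from collections import deque

notified = deque()                  # zones that actually received a NOTIFY

def on_notify(zone: str) -> None:
    notified.append(zone)

def freshness_pass() -> list:
    """Check only explicitly-notified zones; the periodic full scan
    (slave-cycle-interval) would still cover everything else."""
    checked = []
    while notified:
        checked.append(notified.popleft())  # SOA check for this zone only
    return checked

on_notify("example.org")
print(freshness_pass())             # ['example.org'], not all 192K zones
```

The trade-off is that a lost NOTIFY would then only be caught by the periodic pass, which is what the refresh interval in the SOA is for anyway.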
Short description
After receiving many NOTIFY queries at once (e.g. when restarting a BIND master with more than 1000 domains, as in this case), PDNS (working as a slave; the chain is BIND -> PDNS superslave/supermaster -> PDNS slave, and it happens on both PDNS servers in the chain) and MySQL start to consume ~100% CPU constantly until the next restart. This continues even after all AXFRs are completed, the log stops logging anything (at loglevel=6), and there are no DNS queries (it is just a test environment). It does nothing but keep CPU consumption high.
Issue https://github.com/PowerDNS/pdns/issues/622 is the same, and it appeared to be resolved back then. Apparently it is not.
Environment
Steps to reproduce
(Lua) scripts that are loaded. --> none
Expected behaviour
After all AXFR complete pdns to become idle and consume ~0%CPU
Actual behaviour
PDNS is stuck using too much CPU for an idling service. strace:
this continues indefinitely, cycling through different zones.
pdns.conf:

```
config-dir=/etc/powerdns
daemon=yes
allow-axfr-ips=XXX.XXX.XXX.XXX/32
disable-axfr=no
guardian=yes
local-address=0.0.0.0
local-port=53
log-dns-details=on
loglevel=6
module-dir=/usr/lib64/pdns
master=no
slave=yes
slave-cycle-interval=120
setgid=pdns
setuid=pdns
socket-dir=/var/run
version-string=powerdns
launch=gmysql
gmysql-host=localhost
gmysql-user=pdns
gmysql-dbname=pdns
gmysql-password=@@@@@@
slave-renotify=yes
only-notify=
also-notify=XXX.XXX.XXX.XXX
webserver=yes
webserver-address=XXX.XXX.XXX.XXX
webserver-allow-from=XXX.XXX.XXX.XXX
api=yes
api-key=TTTTTT
gmysql-dnssec=yes
retrieval-threads=4
receiver-threads=4
signing-threads=4
distributor-threads=4
soa-refresh-default=86000
```
no DNSSEC enabled on any domain.