PowerDNS / pdns

PowerDNS Authoritative, PowerDNS Recursor, dnsdist
https://www.powerdns.com/
GNU General Public License v2.0
3.59k stars 900 forks source link

ifportup/ifurlup unexpected behavior #12557

Open sfoutrel opened 1 year ago

sfoutrel commented 1 year ago

Short description

ifurlup/ifportup does not use checkresults to answer on first resolv.

Environment

Steps to reproduce

  1. add records to existing or new zone

    tcp.domain-client.tld.fdns.failoverdns.test     0       IN      LUA     A "ifportup(80, {'100.64.43.123', '100.64.43.124','100.64.43.126'})"
    www.domain-client.tld.fdns.failoverdns.test     0       IN      LUA     A "ifurlup('http://www.domain-client.tld/test/', {{'100.64.43.123', '100.64.43.124'},{'100.64.43.126'}})"
  2. add parameters to pdns.conf then restart server expire-delay and check-interval are low value to reproduce multiple times and analyze easily.

    lua-health-checks-expire-delay=10
    lua-health-checks-interval=2
    log-dns-details
    log-dns-queries
    log-timestamp
    loglevel=9
  3. use dig to try first resolv.

    
    dig tcp.domain-client.tld.fdns.failoverdns.test @localhost

; <<>> DiG 9.16.37-Debian <<>> tcp.domain-client.tld.fdns.failoverdns.test @localhost ;; global options: +cmd ;; Got answer: ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 52470 ;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1 ;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION: ; EDNS: version: 0, flags:; udp: 1232 ;; QUESTION SECTION: ;tcp.domain-client.tld.fdns.failoverdns.test. IN A

;; ANSWER SECTION: tcp.domain-client.tld.fdns.failoverdns.test. 0 IN A 100.64.43.124

;; Query time: 4 msec ;; SERVER: ::1#53(::1) ;; WHEN: Thu Feb 16 12:44:42 CET 2023 ;; MSG SIZE rcvd: 88

corresponding logs

Feb 16 12:44:42 rd-psoa-01 pdns_server[2281479]: Remote ::1 wants 'tcp.domain-client.tld.fdns.failoverdns.test|A', do = 0, bufsize = 1232 (4096): packetcache MISS Feb 16 12:44:42 rd-psoa-01 pdns_server[2281479]: Lua record monitoring declaring TCP/IP 100.64.43.126:80 DOWN: connecting to 100.64.43.126:80 failed: Connection refused Feb 16 12:44:42 rd-psoa-01 pdns_server[2281479]: Lua record monitoring declaring TCP/IP 100.64.43.123:80 UP! Feb 16 12:44:42 rd-psoa-01 pdns_server[2281479]: Lua record monitoring declaring TCP/IP 100.64.43.124:80 DOWN: connecting to 100.64.43.124:80 failed: Connection refused



5. a new call to the record gives the UP ip in answer as expected. let pass 10s and try again and you can have a DOWN ip as answer.
6. with ifurlup you can have the failback ip as first resolv even when one of the first ip is up.

### Expected behaviour
using the first check results as answer.

### Actual behaviour
seems to answer one of the ips at random without check
### Other information
<!-- if you already did more digging into the issue, please provide all the information you gathered -->
Habbie commented 1 year ago

This looks like a duplicate of #9529, or am I missing something?

sfoutrel commented 1 year ago

I just saw your answer on #9529. I though the issue was different (He asked if healthckeck could be done when the LUA test was added).

So on first resolv we can have a wrong answer and each time lua-health-checks-expire-delay value is reached. is it a bug or by design ?

Habbie commented 1 year ago

is it a bug or by design ?

Like #9529, this currently is by design.

chbruyand commented 1 year ago

As I don't think we want to change the design, maybe somethingg like https://github.com/PowerDNS/pdns/pull/12558 could help controling the default behavior ?

sfoutrel commented 1 year ago

SPOILER: I'm not a C dev (at last a junior Python enthusiast). @chbruyand Correct me if I read it badly but your proposal would only push the ips in the "candidates" object instead of "unavailables". I don't understand why would that change anything. @Habbie I read all the code and from my understanding you do not wait for the availability of the results of the async checks to answer (to avoid deadlocks ?). I found another issue (which maybe not be one), performance. Does the lua functions use the cache ? I tried setting the cache-ttl to different values but I always get about 5000 req/s instead of 130K req/s with standard records with sql backend.

chbruyand commented 1 year ago

@sfoutrel The first call to if{port,url}up() will always result in a negative return as checks are performed asynchronously and we use that first call to register the port/url. To rephrase, we consider a port/url to be down by default unless a check has been done (asynchronously). My purposal is to have an option to control that default behavior.

sfoutrel commented 1 year ago

I don't get it. Actual version does already that. It consider every port/url to be down so it replies with any ip of the set (according to selector). From my understanding, your proposal is interesting only for ifurlup if we use a failover set of ips. It would answer of the "primary ips" instead of one of all ips (primary and failover).