PowerDNS / pdns

PowerDNS Authoritative, PowerDNS Recursor, dnsdist
https://www.powerdns.com/
GNU General Public License v2.0
3.67k stars 906 forks source link

auth: pipebackend lookups twice for the same qname+qtype #11616

Open pawlakus opened 2 years ago

pawlakus commented 2 years ago

Short description

Something changed with the lookup algorithm. pdns_serve with libpipebackend.so is asking twice for the same qname+qtype when not in APEX. We're unable to do WRR because of this, as pdns_serve is asks twice and respond with second answer, ignoring the first answer. Our backend also have to do two geoip-lookups for every query.

Environment

Steps to reproduce

  1. Config:
    
    local-port=2253
    local-address=127.0.0.1 ::1
    version-string=xyz
    server-id=disabled

resolver=127.0.0.1:1153 expand-alias=yes query-cache-ttl=0 cache-ttl=0 zone-cache-refresh-interval=0 zone-metadata-cache-ttl=5 consistent-backends=yes

any-to-tcp=yes disable-axfr=yes edns-subnet-processing=yes

primary=no launch=pipe pipe-abi-version=3 pipe-command=/tmp/pybackend.sock distributor-threads=16

2. Our custom backend is being asked twice for the same qname+qtype and discard first answer.

### Expected behaviour
For qname+qtype outside apex, pdns_serve should not ask twice, It breaks Weighted Round Robin on the backend and it have to process two GeoIP lookups for every query.

### Actual behaviour
1. (correct) pdsn_serve does multiple SOA lookup to establish zone, recursivelly from left-to-right.
2. (incorrect) Once SOA is received, then pdns_server lookups twice for record outside apex `test.domain.com` (zone=`test.domain.com`, qname=`random.test.domain.com`). Bellow is a test run with `random.test.domain.com` generating random pair of A and AAAA records, one each.

dig +subnet=$DE -p 2253 @127.0.0.1 random.test.domain.com A

; <<>> DiG 9.10.6 <<>> +subnet=152.70.186.215 -p 2253 @127.0.0.1 random.test.domain.com A ; (1 server found) ;; global options: +cmd ;; Got answer: ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 7856 ;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1 ;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION: ; EDNS: version: 0, flags:; udp: 1232 ; CLIENT-SUBNET: 152.70.186.215/32/0 ;; QUESTION SECTION: ;random.test.domain.com. IN A

;; ANSWER SECTION: random.test.domain.com. 45 IN A 28.183.97.53

;; Query time: 2 msec ;; SERVER: 127.0.0.1#2253(127.0.0.1) ;; WHEN: Tue May 10 02:02:56 CEST 2022 ;; MSG SIZE rcvd: 79


BACKEND log:

(... 16 threads ) INFO:main:Socket connected DEBUG:main:HELO 3 DEBUG:main:OK pybackend.py INFO:main:Socket connected DEBUG:main:HELO 3 DEBUG:main:OK pybackend.py INFO:main:Socket connected DEBUG:main:HELO 3 DEBUG:main:OK pybackend.py DEBUG:main:Q random.test.domain.com IN SOA -1 0.0.0.0 0.0.0.0 0.0.0.0/0 DEBUG:main:END DEBUG:main:Q test.domain.com IN SOA -1 0.0.0.0 0.0.0.0 0.0.0.0/0 DEBUG:main:DATA 0 1 test.domain.com IN SOA 180 -1 ns1.test.domain.com hostmaster.test.domain.com 1 900 300 172800 180 DEBUG:main:END DEBUG:main:Q random.test.domain.com IN ANY -1 127.0.0.1 127.0.0.1 152.70.186.215/32 DEBUG:main:DATA 0 1 random.test.domain.com IN A 45 -1 39.8.113.118 DEBUG:main:DATA 0 1 random.test.domain.com IN AAAA 45 -1 2603:9999:0:0:B5E9:70CC:54B1:3FDB DEBUG:main:END DEBUG:main:Q random.test.domain.com IN ANY -1 127.0.0.1 127.0.0.1 152.70.186.215/32 DEBUG:main:DATA 0 1 random.test.domain.com IN A 45 -1 28.183.97.53 DEBUG:main:DATA 0 1 random.test.domain.com IN AAAA 45 -1 2603:9999:0:0:A861:7BB3:A13D:B07C DEBUG:main:END



In the above, pdns_server issued two Q lines for random.test.domain.com/ANY to our backend and responded with the last answer, `39.8.113.118` and `2603:9999:...:3FDB` being ignored.

### Other information
In 4.3.2, lookup algorith worked as follows:
1. Querying from left-to-right to establish SOA (zone)
2. For qname outside apex, it asked for qname+qtype/NS to check for child delegation. In 4.6.2, this query is qname+qtype/ANY (incorrect) instead of qname+qtype/NS (correct)
3. If no delegation, then issue a final qname+qtype/ANY

In 4.6.2, step #2 issue qname+qtype/ANY for child delegation checking, instead of qname+qtype/NS. This is why step #2 and step #3 are the same in 4.6.2 and out backend has to process the same query twice.

How it should process: if step #2 doesn't have any NS in the answer, then pdns_server shouldn/t lookup #3.
pawlakus commented 2 years ago
#!/usr/bin/env python3
import os, asyncio, random, logging, signal
socket_path = "/tmp/pybackend.sock"
logger = logging.getLogger(__name__)

def randomA():
    return '.'.join(str(random.randint(1, 223)) for i in range(4))

def randomAAAA():
    return '2603:9999:0:0:' + ':'.join('{:0>4X}'.format(random.randint(1024,64000)) for i in range(4))

def process_query(qname, qtype, client_ip, edns_subnet):
    data = "DATA\t0\t1\t{}\tIN\t{{qtype}}\t{}\t{}\t{{rdata}}\n".format(qname, 3600, -1)
    soa = data.format(qtype="SOA",rdata="ns1.test.domain.com hostmaster.test.domain.com 1 900 300 172800 180")
    if qname == "test.domain.com":
        if qtype in {"SOA","ANY"}:
            yield soa
        if qtype in {"NS","ANY"}:
            yield data.format(qtype="NS",rdata="ns1.test.domain.com")
        if qtype in {"A","ANY"}:
            yield data.format(qtype="A",rdata="1.2.3.4")
    elif qname == "ns1.test.domain.com": # glue
        if qtype in {"A","ANY"}:
            yield data.format(qtype="A",rdata="99.99.99.99")
    elif qname == "random.test.domain.com":
        if qtype in {"A","ANY"}:
            yield data.format(qtype="A", rdata=randomA())
        if qtype in {"AAAA","ANY"}:
            yield data.format(qtype="AAAA", rdata=randomAAAA())

async def rd(reader):
    data = await reader.readline()
    if data:
        line = data.decode()
        logger.debug(line.rstrip())
        return line
    else:
        return False

async def wr(writer, message):
    logger.debug(message.rstrip())
    writer.write(message.encode())
    await writer.drain()

async def handler(reader, writer):
    logger.info("Socket connected")
    line = await rd(reader)
    if line != "HELO\t3\n":
        await wr(writer, "LOG\tWrong protocol\nFAIL\n")
        writer.close()
        return
    await wr(writer,"OK\tmicrobackend.py\n")
    while True:
        line = await rd(reader)
        if not line:
            break # client closed
        a = line.split("\t")
        if a[0] != "Q":
            logger.error("Not supported")
            await wr(writer,"FAIL\n")
            continue
        # Process the results
        for v in process_query(a[1].lower(), a[3], a[5], a[7]):
            await wr(writer,v)
        await wr(writer,"END\n")
    writer.close()
    await writer.wait_closed()
    logger.info("Socket closed")

if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    os.umask(0)
    loop = asyncio.get_event_loop()
    def exit_signal():
        unix_server.close()
        loop.stop()
    for signame in {'SIGINT', 'SIGTERM'}:
        loop.add_signal_handler(getattr(signal, signame), exit_signal)    
    unix_server = asyncio.start_unix_server(handler, path=socket_path, loop=loop)
    loop.create_task(unix_server)
    loop.run_forever()
mind04 commented 2 years ago

This is the result of enabled consistent-backends with a disabled query cache. consistent-backends=no will restore the "old" behavior.

https://doc.powerdns.com/authoritative/settings.html#setting-consistent-backends

pawlakus commented 2 years ago

Thank you, indeed setting consistent-backends=no restores the previous lookup behaviour.

However, I feel that the new lookup algorithm is buggy when cache is either disabled or empty