PowerDNS / pdns

PowerDNS Authoritative, PowerDNS Recursor, dnsdist
https://www.powerdns.com/
GNU General Public License v2.0

Issue with pdns-recursor YAML Configuration and dnsdist #14761

Open Khoshnevis opened 1 week ago

Khoshnevis commented 1 week ago

Short description

When using dnsdist (1.9.6) with pdns-recursor (5.1.1), everything works fine with the .conf configuration format. However, when switching to the YAML configuration format for pdns-recursor, dnsdist intermittently marks the recursor as "up" and "down", and the number of SERVFAIL responses increases. Reverting back to the .conf format resolves the issue.

Environment

Steps to reproduce

  1. Configure pdns-recursor using the .conf file.
  2. Verify that dnsdist and pdns-recursor are working normally (recursor stays "up", and no abnormal SERVFAIL responses).
  3. Switch pdns-recursor configuration to the YAML format.
  4. Observe dnsdist showing recursor as "up" and "down" intermittently, along with an increase in SERVFAIL responses. Note that our QPS per load balancer (dnsdist) is approximately 4,000 at minimum and 30,000 at maximum.

Expected behaviour

dnsdist should consistently mark the pdns-recursor as "up" when using the YAML configuration, similar to the behavior observed with the .conf configuration format. The number of SERVFAIL responses should remain low.

Actual behaviour

dnsdist intermittently marks the pdns-recursor as "up" and "down" when using the YAML configuration. There is a significant increase in SERVFAIL responses.
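dnsdist decides whether a backend is "up" or "down" from periodic health-check queries, so one way to narrow this down is to probe a recursor directly and see whether replies are intermittently slow or missing without dnsdist in the path. The following is a minimal sketch using only the Python standard library. The query name, the 1-second timeout, and the backend address in the usage note are assumptions for illustration, not values taken from this setup.

```python
import socket
import struct
import time


def build_query(name: str, qtype: int = 1, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query packet (one question, class IN, RD set)."""
    # Header: ID, flags (0x0100 = recursion desired), QDCOUNT=1, other counts 0.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: each label is length-prefixed, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
        if label
    ) + b"\x00"
    # QTYPE (1 = A by default) followed by QCLASS (1 = IN).
    return header + qname + struct.pack("!HH", qtype, 1)


def probe(server: str, name: str, timeout: float = 1.0, port: int = 53):
    """Send one UDP query; return (rcode, seconds), or (None, seconds) on failure."""
    query = build_query(name)
    start = time.monotonic()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, port))
            reply, _ = sock.recvfrom(4096)
        except OSError:
            # Covers timeouts as well as other socket-level errors.
            return None, time.monotonic() - start
    elapsed = time.monotonic() - start
    # Ignore replies that are too short or carry a different query ID.
    if len(reply) < 12 or reply[:2] != query[:2]:
        return None, elapsed
    # RCODE is the low 4 bits of the fourth header byte (0 = NOERROR, 2 = SERVFAIL).
    return reply[3] & 0x0F, elapsed
```

Calling `probe("192.0.2.1", "a.root-servers.net.")` once per second against each recursor (with the placeholder address replaced by a real backend) and logging the results under both configuration formats would show whether timeouts or SERVFAIL answers (rcode 2) appear only with the YAML file.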

Other information

  1. DNSSEC differences: The .conf version uses dnssec=process, while the YAML version uses process-no-validate. As far as I know, this may contribute to the increased SERVFAIL responses.
  2. Attempted to use rec_control show-yaml recursor.conf. Although rec_control help listed show-yaml as an available command, executing it returned: Unknown command 'show-yaml', try 'help'.
  3. Temporary solution: Reverting to the .conf configuration format resolves the problem.
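The DNSSEC mismatch in item 1 is not the only place where the two files posted below disagree. A minimal sketch that compares a handful of settings, with values copied by hand from the recursor.conf and recursor.yml in this report, might look like the following. The treatment of yes/no/true/false as booleans is a simplifying assumption, and the helper names are hypothetical.

```python
# (old-style name -> YAML path, .conf value, YAML value), copied from this report.
SETTINGS = [
    ("dnssec -> dnssec.validation", "process", "process-no-validate"),
    ("quiet -> logging.quiet", "yes", False),
    ("hint-file -> recursor.hint_file", "/etc/powerdns/root.hints", "no"),
    ("threads -> recursor.threads", "40", 40),
    ("qname-minimization -> recursor.qname_minimization", "yes", True),
    # Assumption: forward-zones-recurse implies recurse=True on the .conf side.
    ("forward-zones-recurse '.' -> recurse", True, False),
]

CONF_DONT_QUERY = (
    "127.0.0.0/8, 10.0.0.0/8, 100.64.0.0/10, 169.254.0.0/16, 192.168.0.0/16, "
    "172.16.0.0/12, ::1/128, fc00::/7, fe80::/10, 0.0.0.0/8, 192.0.0.0/24, "
    "192.0.2.0/24, 198.51.100.0/24, 203.0.113.0/24, 240.0.0.0/4, ::/96, "
    "::ffff:0:0/96, 100::/64, 2001:db8::/32"
)
YAML_DONT_QUERY = [
    "127.0.0.0/8", "10.0.0.0/8", "100.64.0.0/10", "169.254.0.0/16",
    "192.168.0.0/16", "172.16.0.0/12", "::1/128", "fc00::/7", "fe80::/10",
]


def normalize(value):
    """Map yes/no/true/false spellings to bool and digit strings to int."""
    if isinstance(value, str):
        lowered = value.strip().lower()
        if lowered in ("yes", "true"):
            return True
        if lowered in ("no", "false"):
            return False
        if lowered.isdigit():
            return int(lowered)
        return value.strip()
    return value


def differing(settings):
    """Return the labels whose normalized .conf and YAML values differ."""
    return [
        label
        for label, conf_value, yaml_value in settings
        if normalize(conf_value) != normalize(yaml_value)
    ]


def missing_entries(conf_list: str, yaml_list):
    """Return comma-separated .conf entries that are absent from the YAML list."""
    present = set(yaml_list)
    entries = [item.strip() for item in conf_list.split(",") if item.strip()]
    return [item for item in entries if item not in present]


for label in differing(SETTINGS):
    print("differs:", label)
print("dont-query entries missing from YAML:",
      missing_entries(CONF_DONT_QUERY, YAML_DONT_QUERY))
```

Any setting this flags is a candidate explanation that is independent of the file format itself, so aligning those values first makes the .conf versus YAML comparison fairer.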

dnsdist.conf:

setACL({})
addACL("0.0.0.0/0")
addACL("::0/0")

setMaxUDPOutstanding(65535)
setRingBuffersSize(1000000, 10)
setMaxTCPQueuedConnections(0)
setTCPInternalPipeBufferSize(0)
setUDPSocketBufferSizes(0,0)
setUDPMultipleMessagesVectorSize(0)
releaseBuffers=true
setUDPTimeout(9)
setCacheCleaningPercentage(50)
setStaleCacheEntriesTTL(86400)
pc = newPacketCache(50000000, {maxTTL=86400, minTTL=0, temporaryFailureTTL=10, staleTTL=86400, dontAge=false})
getPool(""):setCache(pc)
setServerPolicy(leastOutstanding)

addAction(MaxQPSIPRule(200000, 32, 48), DropAction()) 
addAction(AllRule(), LogAction("/var/log/dnsdist/queries.log", false, true, true, true, true))
addResponseAction(AllRule(), LogResponseAction("/var/log/dnsdist/responses.log", true, true, true, true))
addAction(NotRule(RecordsCountRule(DNSSection.Question, 1, 1)), RCodeAction(DNSRCode.REFUSED))
addAction(OrRule({OpcodeRule(DNSOpcode.Notify), OpcodeRule(DNSOpcode.Update), QTypeRule(DNSQType.AXFR), QTypeRule(DNSQType.IXFR)}), RCodeAction(DNSRCode.REFUSED))
addAction(RegexRule("[,\\+\\?\\*\\;\\(\\)\\%\\:\\^\\~\\`\\#\\'\\@]"), RCodeAction(DNSRCode.REFUSED))

controlSocket('127.0.0.1:5199')
setKey("control-key")

addLocal("x.x.x.x:53" , {reusePort=true})
addLocal("[xxxx:xxxx:xxxx:xxxx:xxxx]:53" , {reusePort=true})

newServer({address="192.168.3.x", name="DNS Server 1"})
newServer({address="192.168.3.x", name="DNS Server 1"})
newServer({address="192.168.3.x", name="DNS Server 1"})
newServer({address="192.168.3.x", name="DNS Server 1"})
newServer({address="192.168.3.x", name="DNS Server 1"})
newServer({address="192.168.3.x", name="DNS Server 2"})
newServer({address="192.168.3.x", name="DNS Server 2"})
newServer({address="192.168.3.x", name="DNS Server 2"})
newServer({address="192.168.3.x", name="DNS Server 2"})
newServer({address="192.168.3.x", name="DNS Server 2"})

webserver("0.0.0.0:8080")
setWebserverConfig({password="fwutech", apiKey="superSecretAPIKey", acl="172.16.x.x/24"})

recursor.yml:

dnssec:
  aggressive_cache_min_nsec3_hit_ratio: 10000
  aggressive_nsec_cache_size: 300000
  validation: process-no-validate
incoming:
  allow_from:
    - 127.0.0.0/8
    - 192.168.0.0/16
    - 172.16.0.0/12
    - ::1/128
    - fc00::/7
    - fe80::/10
  distribution_load_factor: 1.25
  distribution_pipe_buffer_size: 0
  distributor_threads: 20
  listen:
    - 127.0.0.1
    - 192.168.x.x
    - ::1
  max_concurrent_requests_per_tcp_connection: 20
  max_tcp_clients: 1024
  max_udp_queries_per_round: 65000
  pdns_distributes_queries: True
  port: 53
  tcp_fast_open: 1
logging:
  loglevel: 1
  quiet: False
  trace: False
outgoing:
  dont_query:
    - 127.0.0.0/8
    - 10.0.0.0/8
    - 100.64.0.0/10
    - 169.254.0.0/16
    - 192.168.0.0/16
    - 172.16.0.0/12
    - ::1/128
    - fc00::/7
    - fe80::/10
  network_timeout: 1600
  source_address:
    - x.x.x.x
  tcp_fast_open_connect: True
  udp_source_port_max: 65530
  udp_source_port_min: 1024
packetcache:
  disable: False
  max_entries: 40000000
  negative_ttl: 120
  servfail_ttl: 0
  shards: 4096
  ttl: 86400
recordcache:
  max_cache_bogus_ttl: 14400
  max_entries: 10000000
  max_negative_ttl: 7200
  max_ttl: 86400
  shards: 4096
recursor:
  config_dir: /etc/powerdns
  cpu_map: 0=0 1=1 2=2 3=3 4=4 5=5 6=6 7=7 8=8 9=9 10=10 11=11 12=12 13=13 14=14 15=15 16=16 17=17 18=18 19=19 20=20 21=21 22=22 23=23 24=24 25=25 26=26 27=27 28=28 29=29 30=30 31=31 32=32 33=33 34=34 35=35 36=36 37=37 38=38 39=39
  forward_zones:
    - zone: [excluded TLD]
      recurse: False
      forwarders:
        - x.x.x.x
        - x.x.x.x
        - x.x.x.x
    - zone: [excluded TLD]
      recurse: False
      forwarders:
        - x.x.x.x
        - x.x.x.x
        - x.x.x.x

  forward_zones_recurse:
    - zone: .
      recurse: False
      forwarders:
        - 1.0.0.1
        - 1.1.1.1
        - 8.8.4.4
        - 8.8.8.8

  hint_file: no
  max_mthreads: 4096
  max_total_msec: 7600
  qname_minimization: True
  setgid: pdns
  setuid: pdns
  stack_cache_size: 512
  stack_size: 1048576
  threads: 40
  version_string: Miu-Miu!
webservice:
  address: 0.0.0.0
  allow_from:
    - 172.16.x.x/24
  api_key: [CENSORED]
  password: [CENSORED]
  port: 8082
  webserver: True

recursor.conf:

setgid=pdns
setuid=pdns
config-dir=/etc/powerdns
hint-file=/etc/powerdns/root.hints
allow-from=127.0.0.0/8, 192.168.0.0/16, 172.16.0.0/12, ::1/128, fc00::/7, fe80::/10
dont-query=127.0.0.0/8, 10.0.0.0/8, 100.64.0.0/10, 169.254.0.0/16, 192.168.0.0/16, 172.16.0.0/12, ::1/128, fc00::/7, fe80::/10, 0.0.0.0/8, 192.0.0.0/24, 192.0.2.0/24, 198.51.100.0/24, 203.0.113.0/24, 240.0.0.0/4, ::/96, ::ffff:0:0/96, 100::/64, 2001:db8::/32
local-address=127.0.0.1, 192.168.x.x, ::1
query-local-address=x.x.x.x
udp-source-port-min=1024
udp-source-port-max=65530
qname-minimization=yes
local-port=53
loglevel=1
quiet=yes
trace=no
dnssec=process
aggressive-nsec-cache-size=300000
aggressive-cache-min-nsec3-hit-ratio=10000
distribution-load-factor=1.25
distribution-pipe-buffer-size=0
#distributor-threads=40
pdns-distributes-queries=yes
threads=40
distributor-threads=20
max-mthreads=4096
stack-size=1048576
stack-cache-size=512
max-total-msec=7600
max-udp-queries-per-round=65000
cpu-map=0=0 1=1 2=2 3=3 4=4 5=5 6=6 7=7 8=8 9=9 10=10 11=11 12=12 13=13 14=14 15=15 16=16 17=17 18=18 19=19 20=20 21=21 22=22 23=23 24=24 25=25 26=26 27=27 28=28 29=29 30=30 31=31 32=32 33=33 34=34 35=35 36=36 37=37 38=38 39=39
disable-packetcache=no
max-packetcache-entries=40000000
max-cache-entries=10000000
record-cache-shards=4096
packetcache-shards=4096
max-cache-ttl=86400
max-concurrent-requests-per-tcp-connection=20
max-tcp-clients=1024
packetcache-ttl=86400
packetcache-negative-ttl=120
packetcache-servfail-ttl=0
max-negative-ttl=7200
max-cache-bogus-ttl=14400
network-timeout=1600
tcp-fast-open=1
tcp-fast-open-connect=yes
version-string=Miu-Miu!
webserver=True
webserver-address=0.0.0.0
webserver-port=8082
webserver-allow-from=172.16.x.x/24
webserver-password=[CENSORED]
api-key=[CENSORED]

forward-zones=[excluded TLD]=x.x.x.x;x.x.x.x;
forward-zones=[excluded TLD]=x.x.x.x;x.x.x.x;
forward-zones-recurse=.=1.0.0.1;1.1.1.1;8.8.4.4;8.8.8.8;

omoerbeek commented 1 week ago

Your pastes are hard to read; please put triple backticks around the blocks instead.

rec_control show-yaml not working is strange. Can you show the exact output of the command?

One thing I spotted is that you have logging.quiet: False. That will generate a lot of logging.

Below is the conversion of your recursor.conf that I did locally (note that you need to fix a few entries because the original was redacted).

# Start of converted recursor.yml based on /tmp/x.conf
dnssec:
  aggressive_cache_min_nsec3_hit_ratio: 10000
  aggressive_nsec_cache_size: 300000
  validation: process
incoming:
  allow_from:
  - 127.0.0.0/8
  - 192.168.0.0/16
  - 172.16.0.0/12
  - ::1/128
  - fc00::/7
  - fe80::/10
  distribution_load_factor: 1.25
  distribution_pipe_buffer_size: 0
  distributor_threads: 20
  listen:
  - 127.0.0.1
  - 192.168.x.x
  - ::1
  max_concurrent_requests_per_tcp_connection: 20
  max_tcp_clients: 1024
  max_udp_queries_per_round: 65000
  pdns_distributes_queries: true
  port: 53
  tcp_fast_open: 1
logging:
  loglevel: 1
  quiet: true
  trace: no
outgoing:
  dont_query:
  - 127.0.0.0/8
  - 10.0.0.0/8
  - 100.64.0.0/10
  - 169.254.0.0/16
  - 192.168.0.0/16
  - 172.16.0.0/12
  - ::1/128
  - fc00::/7
  - fe80::/10
  - 0.0.0.0/8
  - 192.0.0.0/24
  - 192.0.2.0/24
  - 198.51.100.0/24
  - 203.0.113.0/24
  - 240.0.0.0/4
  - ::/96
  - ::ffff:0:0/96
  - 100::/64
  - 2001:db8::/32
  network_timeout: 1600
  source_address:
  - x.x.x.x
  tcp_fast_open_connect: true
  udp_source_port_max: 65530
  udp_source_port_min: 1024
packetcache:
  disable: false
  max_entries: 40000000
  negative_ttl: 120
  servfail_ttl: 0
  shards: 4096
  ttl: 86400
recordcache:
  max_cache_bogus_ttl: 14400
  max_entries: 10000000
  max_negative_ttl: 7200
  max_ttl: 86400
  shards: 4096
recursor:
  config_dir: /etc/powerdns
  cpu_map: 0=0 1=1 2=2 3=3 4=4 5=5 6=6 7=7 8=8 9=9 10=10 11=11 12=12 13=13 14=14 15=15 16=16 17=17 18=18 19=19 20=20 21=21 22=22 23=23 24=24 25=25 26=26 27=27 28=28 29=29 30=30 31=31 32=32 33=33 34=34 35=35 36=36 37=37 38=38 39=39
  forward_zones:
  - zone: '[excluded'
    recurse: false
    forwarders: []
  - zone: TLD]
    recurse: false
    forwarders:
    - x.x.x.x
    - x.x.x.x
  forward_zones_recurse:
  - zone: .
    recurse: true
    forwarders:
    - 1.0.0.1
    - 1.1.1.1
    - 8.8.4.4
    - 8.8.8.8
  hint_file: /etc/powerdns/root.hints
  max_mthreads: 4096
  max_total_msec: 7600
  qname_minimization: true
  setgid: pdns
  setuid: pdns
  stack_cache_size: 512
  stack_size: 1048576
  threads: 40
  version_string: Miu-Miu!
webservice:
  address: 0.0.0.0
  allow_from:
  - 172.16.x.x/24
  api_key: '[CENSORED]'
  password: '[CENSORED]'
  port: 8082
  webserver: true
# Validation result: incoming.listen: value `192.168.x.x' is not an IP or IP:port combination
# End of converted /tmp/x.conf
#

Khoshnevis commented 1 week ago

  1. I have fixed the markdown formatting problem.
  2. As for logging.quiet: False, this is temporarily set to False to assist in debugging.
  3. Here is the exact output when I attempted to use rec_control show-yaml:
    
    root@[CENSORED]:/etc/powerdns# rec_control help
    add-dont-throttle-names [N...]   add names that are not allowed to be throttled
    add-dont-throttle-netmasks [N...] add netmasks that are not allowed to be throttled
    add-nta DOMAIN [REASON]          add a Negative Trust Anchor for DOMAIN with the comment REASON
    add-ta DOMAIN DSRECORD           add a Trust Anchor for DOMAIN with data DSRECORD
    current-queries                  show currently active queries
    clear-dont-throttle-names [N...] remove names that are not allowed to be throttled. If N is '*', remove all
    clear-dont-throttle-netmasks [N...] remove netmasks that are not allowed to be throttled. If N is '*', remove all
    clear-nta [DOMAIN]...            Clear the Negative Trust Anchor for DOMAINs, if no DOMAIN is specified, remove all
    clear-ta [DOMAIN]...             Clear the Trust Anchor for DOMAINs
    dump-cache <filename>            dump cache contents to the named file
    dump-dot-probe-map <filename>    dump the contents of the DoT probe map to the named file
    dump-edns [status] <filename>    dump EDNS status to the named file
    dump-failedservers <filename>    dump the failed servers to the named file
    dump-non-resolving <filename>    dump non-resolving nameservers addresses to the named file
    dump-nsspeeds <filename>         dump nsspeeds statistics to the named file
    dump-saved-parent-ns-sets <filename>
                                 dump saved parent ns sets that were successfully used as fallback
    dump-rpz <zone name> <filename>  dump the content of a RPZ zone to the named file
    dump-throttlemap <filename>      dump the contents of the throttle map to the named file
    get [key1] [key2] ..             get specific statistics
    get-all                          get all statistics
    get-dont-throttle-names          get the list of names that are not allowed to be throttled
    get-dont-throttle-netmasks       get the list of netmasks that are not allowed to be throttled
    get-ntas                         get all configured Negative Trust Anchors
    get-tas                          get all configured Trust Anchors
    get-parameter [key1] [key2] ..   get configuration parameters
    get-proxymapping-stats           get proxy mapping statistics
    get-qtypelist                    get QType statistics
                                 notice: queries from cache aren't being counted yet
    get-remotelogger-stats           get remote logger statistics
    hash-password [work-factor]      ask for a password then return the hashed version
    help                             get this list
    list-dnssec-algos                list supported DNSSEC algorithms
    ping                             check that all threads are alive
    quit                             stop the recursor daemon
    quit-nicely                      stop the recursor daemon nicely
    reload-acls                      reload ACLS
    reload-lua-script [filename]     (re)load Lua script
    reload-lua-config [filename]     (re)load Lua configuration file
    reload-zones                     reload all auth and forward zones
    set-ecs-minimum-ttl value        set ecs-minimum-ttl-override
    set-max-aggr-nsec-cache-size value set new maximum aggressive NSEC cache size
    set-max-cache-entries value      set new maximum record cache size
    set-max-packetcache-entries val  set new maximum packet cache size
    set-minimum-ttl value            set minimum-ttl-override
    set-carbon-server                set a carbon server for telemetry
    set-dnssec-log-bogus SETTING     enable (SETTING=yes) or disable (SETTING=no) logging of DNSSEC validation failures
    set-event-trace-enabled SETTING  set logging of event trace messages, 0 = disabled, 1 = protobuf, 2 = log file, 3 = both
    show-yaml [file]                 show yaml config derived from old-style config
    trace-regex [regex file]         emit resolution trace for matching queries (no arguments clears tracing)
    top-largeanswer-remotes          show top remotes receiving large answers
    top-queries                      show top queries
    top-pub-queries                  show top queries grouped by public suffix list
    top-remotes                      show top remotes
    top-timeouts                     show top downstream timeouts
    top-servfail-queries             show top queries receiving servfail answers
    top-bogus-queries                show top queries validating as bogus
    top-pub-servfail-queries         show top queries receiving servfail answers grouped by public suffix list
    top-pub-bogus-queries            show top queries validating as bogus grouped by public suffix list
    top-servfail-remotes             show top remotes receiving servfail answers
    top-bogus-remotes                show top remotes receiving bogus answers
    unload-lua-script                unload Lua script
    version                          return Recursor version number
    wipe-cache domain0 [domain1] ..  wipe domain data from cache
    wipe-cache-typed type domain0 [domain1] ..  wipe domain data with qtype from cache

    root@[CENSORED]:/etc/powerdns# rec_control show-yaml
    Unknown command 'show-yaml', try 'help'

omoerbeek commented 1 week ago

What does rec_control --version show? I think you did not update rec_control.
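One way to check for such a mismatch is to compare what the tool reports about itself with what the running daemon reports. The sketch below is a minimal illustration: `rec_control --version` and `rec_control version` both appear in this thread, but the exact output format of the latter is an assumption, so the code only looks for a dotted version number in whatever text comes back. The helper names are hypothetical.

```python
import re
import subprocess


def extract_version(text: str):
    """Return the first dotted version number found in text, or None."""
    match = re.search(r"\d+\.\d+\.\d+[\w+~-]*", text)
    return match.group(0) if match else None


def run(args):
    """Run a command and return its stdout and stderr as one string."""
    result = subprocess.run(args, capture_output=True, text=True, check=False)
    return result.stdout + result.stderr


def versions_match(tool_output: str, daemon_output: str) -> bool:
    """True only if both outputs contain the same version string."""
    tool = extract_version(tool_output)
    daemon = extract_version(daemon_output)
    return tool is not None and tool == daemon
```

On the recursor host, `versions_match(run(["rec_control", "--version"]), run(["rec_control", "version"]))` returning False would suggest that the control tool and the running daemon come from different builds.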

Khoshnevis commented 1 week ago

root@[CENSORED]:/etc/powerdns# rec_control --version
rec_control version 5.1.1
root@[CENSORED]:/etc/powerdns#

omoerbeek commented 1 week ago

This is very puzzling. The only thing that I can think of is that your sources are not up to date. Can you try running a published version from repo.powerdns.com?

Khoshnevis commented 1 week ago

Thank you for your suggestion. I'll consider trying the published version from repo.powerdns.com to confirm.

However, even if we set aside the show-yaml issue, I’m still seeing dnsdist marking the recursor as "up" and "down" when using the recursor.yml config, but everything works perfectly when I switch back to the recursor.conf format.

Could the intermittent "up and down" behavior in dnsdist with the YAML config also be related to issues with my self-compiled version? Or do you think the issue lies elsewhere, in the configuration or in the handling of the YAML format?

I’d appreciate any insights on this, as the behavior difference between the two formats is also very puzzling.

omoerbeek commented 1 week ago

Your own compiled version of rec_control shows unexplained behaviour, so let's try to reproduce with an officially published version first and then diagnose further (if needed).

Khoshnevis commented 1 week ago

Thank you for your time and assistance. I’ll go ahead and try with the officially published version first, and I’ll reach out again if further diagnosis is needed.

Thanks again!