Recursor: Some valid cache entries are not being used causing unnecessary requests to external resolvers.

carlos-n commented 3 years ago

Program: Recursor
Issue type: Bug report

Short description

Some valid cache entries are not being used causing unnecessary requests to external resolvers.

Environment

Operating system: CentOS 7.6
Software version: 4.4.2 and higher versions
Software source: pdns-recursor-4.4.2-1pdns.el7.x86_64

Steps to reproduce

For confidentiality reasons the following does not describe our exact use case, but it is an example that illustrates the issue perfectly.

We have forward-zones-recurse=.=8.8.8.8 configured in our recursor. We use this configuration because we have limited access to internet from our recursor and thus we wouldn't be able to do the recursion ourselves.
We have an instance of PDNS Auth server with the BIND backend, where we have a zone called "lab.test.net" in which we have defined a CNAME record like this:

recursion-test 10800 CNAME api-global.netflix.com.
In forward-zones file we include a rule that instructs the recursor to go to our PDNS Auth instance in order to resolve any name belonging to "lab.test.net".

We try to resolve "recursion-test.lab.test.net" against our recursor instance. We obtain the following result (I have masked the actual IPs in the answer with XX.XX.XX.XX):

;; ANSWER SECTION:
recursion-test.lab.test.net. 8875 IN    CNAME   api-global.netflix.com.
api-global.netflix.com. 140     IN      CNAME   api-global.dradis.netflix.com.
api-global.dradis.netflix.com. 18 IN    CNAME   api-global.eu-west-1.origin.prodaa.netflix.com.
api-global.eu-west-1.origin.prodaa.netflix.com. 18 IN A XX.XX.XX.XX
api-global.eu-west-1.origin.prodaa.netflix.com. 18 IN A XX.XX.XX.XX
api-global.eu-west-1.origin.prodaa.netflix.com. 18 IN A XX.XX.XX.XX
api-global.eu-west-1.origin.prodaa.netflix.com. 18 IN A XX.XX.XX.XX
api-global.eu-west-1.origin.prodaa.netflix.com. 18 IN A XX.XX.XX.XX
api-global.eu-west-1.origin.prodaa.netflix.com. 18 IN A XX.XX.XX.XX
api-global.eu-west-1.origin.prodaa.netflix.com. 18 IN A XX.XX.XX.XX
api-global.eu-west-1.origin.prodaa.netflix.com. 18 IN A XX.XX.XX.XX

The first resolution from "recursion-test.lab.test.net" to "api-global.netflix.com" is provided locally by our PDNS Auth instance. The rest of the answer is provided by "8.8.8.8".

Expected behaviour

On subsequent requests to resolve "recursion-test.lab.test.net" we expect our recursor to answer from its cache without going to "8.8.8.8" for as long as the lowest TTL in the chain lasts. This worked as expected up to and including recursor version 4.3.7.
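To make the expectation concrete, here is a minimal sketch of the arithmetic, using the TTLs from the answer section above. The helper name `cache_lifetime` is made up for illustration and is not part of the recursor; it only shows that the whole chain should be answerable from cache for as long as its shortest-lived record is valid.

```python
# Records from the ANSWER section above, as (owner name, TTL in seconds).
# Only one A record is listed because all eight share the same TTL.
answer_chain = [
    ("recursion-test.lab.test.net.", 8875),
    ("api-global.netflix.com.", 140),
    ("api-global.dradis.netflix.com.", 18),
    ("api-global.eu-west-1.origin.prodaa.netflix.com.", 18),
]


def cache_lifetime(chain):
    """Return how many seconds the whole chain can be served from cache.

    The chain stays fully cacheable only while every record in it is
    still valid, so the limit is the smallest TTL.
    """
    return min(ttl for _name, ttl in chain)


print(cache_lifetime(answer_chain))  # prints 18
```

With these values, no query to "8.8.8.8" would be expected for about 18 seconds after the first resolution.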

Actual behaviour

On subsequent requests to resolve "recursion-test.lab.test.net" the recursor does not send any request to our PDNS Auth instance, because it has a valid cache entry with a positive TTL for "recursion-test.lab.test.net". However, it always sends a request to "8.8.8.8" for "api-global.netflix.com", even though it has valid cache entries with positive TTLs for this name.

Somehow, in this particular scenario of nested resolutions, our recursor does not consider the cache entries for the names it has resolved against "8.8.8.8" to be usable. A curious fact about these unnecessary requests to "8.8.8.8" is that they are sent with the recursion desired flag set to "0", even though "forward-zones-recurse" is active.

If we try to resolve "api-global.netflix.com" directly against our recursor, the behaviour is as expected and the recursor uses the cached entries for further requests until the TTL expires.
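For anyone who wants to check the recursion desired observation on their own captures, the following is a minimal sketch of reading the RD bit from the first bytes of a raw DNS message, based on the header layout in RFC 1035. The function names are made up for illustration, and `build_header` only produces a test header. It is not how the recursor builds its queries.

```python
import struct

# "Recursion desired" bit in the 16-bit flags word of the DNS header (RFC 1035).
RD_MASK = 0x0100


def build_header(query_id, rd):
    """Build a 12-byte DNS query header with one question, for testing only."""
    flags = RD_MASK if rd else 0
    # Fields: ID, flags, QDCOUNT, ANCOUNT, NSCOUNT, ARCOUNT (network byte order).
    return struct.pack("!HHHHHH", query_id, flags, 1, 0, 0, 0)


def recursion_desired(packet):
    """Return True if the RD bit is set in a raw DNS message."""
    _query_id, flags = struct.unpack("!HH", packet[:4])
    return bool(flags & RD_MASK)


print(recursion_desired(build_header(0x1234, rd=True)))   # prints True
print(recursion_desired(build_header(0x1234, rd=False)))  # prints False
```

Applied to the UDP payload of a captured query towards "8.8.8.8", `recursion_desired` would return False for the queries described above.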

Other information

I have compared the cache entries in the working case (version 4.3.7) and the failing one (versions 4.4.2 and 4.5.2) and they are identical. I suspect a change of behaviour for this kind of scenario was introduced in branch 4.4.x and inherited by branch 4.5.x.

rgacogne commented 3 years ago

Would you mind providing your full configuration? Otherwise we need to guess which settings you are running with and it wastes everyone's time if we guess wrong. A full trace (logs when running the recursor with --trace) of the initial and further requests would be very helpful as well, if possible, or otherwise a targeted trace using rec_control trace-regex 'recursion-test\.lab\.test\.net\.$' before sending the queries.

carlos-n commented 3 years ago

I attach recursor configuration and forward-zones file for server with version 4.3.7 and server with 4.4.2. I also attach traces for both servers. Thanks so much.

recursor-config_v4-4-2.txt traces_v4-3-7.txt traces_v4-4-2.txt forward-zones-v4-3-7.txt forward-zones-v4-4-2.txt recursor-config_v4-3-7.txt

rgacogne commented 3 years ago

Thanks! I think what is happening is caused by a change introduced in https://github.com/PowerDNS/pdns/pull/9351. We used to disable qname minimization for all forwards and after this PR it seems that we only disable for recursive forwards, and in your case the first name is for a non-recursive forward so since 4.4.0 QM stays enabled. We should still look at the cache even with QM enabled, so I believe this is a bug. I'm guessing you do not care about qname minimization since you forward everything to 8.8.8.8 anyway, so perhaps you could disable QM as a work-around by setting: qname-minimization=no. If you do that, please report back so we can narrow the issue down :)
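As background for readers unfamiliar with qname minimization, here is a deliberately simplified sketch of the idea: instead of sending the full name straight away, the resolver asks for progressively longer names. The function `minimized_qnames` is hypothetical and walks one label at a time from the TLD. A real implementation, including the recursor, starts from what it already knows and may take larger steps, so this is not a description of the recursor's actual algorithm.

```python
def minimized_qnames(qname):
    """List the names a naive qname-minimizing resolver would ask for.

    Simplification: start at the TLD and add exactly one label per step.
    """
    labels = qname.rstrip(".").split(".")
    # Walk from the rightmost label (the TLD) towards the full name.
    return [".".join(labels[i:]) + "." for i in range(len(labels) - 1, -1, -1)]


for name in minimized_qnames("api-global.netflix.com."):
    print(name)
# prints:
# com.
# netflix.com.
# api-global.netflix.com.
```

When everything is forwarded to a single upstream resolver, these intermediate steps bring little benefit, which is why disabling QM is suggested here as a work-around.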

carlos-n commented 3 years ago

I've done some tests with qname-minimization=no and it seems to work as it should. I will test some more and let you know if everything is OK. Thanks again !!!!

spirillen commented 2 years ago

Hi @carlos-n, how did your tests come out?

I'm curious as I use a similar setup, so without actually testing this I suppose it concerns me as well.

carlos-n commented 2 years ago

Hi everyone,

Sorry for the silence; I've been away from this subject for a while. The behaviour of the recursor improved after disabling QM and we are using this configuration in production, but the behaviour of 4.4.x is still different from branch 4.3.x. It is not as noticeable as before disabling QM, but I still find some odd scenarios.

Before branch 4.4.x, the rules in "forward-zones" had top priority when routing requests, regardless of anything the recursor had cached. In versions 4.4.x, however, I'm finding cases in which a cached NS record prevails over the "forward-zones" rules. This is a game changer with unpredictable and usually bad consequences for everyone who (like me) relied on "forward-zones" as the master routing decision table for their architecture.

Should "forward-zones" still prevail over everything else in branches 4.4.x and later? Or is it expected behaviour that sometimes it doesn't?

Thanks in advance!

omoerbeek commented 2 years ago

Hi,

Let me try to explain.

Yes, the behaviour with respect to forward-zones (but not forward-zones-recurse) has changed. Since 4.4.x, NS records learned from hosts forwarded to are used to resolve names in subdomains. See https://docs.powerdns.com/recursor/settings.html#forward-zones for an explanation why (this explanation was added quite recently). I'd say the old behaviour was buggy, or at least not very useful.

This does have consequences: you can only forward to servers that are authoritative for the domain and NS records coming from these servers should point to proper authoritative servers for subdomains of the forwarded domain. If that is not the case, things might break.

forward-zones-recurse is different: in this case the target only needs to be able to resolve (all) names in the forwarded domain and no NS complications occur.

There still might be bugs of course. So if you still think you hit a bug after this explanation, please show us traces so we can investigate.

carlos-n commented 2 years ago

Thanks for the info. I'm going to check whether the cases I've been seeing match this policy. I'm afraid they do.

webfutureiorepo commented 2 years ago

Hello,

If I may ask a few questions.

1. I see NSD has refreshed their logos and has newer, cleaner code, their own open-source EPP registry, router software and their own BGP version. What I miss the most is that they, like Knot, have their own advanced but still simple web GUI to control the whole set of DNS servers. That is the only thing I would love for you to make available. I know you have an excellent one for PowerDNS, but it is incorporated into the XO Software. Is it possible for you to make a simple version of it, or the exact same version without the office/desktop and all the cloud parts?

2. We used PostgreSQL a lot, but as soon as we set up a PostgreSQL replica on another server, it does not crash but it breaks PowerDNS every time. Which DB is the best and most stable to use with a replica or a master-slave DB? I have read a lot of positive things from those who use SQLite. Some even made the new Redis 6 work even better, and now it is Redis 7. It even has its own Redis Stack now with an incredible GUI. It can do multiple things at the same time, being 3x faster for PowerDNS than SQLite, MySQL and MariaDB, and do caching for the web. I have even seen some say that CockroachDB is incredible and works with pdns. Just wondering, since there is a huge difference in what they support and, most of all, in traffic.

3. PowerDNS, dnsdist and the PowerDNS Recursor: should it be, or is it, normal to have all three working in harmony together without any sort of fuss? No need for databases or BIND, which breaks most of what it touches. In theory those three were made for each other, right? The PDNS Auth as master, Recursors as slave, and dnsdist as cache, buffer, stats and so on (and a web GUI if you wouldn't mind, as in question 1, or a hint about, for example, Node.js, Django and a DB). I have read all your documentation online, and older code, but had no luck finding out how to set that up. Any ideas or how-to guides?

4. Since we have been a DigiCert partner for over 20 years, I have a free DigiCert Business Plus Wildcard, with PKI, and monitoring and scanning for viruses, intrusions and so on, but they don't have a PowerDNS plugin to control it. So if anyone knows if Netstat, Dynatrace,


phonedph1 commented 2 years ago

> forward-zones-recurse is different: in this case the target only needs to be able to resolve (all) names in the forwarded domain and no NS complications occur.

The thing to note here is you can (probably always?) still use this to get the same behaviour as before even when talking to auths. Unless you have something that actually cares if it got rd=1 or not.
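In a forward-zones file, the difference between the two forwarding modes discussed above comes down to a leading "+" on the zone name, which marks an entry as behaving like forward-zones-recurse. The sketch below parses such lines to make that visible. The parser `parse_forward_line` and the address 192.0.2.53 for the auth server are made up for illustration, and the handling of separators and comments is an assumption rather than a copy of the recursor's own parser.

```python
def parse_forward_line(line):
    """Parse one forward-zones-file line into zone, recurse flag and servers.

    Returns None for blank lines and comment-only lines.
    """
    line = line.split("#", 1)[0].strip()  # drop trailing comments
    if not line:
        return None
    zone, _sep, targets = line.partition("=")
    zone = zone.strip()
    recurse = zone.startswith("+")  # "+" means: forward with rd=1
    # Assumption: accept both "," and ";" between server addresses.
    servers = [s.strip() for s in targets.replace(";", ",").split(",") if s.strip()]
    return {"zone": zone.lstrip("+"), "recurse": recurse, "servers": servers}


# 192.0.2.53 is a placeholder for the PDNS Auth instance.
print(parse_forward_line("+.=8.8.8.8"))
print(parse_forward_line("lab.test.net=192.0.2.53"))
print(parse_forward_line("+lab.test.net=192.0.2.53"))
```

The third line is the variant suggested here: forwarding to the auth server with the "+" prefix, which should restore the pre-4.4 routing behaviour as long as the target does not care whether it receives rd=1.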
