Useless error on RPZ-zone loading + a few suggestions

spirillen commented 4 years ago

Program: Recursor
Issue type: Bug report, Feature request

Short description

I'm running a large RPZ zone (650.000+ records, RPZ records only) where I suddenly got this error in the log files:

To get the full picture on this issue you should also read #9035

Loading RPZ zone 'rpz.mypdns.cloud' from [2a01:4f8:1c1c:abe4::53]:53
ns1 pdns_recursor[23137]: Loaded & indexed 0 policy records so far for RPZ zone 'rpz.mypdns.cloud'
Apr 14 23:20:18 ns1 pdns_recursor[23137]: Unable to load RPZ zone 'rpz.mypdns.cloud' from '2a01:4f8:1c1c:abe4::53': 'Adding a QName-based filter policy of kind . but a policy of kind . already exists for the following QName: ccp.ac'. (Will try again in 60 seconds...)

It turns out this error was rather useless, as the real issue was an rpz-client-ip record that was missing its CIDR prefix length.

was:
222.186.190.92.rpz-client-ip.rpz.mypdns.cloud   CNAME   rpz-drop.

should be:
32.222.186.190.92.rpz-client-ip.rpz.mypdns.cloud    CNAME   rpz-drop.
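
For reference, a minimal sketch of a pre-flight check that would have flagged this record before the zone was served. It covers IPv4 triggers only and assumes the encoding as I read it in the RPZ draft: a prefix length between 1 and 32 as the first label, followed by the four octets in reverse order. The function name `check_ipv4_trigger` is made up for illustration and is not part of PowerDNS.

```python
import ipaddress
from typing import Optional

# Trigger labels whose owner names carry an encoded IP prefix.
IP_TRIGGERS = ("rpz-ip", "rpz-client-ip", "rpz-nsip")


def check_ipv4_trigger(owner: str, zone: str) -> Optional[str]:
    """Return an error message for a malformed IPv4 trigger owner name.

    Returns None when the name looks fine or is not an IPv4 IP trigger
    (QName triggers and IPv6 triggers are out of scope for this sketch).
    """
    owner = owner.rstrip(".").lower()
    zone = zone.rstrip(".").lower()
    if not owner.endswith("." + zone):
        return None
    labels = owner[: -len(zone) - 1].split(".")
    if labels[-1] not in IP_TRIGGERS:
        return None
    parts = labels[:-1]
    if any(not p.isdigit() for p in parts):
        return None  # IPv6 (or something else): not checked here
    if len(parts) != 5:
        return (f"{owner}: expected <prefixlen>.<4 reversed octets>, "
                f"got {len(parts)} numeric labels")
    prefix, octets = int(parts[0]), parts[1:]
    if not 1 <= prefix <= 32:
        return f"{owner}: prefix length {prefix} is outside 1..32"
    try:
        # Octets are stored in reverse order in the owner name.
        ipaddress.IPv4Address(".".join(reversed(octets)))
    except ValueError as exc:
        return f"{owner}: {exc}"
    return None


zone = "rpz.mypdns.cloud"
for name in ("222.186.190.92.rpz-client-ip.rpz.mypdns.cloud",
             "32.222.186.190.92.rpz-client-ip.rpz.mypdns.cloud"):
    print(name, "->", check_ipv4_trigger(name, zone) or "ok")
```

Running something like this over the owner names before publishing the zone would point at the offending record directly, instead of at an unrelated QName.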

As the log indicated that the problem was in the first record of the zone file, it took quite some time to locate the record that was actually in error.

I also checked the zone records with pdnsutil rectify-zone and check-zone; neither reported any errors, both said success.

Reproduce

pdnsutil add-record zone "32.222.186.190.92.rpz-client-ip.rpz" CNAME "rpz-drop."

Now take a look in the DB and you'll see that the ending dot for the rpz-drop is missing. Again, the RPZ syntax might trigger a different check of the records added.

Environment

Operating system: Ubuntu 18.04
Software version: master
Software source: Repo

Steps to reproduce

See above record example

Expected behavior

I would expect the error log to print the exact record that contains an error, and as this is an RPZ zone (primarily NXDOMAIN), I would expect/hope the other 99.9% to be loaded, with the log showing the troubled record to be fixed (a.s.a.p.).

On IRC there was a comment about failing to load the zone:

Some people might use RPZ to host actual records they need for services. I mean, I don't, but I'm considering it.

In my (current) opinion this is wrong, but of course doable. To host actual "master" records I would expect an actual auth zone. These are my initial thoughts on this, and I am open to comments.

Habbie commented 4 years ago

pdnsutil should recognize this type of error, perhaps triggered by any of the rpz-(client-)ip or rpz-drop syntax.

The RPZ code lives in the Recursor, pdnsutil is part of the auth, so this suggestion does not make sense in this form. A feature request for a standalone RPZ syntax checker would need to be a separate ticket.

Now take a look in the DB and you'll see that the ending dot for the rpz-drop is missing.

That is normal.

rgacogne commented 4 years ago

I would expect the error log to print the exact record that contains an error, and as this is an RPZ zone (primarily NXDOMAIN), I would expect/hope the other 99.9% to be loaded, with the log showing the troubled record to be fixed (a.s.a.p.).

I agree we could do better at reporting the exact record causing the issue. I however don't think we should ignore that error: there is clearly something wrong in that RPZ zone, and I believe we should either load a complete zone or not load it at all. Loading a zone with invalid content would lead to unexpected issues most of the time.
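
To make the two points concrete (name the exact offending record, and load all or nothing), here is a rough sketch. It is not the Recursor's actual loading code; `Record`, `ZoneLoadError`, `load_zone` and the `validate` callback are hypothetical names used only for illustration.

```python
from typing import Callable, Iterable, List, NamedTuple, Optional, Tuple


class Record(NamedTuple):
    owner: str
    rtype: str
    content: str


class ZoneLoadError(Exception):
    """Raised when at least one record is rejected; lists every offender."""

    def __init__(self, problems: List[Tuple[int, Record, str]]):
        self.problems = problems
        details = "; ".join(
            f"record #{index} ({rec.owner} {rec.rtype} {rec.content}): {msg}"
            for index, rec, msg in problems
        )
        super().__init__(details)


def load_zone(
    records: Iterable[Record],
    validate: Callable[[Record], Optional[str]],
) -> List[Record]:
    """Stage every record, then commit all of them or none.

    `validate` returns None for a good record or an error message.
    Nothing is returned (and so nothing is applied) if any record fails.
    """
    staged: List[Record] = []
    problems: List[Tuple[int, Record, str]] = []
    for index, record in enumerate(records, start=1):
        message = validate(record)
        if message is None:
            staged.append(record)
        else:
            problems.append((index, record, message))
    if problems:
        # Refuse the whole zone, but say exactly which records are wrong.
        raise ZoneLoadError(problems)
    return staged
```

The zone is still rejected as a whole, but the exception message carries the position and content of each bad record, which is the part the current log line is missing.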

Some people might use RPZ to host actual records they need for services. I mean, I don't, but I'm considering it.

I'm not sure I understand your point, custom records with non-RPZ targets are already supported? However they can't be a CNAME to rpz-*, the RPZ spec reserves these names in section 3.6. "The "Local Data" Action (arbitrary RR types)".

Habbie commented 4 years ago

I'm not sure I understand your point,

The quote is an argument from a different user on IRC, who also said 'ignoring errors is bad'.

spirillen commented 4 years ago

I would expect the error log to print the exact record that contains an error, and as this is an RPZ zone (primarily NXDOMAIN), I would expect/hope the other 99.9% to be loaded, with the log showing the troubled record to be fixed (a.s.a.p.).

I agree we could do better at reporting the exact record causing the issue. I however don't think we should ignore that error: there is clearly something wrong in that RPZ zone, and I believe we should either load a complete zone or not load it at all. Loading a zone with invalid content would lead to unexpected issues most of the time.

Then what about a switch? It could be added to the RPZ configuration and "automatically" add a TXT reply to any working record in that zone, stating that there is at least one troubled record in that zone. For example, if it is a "whitelisting" zone where all records intentionally hold the rpz-passthru. action, I would prefer that the 99% still be loaded as rpz-passthru., rather than risk blocking thousands of domains from passing through to client requests.

dig github.com @whitelisted.records.rpz
github.com.             60      IN      A       NXDOMAIN
github.com.             60      IN      TXT   "There is at least one troubled record in this zone"

Then, when you are trying to troubleshoot why you can't reach github.com and your first tool is dig/drill or similar, you are notified straight away: look in your server's log files NOW.

update note: just a typo

PowerDNS / pdns