essandess / adblock2privoxy

Convert adblock config files to privoxy format
https://hackage.haskell.org/package/adblock2privoxy
GNU General Public License v3.0
93 stars 16 forks source link

Double rules on filters are too broad #10

Closed wmyrda closed 6 years ago

wmyrda commented 6 years ago

Converter creates double rules on some filter sets which lead to unintended blocks.

For example czech ruleset for instance from record /emonitor. creates

# /emonitor. (filters.txt: 35)
/(?:(*PRUNE).*?/)?emonitor\.
emonitor.*.

and polish ruleset

# /inst. (adblock_social_list.txt: 84)
/(?:(*PRUNE).*?/)?inst\.
inst.*.
# /ins. (adblock_social_list.txt: 83)
/(?:(*PRUNE).*?/)?ins\.
ins.*.
# /in. (adblock_social_list.txt: 82)
/(?:(*PRUNE).*?/)?in\.
in.*.

While the first rule for each is OK, the later is interpreted by privoxy as the rule for whole word instead of subdomain of *. In last case it leads to blocking of interia.pl which should never happen.

wmyrda commented 6 years ago

As workaround I used sed -i -e '/\.\*\./s/^/#/' ab2p.action which hashed out all the paterns containing .*. which are as bug reporter in original code has put it to greedy

Hashing them out seems best way to go for now as most of those have a proper line above.

EDIT: Line above hashes out even some properly generated filters. Below hashes those lines, but with exception of the rules containing word PRUNE sed -i -e '/\.\*\./{/PRUNE/b;s/^/#/}' ab2p.action

essandess commented 6 years ago

Thanks for this. I (believe that I) added these regex’s for privoxy rules so I need to give it some thought.

Would you please break it down very simple which (example) rule causes the issue, and where the resulting issue is visible. I’m not entirely tracking the basic issue as you’ve described it.

wmyrda commented 6 years ago

using polish adblock list creates in action file rule in.*. which blocks interia.pl which I believe is unintended and should be blocked only in case the website would be in.interia.pl or in.interia.com.

wmyrda commented 6 years ago

Here I got solution in python which better deals with extra rules needlessly set leaving out only the first as the one best matching original record.

essandess commented 6 years ago

I see this too. I’m going to try to pick a representative example from easylist.txt to track this down. Here’s one in ab2p.action:

# @@||www.google.*/aclk?*&adurl=$subdocument,~third-party,domain=google.ca|googl
e.co.in|google.co.nz|google.co.uk|google.com|google.com.au|google.com.eg|google.
de|google.es (easylist.txt: 69024)
.www.google.*./(?:(*PRUNE).*?/)?aclk\?(*PRUNE).*?&adurl=

I added the regex PRUNE rules for wildcards to PatternConverter.hs in https://github.com/essandess/adblock2privoxy/commit/a68371e2a9d87e24337605fed3b325d09d517703#diff-65132c467ce6dbf2a88b898ab223a563, so the first thing I must do is make sure I didn’t mess anything up when I did that.

Added: This transformation of the original rule does not appear to be correct. Transforming by hand, the privoxy regex in ab2p.action should be (per privoxy.org/user-manual/actions-file.html):

www.google.*/aclk\?(*PRUNE).*?&adurl=

Note:

The pattern matching syntax is different for the host and path parts of the URL. The host part uses a simple globbing type matching technique, while the path part uses more flexible "Regular Expressions" (POSIX 1003.2).

Do you agree that that .*. pattern should not be there in this example?

Added: This transformation of the original rule does not appear to be correct. Transforming by hand, the privoxy regex in ab2p.action should be:

.www.google.(?:(*PRUNE).*?/)?aclk\?(*PRUNE).*?&adurl=

(And I’m not sure why the initial . is there—I’ll have to go review privoxy’s syntax.)

essandess commented 6 years ago

@wmyrda Does the diff you posted in https://github.com/essandess/adblock2privoxy/issues/19#issuecomment-407112836 fix these issues?

I haven’t wrapped my head around this code block yet, but a few issues come to mind: 1. It looks like the String ([Char]) initial fragment '.' : '*' is intended to be a wildcard, in which case this should be replaced by the regex (*PRUNE).*?. (Privoxy host patterns use globbing.)

  1. Why did you find it necessary to add the changeMiddle in the pattern match?
  2. Which part of the code creates the double rules?
essandess commented 6 years ago

Here’s another example of incorrect transformation:

# /antiadblockmsg. (antiadblockfilters.txt: 85)
/(?:(*PRUNE).*?/)?antiadblockmsg\.
antiadblockmsg.*.

I don’t see how the function makePattern could mangle this if it’s being sent correctly parsed url parts. That would narrow down the issue to the function parseUrl. Also, there’s only a single instance of the string antiadblockmsg so why are there two privoxy rules for /antiadblockmsg.. Do you agree @wmyrda or have I missed something?

essandess commented 6 years ago

The issue is definitely in parseUrl. I’ve been playing around with Debug.Trace statements with a single-rule test.txt file and see that this function currently returns the observed list of (incorrect) [String]s that appear in ab2p.action when the rule is a query with an initial /.

Focus attention on fixing urlParse.

These rules illustrate this function working correctly and incorrectly:

/foo.
/bar1.html
bar2.html
?q=bar3.html
*/ads/*
/banner/*/img^
||ads.example.com^

Incorrect:

/foo. ➡️

/(?:(*PRUNE).*?/)?foo\.
foo.*.

/bar1.html ➡️

/(?:(*PRUNE).*?/)?bar1\.html
bar1.html*.

bar2.html ➡️

/(*PRUNE).*?bar2\.html
.*bar2.html*.

Correct:

?q=bar3.html ➡️ /(*PRUNE).*?\?q=bar3\.html

*/ads/* ➡️ /(?:(*PRUNE).*?/)?ads/(*PRUNE).*?

/banner/*/img^ ➡️ /(?:(*PRUNE).*?/)?banner/(*PRUNE).*?/img[^\w%.-]

||ads.example.com^ ➡️ .ads.example.com

Fixing this will require deconstructing the functional syntax used throughout urlParse.

essandess commented 6 years ago

I fixed these issues and others in the latest commit, https://github.com/essandess/adblock2privoxy/commit/e6bc337a5800a835ff621e6731316a33f82fa03c#diff-65132c467ce6dbf2a88b898ab223a563. I tested many different patterns and believe that rule parsing is working correctly now.

The function urlParse is amazing amalgamation of Monads and Text.Parsec functions.

A few comments if this must ever be revisited: