PyFunceble / adblock-decoder

A set of tools for the decoding and conversion of AdBlock and filter lists.
https://pyfunceble.github.io

Invalid output #3

Open smed79 opened 1 year ago

smed79 commented 1 year ago

Testing the below list

||com/*?adver=123.456
||aaa.*/ads1/
||aaa.*/ads2/
||bbb.com^*ads3
||ccc.com^*ads4
||ddd.com*.ads.com
||eee.com/*$image,domain=fff.com
||ggg.hhh-*
||ggg.hhh-*
||ggg.hhh*iii-jjj
|http://kkk.com/ads/*
|https://kkk.lll/*
|https://ads.mmm.nnn^
||ooo.com/*.ppp
||qqq.com/img1*.ads8
||qqq.com/img2/*.ads9
||qqq.com/img3/.*./ads0
||qqq.com/img4/*.*/
||qqq.com/img5/*..*/
||rrr.com/*.php?123
||sss.com^*/img/
||ttt.com/*/ban.js
||uuu.com/*$script

Output

?adver=123.456
aaa.
aaa.
bbb.com
ccc.com
.ads.com
ddd.com
eee.com
ggg.hhh-
ggg.hhh-
ggg.hhh
.ppp
ooo.com
.ads8
qqq.com
.ads9
qqq.com
.
qqq.com
.
qqq.com
..
qqq.com
.php
rrr.com
sss.com
ttt.com
uuu.com

The expected output should be

kkk.lll
ads.mmm.nnn

funilrys commented 1 year ago

@smed79 please review the test cases before I deploy/release my change: https://github.com/funilrys/PyFunceble/commit/d32914b7dd1381f47e7e96d76458e986c3550cf0#diff-6fbb548d14d904b48cdaa09ea8c1ca04249d69cef0763217ac957605c50548a6R278-R326

Let me know if I missed a test case.

Stay safe and healthy! Thank you for your patience.

smed79 commented 1 year ago

1st,

There is no such case in adblock (plus) syntax.

Blocked requests (files or domains) cannot be separated by commas, so the correct syntax has to be

or

or we have to use a regex rule, as below

2nd,

Excuse my ignorance, I have a question ...

A set of tools for the decoding and conversion of AdBlock and filter lists. (https://github.com/PyFunceble/adblock-decoder#adblock-filter-list-decoder)

what is the intended behavior ?

extracting all domains for testing purpose (ACTIVE, INACTIVE or INVALID) <-- Case 1

or

extracting only domains that are safe to be blocked ? <-- Case 2

For the second case (safe), the tool has to extract only domains from filters that are flagged with the third-party option or that end with the ^ symbol.

||axample.com^$third-party
||ads.example.net^

I mean

$ grep -E "^\|\|[a-z0-9.-]+\^([\$]third-party)?$" adblock.list

More aggressive: also include popup filters

||axample.com^$popup,third-party
||axample.com^$popup
grep -E "^\|\|[a-z0-9.-]+\^([$](popup([,](third-party)?)?|third-party))?$" adblock.list

clean output (with 0.0.0.0)

grep -E "^\|\|[a-z0-9.-]+\^([$](popup([,](third-party)?)?|third-party))?$" adblock.list | sed 's/\^.*//' | sed 's/||/0.0.0.0 /' > hosts.list
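
For anyone who prefers Python over grep/sed, the last pipeline above can be sketched roughly as follows. This is only a port of the regex from this comment, not the way PyFunceble or adblock-decoder work internally; `SAFE_RULE` and `to_hosts` are names made up for the illustration.

```python
import re

# Same pattern as the grep call above: "||", a host, "^", then optionally
# "$third-party", "$popup" or "$popup,third-party".
SAFE_RULE = re.compile(
    r"^\|\|([a-z0-9.-]+)\^(?:\$(?:popup(?:,(?:third-party)?)?|third-party))?$"
)


def to_hosts(lines):
    """Yield a '0.0.0.0 <host>' entry for every line matching SAFE_RULE."""
    for line in lines:
        match = SAFE_RULE.match(line.strip())
        if match:
            yield f"0.0.0.0 {match.group(1)}"


sample = [
    "||axample.com^$third-party",
    "||ads.example.net^",
    "||axample.com^$popup,third-party",
    "||axample.com^$popup",
    "||ttt.com/*/ban.js",   # targets a file, skipped
    "||uuu.com/*$script",   # targets a request type, skipped
]
hosts = list(to_hosts(sample))
print("\n".join(hosts))
```

Like the grep version, this does not deduplicate, so a domain that appears in several matching filters is emitted once per filter.
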

funilrys commented 1 year ago

@smed79, I don't create such complex lists on my own, so I'm happy to have input from the community.

1st,

  • |http://example.com,https://example.de$script,image,domain=example.org|foo.example.net

There is no such case in adblock (plus) syntax.

That's good to know. Will be fixed.

what is the intended behavior ?

Actually both. But I'm willing to make some changes. Please keep in mind that the adblock-decoder is actually a wrapper around the functionality of PyFunceble.

What you describe as Case 1 is the behavior of the aggressive mode, whereas Case 2 should be the default behavior of PyFunceble.

for the second case (safe), the tool has to extract only domains that are flagged with the third-party option or limited with the symbol ^ at the end.

That's interesting. If everyone (cc: @Yuki2718 | @ryanbr | please flag others) agrees on that, I can only see improvement.

I (and probably the community too) would be grateful if you could take the time to check the test cases and let me know:

I'll then follow up with a complete rewrite of the decoder module.

Yuki2718 commented 1 year ago

TBH I don't understand what the issue is. I see that ?adver=123.456 and .ppp are invalid, but I don't know why the expected result is

kkk.lll
ads.mmm.nnn

only - what's wrong with extracting bbb.com? Judging by the test cases, github.com and hello.world should be extracted from ~github.com,hello.world##.wrapper by default and checked for their status. I have long assumed that this is the default behavior of PF, but it isn't? Sure, comments should be skipped, and ##[href^="https://funceble.funilrys.com/"] is something in between: personally I want this to be scanned as well, but it should probably be optional, so aggressive-only makes sense.

Yuki2718 commented 1 year ago

So I changed adblock_aggressive to true and scanned, and found that it returns more domains than before, all of them from cosmetic filters. @funilrys As said above, github.com and hello.world should be extracted from ~github.com,hello.world##.wrapper by default. Also, can you add a command line argument --adblock_aggressive so that the aggressive mode can be used without editing yaml files?
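
To make the cosmetic-filter case concrete, here is a minimal sketch of pulling the domain list out of an element hiding rule such as ~github.com,hello.world##.wrapper. It is an illustration under simple assumptions (only the ## and #@# separators, no extraction of href values from the selector), and `cosmetic_domains` is a hypothetical helper, not a PyFunceble function.

```python
import re

# "##" introduces an element hiding rule, "#@#" an exception to one.
COSMETIC_SEPARATOR = re.compile(r"#@?#")


def cosmetic_domains(rule):
    """Return the sorted domains listed before the ## / #@# separator."""
    if rule.startswith("!"):
        return []  # comment line
    parts = COSMETIC_SEPARATOR.split(rule, maxsplit=1)
    if len(parts) != 2:
        return []  # not a cosmetic filter
    # Domains are comma separated; a leading "~" only negates the match.
    domains = (item.strip().lstrip("~") for item in parts[0].split(","))
    return sorted(domain for domain in domains if domain)


for rule in (
    "~github.com,hello.world##.wrapper",
    "bing.com,bingo.com#@##adBanner",
    "##.ad-href1",
    "!funilrys.com##body",
):
    print(rule, "->", cosmetic_domains(rule))
```

Whether these domains belong in the default output or only in the aggressive one is exactly the open question here; the sketch only shows what would be extracted.
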

smed79 commented 1 year ago

but don't know why the expected result is

kkk.lll
ads.mmm.nnn

In the other cases we are targeting a specific file¹, folder², request type³ (image, script ...) or applying the filter for a specific⁴ website.

||ttt.com/*/ban.js <--¹
||sss.com^*/img/ <--²
||uuu.com/*$script <--³
||eee.com/*$image,domain=fff.com <--⁴

So in the above example, the output should not include ttt.com, sss.com, uuu.com or eee.com under the default behavior.

If an adblock list has the filter ||www.google.com/ads/*, we will not block www.google.com in our hosts file.
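
The reasoning above can be sketched as a small classifier. The rules it encodes are assumptions read off this thread, not PyFunceble's actual logic: a filter counts as hosts-safe only when nothing but "", "^" or "/*" follows the host and its only options are third-party and/or all. `classify` and the constants are hypothetical names.

```python
import re

# Assumption: a literal host has at least one dot and no wildcard in it.
HOST = r"[a-z0-9-]+(?:\.[a-z0-9-]+)+"
RULE = re.compile(
    rf"^(?:\|\||\|[a-z]+://)(?P<host>{HOST})(?P<rest>[^$]*)(?:\$(?P<opts>.*))?$"
)
WHOLE_DOMAIN_TAILS = {"", "^", "/*"}      # nothing narrower than the whole host
HARMLESS_OPTIONS = {"third-party", "all"}  # options that do not narrow the rule


def classify(rule):
    """Return (host, reason); host is None when the rule is not hosts-safe."""
    match = RULE.match(rule.strip())
    if match is None:
        return None, "no literal host after || or |scheme://"
    if match["rest"] not in WHOLE_DOMAIN_TAILS:
        return None, "targets a specific file, folder or pattern"
    options = [opt for opt in (match["opts"] or "").split(",") if opt]
    if any(opt.startswith("domain=") for opt in options):
        return None, "applies only to specific websites"
    if any(opt not in HARMLESS_OPTIONS for opt in options):
        return None, "targets a specific request type"
    return match["host"], "blocks the whole domain"


for rule in (
    "||ttt.com/*/ban.js",
    "||sss.com^*/img/",
    "||uuu.com/*$script",
    "||eee.com/*$image,domain=fff.com",
    "||www.google.com/ads/*",
    "||ads.example.net^",
):
    print(rule, "->", classify(rule))
```

Under these assumptions, the list from the first comment yields only kkk.lll and ads.mmm.nnn, which is the expected output given there.
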

please flag others

@mapx- @okiehsch @Alex-302 @AdamWr @Khrin any test/comment will be appreciated (if you have some free time, of course).

smed79 commented 1 year ago

@funilrys see my comments after the # sign

        {
            "subject": '##[href^="https://funceble.funilrys.com/"]',
            "expected": {
                "aggressive": ["funceble.funilrys.com"],
                "standard": [],
            },
        },
        {
            "subject": "||test.hello.world^$domain=hello.world",
            "expected": {
                "aggressive": ["hello.world", "test.hello.world"],
                "standard": ["test.hello.world"], # should be null because the filter is applied to a specific website
            },
        },
        {
            "subject": '##div[href^="http://funilrys.com/"]',
            "expected": {"aggressive": ["funilrys.com"], "standard": []},
        },
        {
            "subject": 'com##[href^="ftp://funceble.funilrys-funceble.com/"]',
            "expected": {
                "aggressive": ["funceble.funilrys-funceble.com"],
                "standard": [],
            },
        },
        {
            "subject": "!@@||funceble.world/js",
            "expected": {"aggressive": [], "standard": []},
        },
        {
            "subject": "!||world.hello/*ad.xml",
            "expected": {"aggressive": [], "standard": []},
        },
        {
            "subject": "!funilrys.com##body",
            "expected": {"aggressive": [], "standard": []},
        },
        {
            "subject": "[AdBlock Plus 2.0]",
            "expected": {"aggressive": [], "standard": []},
        },
        {
            "subject": "@@||ads.example.com/notbanner^$~script",
            "expected": {"aggressive": ["ads.example.com"], "standard": []},
        },
        {"subject": "/banner/*/img^", "expected": {"aggressive": [], "standard": []}},
        {
            "subject": "||ad.example.co.uk^",
            "expected": {
                "aggressive": ["ad.example.co.uk"],
                "standard": ["ad.example.co.uk"],
            },
        },
        {
            "subject": "||ad.example.fr^$image,test",
            "expected": {
                "aggressive": ["ad.example.fr"],
                "standard": ["ad.example.fr"], # should be null because we are targeting a specific request type
            },
        },
        {
            "subject": "||api.funilrys.com/widget/$",
            "expected": {
                "aggressive": ["api.funilrys.com"],
                "standard": ["api.funilrys.com"], # should be null because we are targeting a specific file/folder
            },
        },
        {
            "subject": "||api.example.com/papi/action$popup",
            "expected": {
                "aggressive": ["api.example.com"],
                "standard": ["api.example.com"], # should be null because we are targeting a specific request type
            },
        },
        {
            "subject": "||funilrys.github.io$script,image",
            "expected": {
                "aggressive": ["funilrys.github.io"],
                "standard": ["funilrys.github.io"], # should be null because we are targeting a specific request type
            },
        },
        {
            "subject": "||example.net^$script,image",
            "expected": {
                "aggressive": ["example.net"],
                "standard": ["example.net"], # should be null because we are targeting a specific request type
            },
        },
        {
            "subject": "||static.hello.world.examoke.org/*/exit-banner.js",
            "expected": {
                "aggressive": ["static.hello.world.examoke.org"],
                "standard": ["static.hello.world.examoke.org"], # should be null because we are targeting a specific file
            },
        },
        {
            "subject": "$domain=exam.pl|elpmaxe.pl|example.pl",
            "expected": {
                "aggressive": ["elpmaxe.pl", "exam.pl", "example.pl"],
                "standard": [],
            },
        },
        {
            "subject": "||example.de^helloworld.com", # unlikely scenario to have a similar filter case
            "expected": {
                "aggressive": ["example.de"],
                "standard": ["example.de"],
            },
        },
        {
            "subject": "|github.io|", # unlikely scenario
            "expected": {"aggressive": ["github.io"], "standard": ["github.io"]},
        },
        {
            "subject": "~github.com,hello.world##.wrapper",
            "expected": {"aggressive": ["github.com", "hello.world"], "standard": []},
        },
        {
            "subject": "bing.com,bingo.com#@##adBanner",
            "expected": {"aggressive": ["bing.com", "bingo.com"], "standard": []},
        },
        {
            "subject": "example.org#@##test",
            "expected": {"aggressive": ["example.org"], "standard": []},
        },
        {
            "subject": "hubgit.com|oohay.com|ipa.elloh.dlorw#@#awesomeWorld", # incorrect filter (for element hiding rules, domains are separated with commas)
            "expected": {
                "aggressive": ["hubgit.com|oohay.com|ipa.elloh.dlorw"],
                "standard": [],
            },
        },
        {"subject": ".com", "expected": {"aggressive": [], "standard": []}},
        {
            "subject": "||ggggggggggg.gq^$all",
            "expected": {
                "aggressive": ["ggggggggggg.gq"],
                "standard": ["ggggggggggg.gq"],
            },
        },
        {
            "subject": "facebook.com##.search",
            "expected": {"aggressive": ["facebook.com"], "standard": []},
        },
        {
            "subject": "||test.hello.world^$domain=hello.world",
            "expected": {
                "aggressive": ["hello.world", "test.hello.world"],
                "standard": ["test.hello.world"], # should be null because the filter is applied to a specific website
            },
        },
        {
            "subject": "||examplae.com",
            "expected": {"aggressive": ["examplae.com"], "standard": ["examplae.com"]},
        },
        {
            "subject": "||examplbe.com^",
            "expected": {"aggressive": ["examplbe.com"], "standard": ["examplbe.com"]},
        },
        {
            "subject": "||examplce.com$third-party",
            "expected": {"aggressive": ["examplce.com"], "standard": ["examplce.com"]},
        },
        {
            "subject": "||examplde.com^$third-party",
            "expected": {"aggressive": ["examplde.com"], "standard": ["examplde.com"]},
        },
        {
            "subject": '##[href^="https://examplee.com/"]',
            "expected": {"aggressive": ["examplee.com"], "standard": []},
        },
        {
            "subject": "||examplfe.com^examplge.com", # same as the case in the line 103
            "expected": {"aggressive": ["examplfe.com"], "standard": ["examplfe.com"]},
        },
        {
            "subject": "||examplhe.com$script,image", # same as the case in the line 56 and 84
            "expected": {"aggressive": ["examplhe.com"], "standard": ["examplhe.com"]},
        },
        {
            "subject": "||examplie.com^$domain=domain1.com|domain2.com",
            "expected": {
                "aggressive": [
                    "domain1.com",
                    "domain2.com",
                    "examplie.com",
                ],
                "standard": ["examplie.com"], # should be null because the filter is applied to specific websites
            },
        },
        {
            "subject": 'examlple.com##[href^="http://hello.world."], '
            '[href^="http://example.net/"]',
            "expected": {
                "aggressive": ["examlple.com", "example.net", "hello.world."],
                "standard": [],
            },
        },
        {"subject": "##.ad-href1", "expected": {"aggressive": [], "standard": []}},
        {
            "subject": "^hello^$domain=example.com",
            "expected": {"standard": [], "aggressive": ["example.com"]},
        },
        {
            "subject": "hello$domain=example.net|example.com",
            "expected": {"standard": [], "aggressive": ["example.com", "example.net"]},
        },
        {
            "subject": "hello^$domain=example.org|example.com|example.net",
            "expected": {
                "standard": [],
                "aggressive": ["example.com", "example.net", "example.org"],
            },
        },
        {
            "subject": "|http://example.org/hello-world^$scripts,image",
            "expected": {
                "aggressive": ["example.org"],
                "standard": ["example.org"], # should be null because we are targeting a specific file/folder for a specific request type
            },
        },
        {
            "subject": "|http://example.org/*",
            "expected": {"aggressive": ["example.org"], "standard": ["example.org"]},
        },
        {
            "subject": "|http://example.org^",
            "expected": {"aggressive": ["example.org"], "standard": ["example.org"]},
        },
        {
            "subject": "|http://example.org",
            "expected": {"aggressive": ["example.org"], "standard": ["example.org"]},
        },
        {
            "subject": "|https://example.org/^$domain=example.com",
            "expected": {
                "aggressive": ["example.com", "example.org"],
                "standard": ["example.org"], # should be null because the filter is applied to a specific website
            },
        },
        {
            "subject": "|ftp://example.org$domain=example.com|example.net",
            "expected": {
                "aggressive": ["example.com", "example.net", "example.org"],
                "standard": ["example.org"], # should be null because the filter is applied to specific websites
            },
        },
        {
            "subject": "|http://example.com$script,image,domain=example.org|foo.example.net",
            "expected": {
                "aggressive": ["example.com", "example.org", "foo.example.net"],
                "standard": ["example.com"], # should be null because the filter is applied to specific websites
            },
        },
        {
            "subject": "|http://example.com,https://example.de$script,image,domain=example.org|foo.example.net", # incorrect filter (not possible to block many sites in the same filter or we have to use a regex rule)
            "expected": {
                "aggressive": [
                    "example.com",
                    "example.de",
                    "example.org",
                    "foo.example.net",
                ],
                "standard": ["example.com", "example.de"],
            },
        },
    ]

Yuki2718 commented 1 year ago

If it only scans for rules that block an entire domain, the --adblock option does not make much sense. What we expect from PF with --adblock is to check the status of the domains in all ABP (and AG, uBO, etc.) rules. I found that even the ##[href^="https://funceble.funilrys.com/"] case helps to pick up potentially obsolete rules (though, of course, a dead href does not generally mean the rule is obsolete).

smed79 commented 1 year ago

If so, then I misunderstood the tool. That is why I asked above what the intended behavior is (2nd).

:confused: I thought we could use the tool to convert an adblock list into a blacklist hosts file (Case 2).