Invalid output - Githubissues

Testing the below list

||com/*?adver=123.456
||aaa.*/ads1/
||aaa.*/ads2/
||bbb.com^*ads3
||ccc.com^*ads4
||ddd.com*.ads.com
||eee.com/*$image,domain=fff.com
||ggg.hhh-*
||ggg.hhh-*
||ggg.hhh*iii-jjj
|http://kkk.com/ads/*
|https://kkk.lll/*
|https://ads.mmm.nnn^
||ooo.com/*.ppp
||qqq.com/img1*.ads8
||qqq.com/img2/*.ads9
||qqq.com/img3/.*./ads0
||qqq.com/img4/*.*/
||qqq.com/img5/*..*/
||rrr.com/*.php?123
||sss.com^*/img/
||ttt.com/*/ban.js
||uuu.com/*$script

Output

?adver=123.456
aaa.
aaa.
bbb.com
ccc.com
.ads.com
ddd.com
eee.com
ggg.hhh-
ggg.hhh-
ggg.hhh
.ppp
ooo.com
.ads8
qqq.com
.ads9
qqq.com
.
qqq.com
.
qqq.com
..
qqq.com
.php
rrr.com
sss.com
ttt.com
uuu.com

The expected output should be

kkk.lll
ads.mmm.nnn

@smed79 please review the testcases before I deploy/release my change: https://github.com/funilrys/PyFunceble/commit/d32914b7dd1381f47e7e96d76458e986c3550cf0#diff-6fbb548d14d904b48cdaa09ea8c1ca04249d69cef0763217ac957605c50548a6R278-R326

Let me know if I missed a test case.

Stay safe and healthy! Thank you for your patience.

1st,

|http://example.com,https://example.de$script,image,domain=example.org|foo.example.net

There is no such case in Adblock (Plus) syntax.

Blocked requests (files or domains) cannot be separated by commas, so the correct syntax can only be

|http://example.com$script,image,domain=example.org|foo.example.net
|https://example.de$script,image,domain=example.org|foo.example.net

||example.com$script,image,domain=example.org|foo.example.net
||example.de$script,image,domain=example.org|foo.example.net

or we have to use a regex rule, as below

/^https?:\/\/(example\.com|example\.de)\//$script,image,domain=example.org|foo.example.net
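As a rough illustration of why the regex rule above covers both sites, here is a minimal Python sketch. It assumes the filter's enclosing slashes and the `$script,image,domain=...` options are stripped before the pattern is compiled; `matched_host` and the sample URLs are hypothetical names used only for this example, not part of any adblock engine.

```python
import re

# Pattern body copied from the regex filter above, without the enclosing
# slashes and without the "$script,image,domain=..." options.
RULE_PATTERN = re.compile(r"^https?:\/\/(example\.com|example\.de)\/")


def matched_host(url):
    """Return the host the rule matched, or None when the URL is not covered."""
    match = RULE_PATTERN.match(url)
    return match.group(1) if match else None


# Hypothetical request URLs, for illustration only.
urls = [
    "http://example.com/ad.js",
    "https://example.de/banner.png",
    "https://example.org/ad.js",
]
results = {url: matched_host(url) for url in urls}
print(results)
```

Only the first two URLs are covered, which is the behavior the comma-separated filter was trying (and failing) to express.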

2nd,

Excuse my ignorance, I have a question...

A set of tools for the decoding and conversion of AdBlock and filter lists. (https://github.com/PyFunceble/adblock-decoder#adblock-filter-list-decoder)

What is the intended behavior?

Extracting all domains for testing purposes (ACTIVE, INACTIVE or INVALID)? <-- Case 1

Extracting only domains that are safe to be blocked? <-- Case 2

For the second case (safe), the tool has to extract only domains that are flagged with the third-party option or limited with the symbol ^ at the end.

||axample.com^$third-party
||ads.example.net^

I mean

$ grep -E "^\|\|[a-z0-9.-]+\^([\$]third-party)?$" adblock.list

More aggressive, including popup filters:

||axample.com^$popup,third-party
||axample.com^$popup

grep -E "^\|\|[a-z0-9.-]+\^([$](popup([,](third-party)?)?|third-party))?$" adblock.list

clean output (with 0.0.0.0)

grep -E "^\|\|[a-z0-9.-]+\^([$](popup([,](third-party)?)?|third-party))?$" adblock.list | sed 's/\^.*//' | sed 's/||/0.0.0.0 /' > hosts.list
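For anyone who prefers to run this outside a shell, the grep/sed pipeline above can be approximated in Python. This is only a sketch under stated assumptions: `SAFE_RULE` and `to_hosts` are hypothetical names, the pattern is slightly stricter than the quoted one (it does not accept a bare trailing comma after `popup`), and it is not how PyFunceble or adblock-decoder work internally.

```python
import re

# Approximates the grep -E pattern above: "||host^", optionally followed by
# "$third-party", "$popup" or "$popup,third-party". Assumption: a bare
# "$popup," (trailing comma) is rejected here, unlike in the quoted pattern.
SAFE_RULE = re.compile(
    r"^\|\|([a-z0-9.-]+)\^(?:\$(?:popup(?:,third-party)?|third-party))?$"
)


def to_hosts(lines, sink_ip="0.0.0.0"):
    """Return hosts-file entries for rules that block a whole domain."""
    entries = []
    for line in lines:
        match = SAFE_RULE.match(line.strip())
        if match:
            entries.append(f"{sink_ip} {match.group(1)}")
    return entries


sample = [
    "||axample.com^$third-party",
    "||ads.example.net^",
    "||axample.com^$popup,third-party",
    "||uuu.com/*$script",                # request-type rule: skipped
    "||eee.com/*$image,domain=fff.com",  # site-specific rule: skipped
]
hosts = to_hosts(sample)
print("\n".join(hosts))
```

Like the shell version, it does not deduplicate entries; piping through a set (or `sort -u`) would be needed for a clean hosts file.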

@smed79, I don't create such complex lists on my own, so I'm happy to have input from the community.

1st,

|http://example.com,https://example.de$script,image,domain=example.org|foo.example.net

There is no such case in Adblock (Plus) syntax.

That's good to know. Will be fixed.

what is the intended behavior ?

Actually both. But I'm willing to make some changes. Please keep in mind that the adblock-decoder actually is a wrapper around the functionalities of PyFunceble.

What you describe as Case 1 is the behavior of the aggressive mode, whereas Case 2 should be the default behavior of PyFunceble.

for the second case (safe), the tool has to extract only domains that are flagged with the third-party option or limited with the symbol ^ at the end.

That's interesting. If everyone (cc: @Yuki2718 | @ryanbr | please flag others) agrees on that, I can only see improvement.

I (and probably the community too) will be grateful if you could take the time to check the test cases and let me know:

what is wrong
what should be changed
what should be ignored
what is missing

I'll then follow up with a complete rewrite of the decoder module.

TBH I don't understand what the issue is. I see that ?adver=123.456 and .ppp are invalid, but don't know why the expected result is

kkk.lll
ads.mmm.nnn

only - what's wrong with extracting bbb.com? Looking at the test cases, github.com and hello.world should be extracted from ~github.com,hello.world##.wrapper by default and be checked for their status. I have long assumed that's the default behavior of PF, but it isn't? Sure, comments should be skipped, and ##[href^="https://funceble.funilrys.com/"] is something in between: personally I want this to be scanned as well, but it should probably be optional, so aggressive-only makes sense.

So I changed adblock_aggressive to true and scanned, then found that it returns more domains than before, all of them from cosmetic filters. @funilrys As said above, github.com and hello.world should be extracted from ~github.com,hello.world##.wrapper by default. Also, can you add a command line argument --adblock_aggressive so that the aggressive mode can be used without editing YAML files?

but don't know why the expected result is
kkk.lll
ads.mmm.nnn

In the other cases we are targeting a specific file¹, folder², request type³ (image, script ...) or applying the filter for a specific⁴ website.

||ttt.com/*/ban.js <--¹
||sss.com^*/img/ <--²
||uuu.com/*$script <--³
||eee.com/*$image,domain=fff.com <--⁴

So in the above examples, the output should not include ttt.com, sss.com, uuu.com or eee.com in the default behavior.

If an adblock list has the filter ||www.google.com/ads/*, we will not block www.google.com in our hosts file.
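The four exclusion reasons above (specific file, folder, request type, specific website) could be sketched as a single check. This is a hedged illustration, not the decoder's actual logic: `whole_domain`, `WHOLE_DOMAIN` and `ALLOWED_OPTIONS` are hypothetical names, and treating only `third-party` and `all` as harmless options is an assumption drawn from this thread.

```python
import re

HOST = r"[a-z0-9-]+(?:\.[a-z0-9-]+)+"
# "||host" or "|scheme://host", then nothing, "^" or "/*", then optional
# "$options". Anything else after the host (a path, a wildcard) is rejected.
WHOLE_DOMAIN = re.compile(
    rf"^\|(?:\||[a-z]+://)({HOST})(?:\^|/\*)?(?:\$(.*))?$"
)
# Assumption: only these options leave the rule domain-wide.
ALLOWED_OPTIONS = {"third-party", "all"}


def whole_domain(rule):
    """Return the host if the rule blocks the entire domain, else None."""
    match = WHOLE_DOMAIN.match(rule.strip())
    if match is None:
        return None  # targets a specific file or folder, or is not a host rule
    options = set(filter(None, (match.group(2) or "").split(",")))
    if not options <= ALLOWED_OPTIONS:
        return None  # limited to a request type or to a specific website
    return match.group(1)


rules = [
    "||ttt.com/*/ban.js",                # specific file
    "||sss.com^*/img/",                  # specific folder
    "||uuu.com/*$script",                # specific request type
    "||eee.com/*$image,domain=fff.com",  # specific website
    "||www.google.com/ads/*",
    "|https://kkk.lll/*",
    "|https://ads.mmm.nnn^",
]
extracted = [host for host in map(whole_domain, rules) if host]
print(extracted)
```

Under these assumptions, only kkk.lll and ads.mmm.nnn survive, which matches the expected output given at the top of the issue.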

please flag others

@mapx- @okiehsch @Alex-302 @AdamWr @Khrin any test/comment will be appreciated (sure if you have some free time).

@funilrys see my comments after the # sign

        {
            "subject": '##[href^="https://funceble.funilrys.com/"]',
            "expected": {
                "aggressive": ["funceble.funilrys.com"],
                "standard": [],
            },
        },
        {
            "subject": "||test.hello.world^$domain=hello.world",
            "expected": {
                "aggressive": ["hello.world", "test.hello.world"],
                "standard": ["test.hello.world"], # should be null because the filter is applied to a specific website
            },
        },
        {
            "subject": '##div[href^="http://funilrys.com/"]',
            "expected": {"aggressive": ["funilrys.com"], "standard": []},
        },
        {
            "subject": 'com##[href^="ftp://funceble.funilrys-funceble.com/"]',
            "expected": {
                "aggressive": ["funceble.funilrys-funceble.com"],
                "standard": [],
            },
        },
        {
            "subject": "!@@||funceble.world/js",
            "expected": {"aggressive": [], "standard": []},
        },
        {
            "subject": "!||world.hello/*ad.xml",
            "expected": {"aggressive": [], "standard": []},
        },
        {
            "subject": "!funilrys.com##body",
            "expected": {"aggressive": [], "standard": []},
        },
        {
            "subject": "[AdBlock Plus 2.0]",
            "expected": {"aggressive": [], "standard": []},
        },
        {
            "subject": "@@||ads.example.com/notbanner^$~script",
            "expected": {"aggressive": ["ads.example.com"], "standard": []},
        },
        {"subject": "/banner/*/img^", "expected": {"aggressive": [], "standard": []}},
        {
            "subject": "||ad.example.co.uk^",
            "expected": {
                "aggressive": ["ad.example.co.uk"],
                "standard": ["ad.example.co.uk"],
            },
        },
        {
            "subject": "||ad.example.fr^$image,test",
            "expected": {
                "aggressive": ["ad.example.fr"],
                "standard": ["ad.example.fr"], # should be null because we are targeting a specific request type
            },
        },
        {
            "subject": "||api.funilrys.com/widget/$",
            "expected": {
                "aggressive": ["api.funilrys.com"],
                "standard": ["api.funilrys.com"], # should be null because we are targeting a specific file/folder
            },
        },
        {
            "subject": "||api.example.com/papi/action$popup",
            "expected": {
                "aggressive": ["api.example.com"],
                "standard": ["api.example.com"], # should be null because we are targeting a specific request type
            },
        },
        {
            "subject": "||funilrys.github.io$script,image",
            "expected": {
                "aggressive": ["funilrys.github.io"],
                "standard": ["funilrys.github.io"], # should be null because we are targeting a specific request type
            },
        },
        {
            "subject": "||example.net^$script,image",
            "expected": {
                "aggressive": ["example.net"],
                "standard": ["example.net"], # should be null because we are targeting a specific request type
            },
        },
        {
            "subject": "||static.hello.world.examoke.org/*/exit-banner.js",
            "expected": {
                "aggressive": ["static.hello.world.examoke.org"],
                "standard": ["static.hello.world.examoke.org"], # should be null because we are targeting a specific file
            },
        },
        {
            "subject": "$domain=exam.pl|elpmaxe.pl|example.pl",
            "expected": {
                "aggressive": ["elpmaxe.pl", "exam.pl", "example.pl"],
                "standard": [],
            },
        },
        {
            "subject": "||example.de^helloworld.com", # unlikely scenario to have a similar filter case
            "expected": {
                "aggressive": ["example.de"],
                "standard": ["example.de"],
            },
        },
        {
            "subject": "|github.io|", # unlikely scenario
            "expected": {"aggressive": ["github.io"], "standard": ["github.io"]},
        },
        {
            "subject": "~github.com,hello.world##.wrapper",
            "expected": {"aggressive": ["github.com", "hello.world"], "standard": []},
        },
        {
            "subject": "bing.com,bingo.com#@##adBanner",
            "expected": {"aggressive": ["bing.com", "bingo.com"], "standard": []},
        },
        {
            "subject": "example.org#@##test",
            "expected": {"aggressive": ["example.org"], "standard": []},
        },
        {
            "subject": "hubgit.com|oohay.com|ipa.elloh.dlorw#@#awesomeWorld", # incorrect filter (for element hiding rules, domains are separated with commas)
            "expected": {
                "aggressive": ["hubgit.com|oohay.com|ipa.elloh.dlorw"],
                "standard": [],
            },
        },
        {"subject": ".com", "expected": {"aggressive": [], "standard": []}},
        {
            "subject": "||ggggggggggg.gq^$all",
            "expected": {
                "aggressive": ["ggggggggggg.gq"],
                "standard": ["ggggggggggg.gq"],
            },
        },
        {
            "subject": "facebook.com##.search",
            "expected": {"aggressive": ["facebook.com"], "standard": []},
        },
        {
            "subject": "||test.hello.world^$domain=hello.world",
            "expected": {
                "aggressive": ["hello.world", "test.hello.world"],
                "standard": ["test.hello.world"], # should be null because the filter is applied to a specific website
            },
        },
        {
            "subject": "||examplae.com",
            "expected": {"aggressive": ["examplae.com"], "standard": ["examplae.com"]},
        },
        {
            "subject": "||examplbe.com^",
            "expected": {"aggressive": ["examplbe.com"], "standard": ["examplbe.com"]},
        },
        {
            "subject": "||examplce.com$third-party",
            "expected": {"aggressive": ["examplce.com"], "standard": ["examplce.com"]},
        },
        {
            "subject": "||examplde.com^$third-party",
            "expected": {"aggressive": ["examplde.com"], "standard": ["examplde.com"]},
        },
        {
            "subject": '##[href^="https://examplee.com/"]',
            "expected": {"aggressive": ["examplee.com"], "standard": []},
        },
        {
            "subject": "||examplfe.com^examplge.com", # same as the case in the line 103
            "expected": {"aggressive": ["examplfe.com"], "standard": ["examplfe.com"]},
        },
        {
            "subject": "||examplhe.com$script,image", # same as the case in the line 56 and 84
            "expected": {"aggressive": ["examplhe.com"], "standard": ["examplhe.com"]},
        },
        {
            "subject": "||examplie.com^$domain=domain1.com|domain2.com",
            "expected": {
                "aggressive": [
                    "domain1.com",
                    "domain2.com",
                    "examplie.com",
                ],
                "standard": ["examplie.com"], # should be null because the filter is applied to specific websites
            },
        },
        {
            "subject": 'examlple.com##[href^="http://hello.world."], '
            '[href^="http://example.net/"]',
            "expected": {
                "aggressive": ["examlple.com", "example.net", "hello.world."],
                "standard": [],
            },
        },
        {"subject": "##.ad-href1", "expected": {"aggressive": [], "standard": []}},
        {
            "subject": "^hello^$domain=example.com",
            "expected": {"standard": [], "aggressive": ["example.com"]},
        },
        {
            "subject": "hello$domain=example.net|example.com",
            "expected": {"standard": [], "aggressive": ["example.com", "example.net"]},
        },
        {
            "subject": "hello^$domain=example.org|example.com|example.net",
            "expected": {
                "standard": [],
                "aggressive": ["example.com", "example.net", "example.org"],
            },
        },
        {
            "subject": "|http://example.org/hello-world^$scripts,image",
            "expected": {
                "aggressive": ["example.org"],
                "standard": ["example.org"], # should be null because we are targeting a specific file/folder for a specific request type
            },
        },
        {
            "subject": "|http://example.org/*",
            "expected": {"aggressive": ["example.org"], "standard": ["example.org"]},
        },
        {
            "subject": "|http://example.org^",
            "expected": {"aggressive": ["example.org"], "standard": ["example.org"]},
        },
        {
            "subject": "|http://example.org",
            "expected": {"aggressive": ["example.org"], "standard": ["example.org"]},
        },
        {
            "subject": "|https://example.org/^$domain=example.com",
            "expected": {
                "aggressive": ["example.com", "example.org"],
                "standard": ["example.org"], # should be null because the filter is applied to a specific website
            },
        },
        {
            "subject": "|ftp://example.org$domain=example.com|example.net",
            "expected": {
                "aggressive": ["example.com", "example.net", "example.org"],
                "standard": ["example.org"], # should be null because the filter is applied to specific websites
            },
        },
        {
            "subject": "|http://example.com$script,image,domain=example.org|foo.example.net",
            "expected": {
                "aggressive": ["example.com", "example.org", "foo.example.net"],
                "standard": ["example.com"], # should be null because the filter is applied to specific websites
            },
        },
        {
            "subject": "|http://example.com,https://example.de$script,image,domain=example.org|foo.example.net", # incorrect filter (not possible to block many sites in the same filter or we have to use a regex rule)
            "expected": {
                "aggressive": [
                    "example.com",
                    "example.de",
                    "example.org",
                    "foo.example.net",
                ],
                "standard": ["example.com", "example.de"],
            },
        },
    ]
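To make review comments like the ones above checkable, the case list could be run through a small harness that reports every mode where a decoder disagrees with the expectations. This is a sketch only: `find_mismatches` and `toy_decode` are hypothetical, and `toy_decode` is a deliberately naive stand-in, not PyFunceble's decoder.

```python
def find_mismatches(decode, cases):
    """Return (subject, mode, got, expected) for every failing case/mode."""
    mismatches = []
    for case in cases:
        for mode, expected in case["expected"].items():
            got = sorted(decode(case["subject"], aggressive=(mode == "aggressive")))
            if got != sorted(expected):
                mismatches.append((case["subject"], mode, got, sorted(expected)))
    return mismatches


def toy_decode(rule, aggressive=False):
    # Stand-in decoder (assumption): only understands plain "||host^" rules
    # and ignores the aggressive flag entirely.
    if rule.startswith("||") and rule.endswith("^"):
        return [rule[2:-1]]
    return []


# Three cases copied from the list above.
cases = [
    {
        "subject": "||ad.example.co.uk^",
        "expected": {
            "aggressive": ["ad.example.co.uk"],
            "standard": ["ad.example.co.uk"],
        },
    },
    {
        "subject": "[AdBlock Plus 2.0]",
        "expected": {"aggressive": [], "standard": []},
    },
    {
        "subject": "facebook.com##.search",
        "expected": {"aggressive": ["facebook.com"], "standard": []},
    },
]

mismatches = find_mismatches(toy_decode, cases)
for subject, mode, got, expected in mismatches:
    print(f"{subject!r} [{mode}]: got {got}, expected {expected}")
```

Changing an expectation (for example, setting a "standard" list to empty as suggested in the comments) then shows immediately which rules a given decoder would need to treat differently.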

If it only scans for rules that block the entire domain, the --adblock option does not make much sense. What we expect from PF with --adblock is to check the status of domains in all ABP (and AG, uBO, etc.) rules. I found that even the ##[href^="https://funceble.funilrys.com/"] case helps to pick up potentially obsolete rules (sure, generally href being dead does not mean the rule is obsolete though).

If so, then it's a misunderstanding of the tool on my part. That's why I asked above what the intended behavior is (2nd).

:confused: I thought we could use the tool to convert an adblock list into a blacklist hosts file (Case 2).

PyFunceble / adblock-decoder

Invalid output #3