Open smed79 opened 1 year ago
@smed79 please review the testcases before I deploy/release my change: https://github.com/funilrys/PyFunceble/commit/d32914b7dd1381f47e7e96d76458e986c3550cf0#diff-6fbb548d14d904b48cdaa09ea8c1ca04249d69cef0763217ac957605c50548a6R278-R326
Let me know if I missed a test case.
Stay safe and healthy! Thank you for your patience.
1st,
|http://example.com,https://example.de$script,image,domain=example.org|foo.example.net
There is no such case in Adblock (Plus) syntax.
Blocked requests (files or domains) cannot be separated by commas, so the correct syntax has to be
|http://example.com$script,image,domain=example.org|foo.example.net
|https://example.de$script,image,domain=example.org|foo.example.net
or
||example.com$script,image,domain=example.org|foo.example.net
||example.de$script,image,domain=example.org|foo.example.net
or we have to use a regex rule, as below
/^https?:\/\/(example\.com|example\.de)\//$script,image,domain=example.org|foo.example.net
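To illustrate the regex variant, here is a small Python sketch that applies the body of that rule (the part between the two delimiting slashes) to a few URLs. The sample URLs and the helper name is_blocked are only assumptions for illustration, and a real blocker would also evaluate the options after the $ sign.

```python
import re

# Body of the regex filter, i.e. the text between the two slashes that
# delimit it in the filter list. The "$script,image,domain=..." options
# are not part of the pattern and are ignored in this sketch.
PATTERN = re.compile(r"^https?:\/\/(example\.com|example\.de)\/")


def is_blocked(url: str) -> bool:
    """Return True when the URL matches the regex filter body."""
    return PATTERN.search(url) is not None


if __name__ == "__main__":
    for url in (
        "http://example.com/ads.js",
        "https://example.de/banner.png",
        "https://example.org/ads.js",
    ):
        print(url, is_blocked(url))
```

One regex rule covers both hosts here, which is what the comma-separated form was trying to express.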
2nd,
Excuse my ignorance, I have a question ...
A set of tools for the decoding and conversion of AdBlock and filter lists. (https://github.com/PyFunceble/adblock-decoder#adblock-filter-list-decoder)
what is the intended behavior ?
extracting all domains for testing purpose (ACTIVE, INACTIVE or INVALID) <-- Case 1
or
extracting only domains that are safe to be blocked ? <-- Case 2
for the second case (safe), the tool has to extract only domains that are flagged with the third-party option or limited with the symbol ^ at the end.
||axample.com^$third-party
||ads.example.net^
I mean
$ grep -E "^\|\|[a-z0-9.-]+\^([\$]third-party)?$" adblock.list
more aggressive, including popup filters
||axample.com^$popup,third-party
||axample.com^$popup
grep -E "^\|\|[a-z0-9.-]+\^([$](popup([,](third-party)?)?|third-party))?$" adblock.list
clean output (with 0.0.0.0)
grep -E "^\|\|[a-z0-9.-]+\^([$](popup([,](third-party)?)?|third-party))?$" adblock.list | sed 's/\^.*//' | sed 's/||/0.0.0.0 /' > hosts.list
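For anyone who prefers Python over the grep/sed pipeline, the same idea could be sketched roughly as below. This is not how PyFunceble or adblock-decoder do it; the names SAFE_RULE and to_hosts_lines are made up for the example. The pattern is also slightly stricter than the grep one: it does not accept a dangling comma after popup.

```python
import re

# Same shape as the grep pattern: "||domain^" optionally followed by
# "$popup", "$popup,third-party" or "$third-party".
SAFE_RULE = re.compile(
    r"^\|\|([a-z0-9.-]+)\^(\$(popup(,third-party)?|third-party))?$"
)


def to_hosts_lines(filter_lines):
    """Yield "0.0.0.0 domain" for every rule that blocks a whole domain."""
    for line in filter_lines:
        match = SAFE_RULE.match(line.strip())
        if match:
            yield f"0.0.0.0 {match.group(1)}"


if __name__ == "__main__":
    sample = [
        "||axample.com^$third-party",
        "||ads.example.net^",
        "||axample.com^$popup",
        "||ttt.com/*/ban.js",
        "||eee.com/*$image,domain=fff.com",
        "example.org##.ad",
    ]
    print("\n".join(to_hosts_lines(sample)))
```

Only the first three sample rules produce a hosts entry; the path-based, domain-restricted and cosmetic rules are skipped.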
@smed79, I don't create such complex lists on my own, so I'm happy to have input from the community.
1st,
|http://example.com,https://example.de$script,image,domain=example.org|foo.example.net
There is no such case in Adblock (Plus) syntax.
That's good to know. Will be fixed.
what is the intended behavior ?
Actually both. But I'm willing to make some changes. Please keep in mind that the adblock-decoder actually is a wrapper around the functionalities of PyFunceble.
What you describe as Case 1 is the behavior of the aggressive mode, whereas Case 2 should be the default behavior of PyFunceble.
for the second case (safe), the tool has to extract only domains that are flagged with the third-party option or limited with the symbol ^ at the end.
That's interesting. If everyone (cc: @Yuki2718 | @ryanbr | please flag others) agrees on that, I can only see improvement.
I (and probably the community too) will be grateful if you could take the time to check the test cases and let me know:
I'll then follow up with a complete rewrite of the decoder module.
TBH I don't understand what the issue is. I see ?adver=123.456 and .ppp are invalid, but don't know why the expected result is kkk.lll and ads.mmm.nnn only - what's wrong with extracting bbb.com? Seeing the test case, github.com and hello.world should be extracted from ~github.com,hello.world##.wrapper by default and be checked for their status. I have long assumed that's the default behavior of PF, but it isn't? Sure, comments should be skipped, and ##[href^="https://funceble.funilrys.com/"] is something in between - personally I want this to be scanned as well, but probably it should be optional, so aggressive-only makes sense.
So I changed adblock_aggressive to true and scanned, then found it returns more domains than before, all of which come from cosmetic filters.
@funilrys As said above, github.com and hello.world should be extracted from ~github.com,hello.world##.wrapper by default. Also, can you add a command line argument --adblock_aggressive so that the aggressive mode can be used without editing YAML files?
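To make the request concrete, this is roughly what I mean by extracting the domains in front of a cosmetic filter. It is only a sketch in plain Python, not PyFunceble code; cosmetic_domains is a hypothetical helper, and it does no validity check on the extracted names (that would be the job of the status check afterwards).

```python
import re

# Separators used by element hiding rules ("##") and their exceptions ("#@#").
COSMETIC_SEPARATOR = re.compile(r"#@?#")


def cosmetic_domains(rule: str) -> list:
    """Return the sorted domains listed in front of a cosmetic filter."""
    if rule.startswith("!"):
        # Comment line: nothing to extract.
        return []
    parts = COSMETIC_SEPARATOR.split(rule, maxsplit=1)
    if len(parts) != 2 or not parts[0]:
        # Not a cosmetic filter, or a generic one without a domain list.
        return []
    # "~" marks an excluded domain; it is still a domain worth checking.
    domains = {item.strip().lstrip("~") for item in parts[0].split(",")}
    return sorted(d for d in domains if d)


if __name__ == "__main__":
    print(cosmetic_domains("~github.com,hello.world##.wrapper"))
    print(cosmetic_domains("bing.com,bingo.com#@##adBanner"))
    print(cosmetic_domains("##.ad-href1"))
```

Comment lines and generic rules such as ##.ad-href1 yield an empty list, so nothing is scanned for them.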
but don't know why the expected result is kkk.lll ads.mmm.nnn
In the other cases we are targeting a specific file¹, folder², request type³ (image, script ...) or applying the filter for a specific⁴ website.
||ttt.com/*/ban.js <--¹
||sss.com^*/img/ <--²
||uuu.com/*$script <--³
||eee.com/*$image,domain=fff.com <--⁴
So for the above examples, the output should not include ttt.com, sss.com, uuu.com, eee.com in the default behavior.
If an adblock list has the filter ||www.google.com/ads/* we will not block www.google.com in our hosts file.
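The four cases above could be told apart with something like the following sketch. It only handles rules anchored with ||, it treats third-party as the only harmless option (an assumption that can be changed through the harmless_options parameter), and the function names are invented for this example.

```python
def split_rule(rule):
    """Split a network filter into (pattern, list of options)."""
    pattern, _, raw_options = rule.partition("$")
    options = [option for option in raw_options.split(",") if option]
    return pattern, options


def restriction(rule, harmless_options=("third-party",)):
    """Name why a "||" rule is narrower than a whole domain, or return None."""
    pattern, options = split_rule(rule)
    if any(option.startswith("domain=") for option in options):
        return "specific website"
    if any(option not in harmless_options for option in options):
        return "request type"
    # Strip the "||" anchor and a trailing "^" before looking for a path.
    host_part = pattern.lstrip("|").rstrip("^")
    if "/" in host_part or "*" in host_part:
        return "file or folder"
    return None


if __name__ == "__main__":
    for rule in (
        "||ttt.com/*/ban.js",
        "||sss.com^*/img/",
        "||uuu.com/*$script",
        "||eee.com/*$image,domain=fff.com",
        "||ads.example.net^",
    ):
        print(rule, "->", restriction(rule))
```

A rule is a candidate for the hosts file only when restriction() returns None.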
please flag others
@mapx- @okiehsch @Alex-302 @AdamWr @Khrin any test/comment will be appreciated (sure if you have some free time).
@funilrys see my comments after the # sign
[
{
"subject": '##[href^="https://funceble.funilrys.com/"]',
"expected": {
"aggressive": ["funceble.funilrys.com"],
"standard": [],
},
},
{
"subject": "||test.hello.world^$domain=hello.world",
"expected": {
"aggressive": ["hello.world", "test.hello.world"],
"standard": ["test.hello.world"], # should be null because the filter is applied to a specific website
},
},
{
"subject": '##div[href^="http://funilrys.com/"]',
"expected": {"aggressive": ["funilrys.com"], "standard": []},
},
{
"subject": 'com##[href^="ftp://funceble.funilrys-funceble.com/"]',
"expected": {
"aggressive": ["funceble.funilrys-funceble.com"],
"standard": [],
},
},
{
"subject": "!@@||funceble.world/js",
"expected": {"aggressive": [], "standard": []},
},
{
"subject": "!||world.hello/*ad.xml",
"expected": {"aggressive": [], "standard": []},
},
{
"subject": "!funilrys.com##body",
"expected": {"aggressive": [], "standard": []},
},
{
"subject": "[AdBlock Plus 2.0]",
"expected": {"aggressive": [], "standard": []},
},
{
"subject": "@@||ads.example.com/notbanner^$~script",
"expected": {"aggressive": ["ads.example.com"], "standard": []},
},
{"subject": "/banner/*/img^", "expected": {"aggressive": [], "standard": []}},
{
"subject": "||ad.example.co.uk^",
"expected": {
"aggressive": ["ad.example.co.uk"],
"standard": ["ad.example.co.uk"],
},
},
{
"subject": "||ad.example.fr^$image,test",
"expected": {
"aggressive": ["ad.example.fr"],
"standard": ["ad.example.fr"], # should be null because we are targeting a specific request type
},
},
{
"subject": "||api.funilrys.com/widget/$",
"expected": {
"aggressive": ["api.funilrys.com"],
"standard": ["api.funilrys.com"], # should be null because we are targeting a specific file/folder
},
},
{
"subject": "||api.example.com/papi/action$popup",
"expected": {
"aggressive": ["api.example.com"],
"standard": ["api.example.com"], # should be null because we are targeting a specific request type
},
},
{
"subject": "||funilrys.github.io$script,image",
"expected": {
"aggressive": ["funilrys.github.io"],
"standard": ["funilrys.github.io"], # should be null because we are targeting a specific request type
},
},
{
"subject": "||example.net^$script,image",
"expected": {"aggressive": ["example.net"],
"standard": ["example.net"]}, # should be null because we are targeting a specific request type
},
{
"subject": "||static.hello.world.examoke.org/*/exit-banner.js",
"expected": {
"aggressive": ["static.hello.world.examoke.org"],
"standard": ["static.hello.world.examoke.org"], # should be null because we are targeting a specific file
},
},
{
"subject": "$domain=exam.pl|elpmaxe.pl|example.pl",
"expected": {
"aggressive": ["elpmaxe.pl", "exam.pl", "example.pl"],
"standard": [],
},
},
{
"subject": "||example.de^helloworld.com", # unlikely scenario to have a similar filter case
"expected": {
"aggressive": ["example.de"],
"standard": ["example.de"],
},
},
{
"subject": "|github.io|", # unlikely scenario
"expected": {"aggressive": ["github.io"], "standard": ["github.io"]},
},
{
"subject": "~github.com,hello.world##.wrapper",
"expected": {"aggressive": ["github.com", "hello.world"], "standard": []},
},
{
"subject": "bing.com,bingo.com#@##adBanner",
"expected": {"aggressive": ["bing.com", "bingo.com"], "standard": []},
},
{
"subject": "example.org#@##test",
"expected": {"aggressive": ["example.org"], "standard": []},
},
{
"subject": "hubgit.com|oohay.com|ipa.elloh.dlorw#@#awesomeWorld", # incorrect filter (for element hiding rules, domains are separated with commas)
"expected": {
"aggressive": ["hubgit.com|oohay.com|ipa.elloh.dlorw"],
"standard": [],
},
},
{"subject": ".com", "expected": {"aggressive": [], "standard": []}},
{
"subject": "||ggggggggggg.gq^$all",
"expected": {
"aggressive": ["ggggggggggg.gq"],
"standard": ["ggggggggggg.gq"],
},
},
{
"subject": "facebook.com##.search",
"expected": {"aggressive": ["facebook.com"], "standard": []},
},
{
"subject": "||test.hello.world^$domain=hello.world",
"expected": {
"aggressive": ["hello.world", "test.hello.world"],
"standard": ["test.hello.world"], # should be null because the filter is applied to a specific website
},
},
{
"subject": "||examplae.com",
"expected": {"aggressive": ["examplae.com"], "standard": ["examplae.com"]},
},
{
"subject": "||examplbe.com^",
"expected": {"aggressive": ["examplbe.com"], "standard": ["examplbe.com"]},
},
{
"subject": "||examplce.com$third-party",
"expected": {"aggressive": ["examplce.com"], "standard": ["examplce.com"]},
},
{
"subject": "||examplde.com^$third-party",
"expected": {"aggressive": ["examplde.com"], "standard": ["examplde.com"]},
},
{
"subject": '##[href^="https://examplee.com/"]',
"expected": {"aggressive": ["examplee.com"], "standard": []},
},
{
"subject": "||examplfe.com^examplge.com", # same as the case in the line 103
"expected": {"aggressive": ["examplfe.com"], "standard": ["examplfe.com"]},
},
{
"subject": "||examplhe.com$script,image", # same as the case in the line 56 and 84
"expected": {"aggressive": ["examplhe.com"], "standard": ["examplhe.com"]},
},
{
"subject": "||examplie.com^$domain=domain1.com|domain2.com",
"expected": {
"aggressive": [
"domain1.com",
"domain2.com",
"examplie.com",
],
"standard": ["examplie.com"], # should be null because the filter is applied to specific websites
},
},
{
"subject": 'examlple.com##[href^="http://hello.world."], '
'[href^="http://example.net/"]',
"expected": {
"aggressive": ["examlple.com", "example.net", "hello.world."],
"standard": [],
},
},
{"subject": "##.ad-href1", "expected": {"aggressive": [], "standard": []}},
{
"subject": "^hello^$domain=example.com",
"expected": {"standard": [], "aggressive": ["example.com"]},
},
{
"subject": "hello$domain=example.net|example.com",
"expected": {"standard": [], "aggressive": ["example.com", "example.net"]},
},
{
"subject": "hello^$domain=example.org|example.com|example.net",
"expected": {
"standard": [],
"aggressive": ["example.com", "example.net", "example.org"],
},
},
{
"subject": "|http://example.org/hello-world^$scripts,image",
"expected": {"aggressive": ["example.org"],
"standard": ["example.org"]}, # should be null because we are targeting a specific file/folder for a specific request type
},
{
"subject": "|http://example.org/*",
"expected": {"aggressive": ["example.org"], "standard": ["example.org"]},
},
{
"subject": "|http://example.org^",
"expected": {"aggressive": ["example.org"], "standard": ["example.org"]},
},
{
"subject": "|http://example.org",
"expected": {"aggressive": ["example.org"], "standard": ["example.org"]},
},
{
"subject": "|https://example.org/^$domain=example.com",
"expected": {
"aggressive": ["example.com", "example.org"],
"standard": ["example.org"], # should be null because the filter is applied to a specific website
},
},
{
"subject": "|ftp://example.org$domain=example.com|example.net",
"expected": {
"aggressive": ["example.com", "example.net", "example.org"],
"standard": ["example.org"], # should be null because the filter is applied to specific websites
},
},
{
"subject": "|http://example.com$script,image,domain=example.org|foo.example.net",
"expected": {
"aggressive": ["example.com", "example.org", "foo.example.net"],
"standard": ["example.com"], # should be null because the filter is applied to specific websites
},
},
{
"subject": "|http://example.com,https://example.de$script,image,domain=example.org|foo.example.net", # incorrect filter (not possible to block many sites in the same filter or we have to use a regex rule)
"expected": {
"aggressive": [
"example.com",
"example.de",
"example.org",
"foo.example.net",
],
"standard": ["example.com", "example.de"],
},
},
]
If it only scans for rules that block the entire domain, the --adblock option does not make much sense. What we expect from PF with --adblock is to check the status of domains in all ABP (and AG, uBO, etc.) rules. I found that even the ##[href^="https://funceble.funilrys.com/"] case helps to pick up potentially obsolete rules (sure, generally a dead href does not mean the rule is obsolete though).
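For the href case, pulling the host out of the selector could look like the sketch below. This is only my illustration with the standard library, not what PF does internally; href_hosts and HREF_VALUE are invented names, and only double-quoted href values are handled.

```python
import re
from urllib.parse import urlsplit

# Matches [href="..."] and the operator variants such as [href^="..."].
HREF_VALUE = re.compile(r'\[href[\^*$~|]?="([^"]+)"\]')


def href_hosts(rule):
    """Return the sorted host names found in href selectors of a cosmetic rule."""
    hosts = set()
    for url in HREF_VALUE.findall(rule):
        host = urlsplit(url).hostname
        if host:
            hosts.add(host)
    return sorted(hosts)


if __name__ == "__main__":
    print(href_hosts('##[href^="https://funceble.funilrys.com/"]'))
    print(href_hosts('##div[href^="http://funilrys.com/"]'))
    print(href_hosts("##.ad-href1"))
```

The returned hosts can then be status-checked like any other domain, keeping in mind that a dead href alone does not prove the rule is obsolete.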
If so, then I misunderstood the tool. That is why I asked above what the intended behavior is (2nd).
:confused: I thought we could use the tool to convert an adblock list to a blacklist hosts file (Case 2).
Testing the below list
Output
The expected output should be