Closed funilrys closed 3 years ago
Caught another bug: this section of the original list contains rules that only remove elements from legit sites:
After PyFunceble filters the list, the end result in domain.list is:
Notice how legit sites are being blocked?
Okay, you have to explain AdBlock to me then @dnmTX :smile_cat: I'm not a big fan of it as its syntax is confusing.
So how do I tell a legit site from a bad one in AdBlock? I thought that AdBlock was only about blocking, not whitelisting :thinking:
OK...
I'll cover only the basics to keep it clear:
If you want to block a domain, you need to add || in front and ^ at the end (it will catch the subdomains as well).
If you want to block just an element on that website, you need to find it (Chrome dev tools help a lot with that) and add ## after the domain name, followed by the element's selector. Example:
Open yahoo.com
(I removed so many elements from that page you wouldn't recognize it).
Now look at your yahoo page and compare it to mine:
Much cleaner, no videos, no annoyances.
Rules examples:
yahoo.com###applet_p_50000278
yahoo.com###applet_p_32209491
yahoo.com###applet_p_50000277
yahoo.com###applet_p_63802
yahoo.com###applet_p_63796
yahoo.com###sticky-lrec2-footer
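The two rule shapes described above can be sketched in a few lines of Python. This is a simplified illustration only: the helper names are hypothetical, they are not part of PyFunceble or any adblocker, and real filter lists contain many more forms than these two.

```python
def domain_block_rule(domain: str) -> str:
    """Build a rule that blocks a domain and its subdomains."""
    return f"||{domain}^"


def element_hide_rule(domain: str, selector: str) -> str:
    """Build a rule that hides one element; the domain itself stays reachable."""
    return f"{domain}##{selector}"


def rule_kind(rule: str) -> str:
    """Classify a rule as 'domain-block', 'element-hide' or 'other' (simplified)."""
    if "##" in rule:
        return "element-hide"
    if rule.startswith("||") and rule.endswith("^"):
        return "domain-block"
    return "other"


if __name__ == "__main__":
    print(rule_kind(domain_block_rule("example.com")))
    print(rule_kind(element_hide_rule("yahoo.com", "#applet_p_63802")))
```

The point of the distinction is that only the first kind says anything about the domain being unwanted; the second kind names a site that the user still wants to visit.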
Okay so what about this format ? Which of the following mark the domain as a bad or good boy ?
||google.com$script,image
||api.google.com/papi/action$popup
facebook.com###player-above-2
~github.com,hello.world##
@@||cnn.com/*ad.xml
!||world.hello/*ad.xml
!@@||funceble.world/js
yahoo.com,msn.com,api.hello.world#@#awesomeWorld
!funilrys.com##body
hubgit.com|oohay.com|ipa.elloh.dlorw#@#awesomeWorld
I know you will not find them in the real world, but they are part of the tests for the decoder.
The ##[href^=... one is different: it's embedded in the iframe, and this is how you block those domains.
Okay I'm working on that implementation.
So in this rule:
hubgit.com|oohay.com|ipa.elloh.dlorw#@#awesomeWorld
they are all legit, right?
||google.com$script,image - this rule will not allow any scripts or images to be shown or executed on that domain
||api.google.com/papi/action$popup - this rule will stop the popup coming from that link
facebook.com###player-above-2 - this one will hide an element (looks like a video player) on that page
~github.com,hello.world## - hmmmm, haven't seen this one
@@||cnn.com/*ad.xml - this rule will whitelist that link on the webpage (@@ in front is whitelisting)
!||world.hello/*ad.xml - this will block it (! in the front is a comment)
!@@||funceble.world/js - this will whitelist that js script (! in the front is a comment)
yahoo.com,msn.com,api.hello.world#@#awesomeWorld - don't know
!funilrys.com##body - this will block the element
hubgit.com|oohay.com|ipa.elloh.dlorw#@#awesomeWorld - don't know
Stay put, let me do some research on the #@# rule, because I'm using AdGuard and haven't seen such a rule there.
Ok... in the above example the #@# rule allows (whitelists) that particular element on the listed domains, so yes, all those domains are legit.
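To make the `##` versus `#@#` distinction concrete, here is a minimal parsing sketch. The `CosmeticRule` type and `parse_cosmetic` function are hypothetical names made up for this illustration; the sketch assumes comma-separated domains and treats a leading `~` as "not on this domain", and it does not try to cover every variant found in real lists.

```python
import re
from typing import List, NamedTuple, Optional

# Domain list, then "##" (hide) or "#@#" (exception), then a non-empty selector.
COSMETIC = re.compile(r"^([^#]*)(#@#|##)(.+)$")


class CosmeticRule(NamedTuple):
    applies_to: List[str]   # domains the rule is active on
    not_on: List[str]       # domains written with a leading "~"
    selector: str           # the element selector
    is_exception: bool      # True for "#@#": the element is allowed, not hidden


def parse_cosmetic(rule: str) -> Optional[CosmeticRule]:
    """Parse an element-hiding rule; return None for anything else."""
    rule = rule.strip()
    if rule.startswith("!"):  # comment line
        return None
    match = COSMETIC.match(rule)
    if match is None:
        return None
    raw_domains, separator, selector = match.groups()
    names = [name.strip() for name in raw_domains.split(",") if name.strip()]
    return CosmeticRule(
        applies_to=[n for n in names if not n.startswith("~")],
        not_on=[n[1:] for n in names if n.startswith("~")],
        selector=selector,
        is_exception=(separator == "#@#"),
    )
```

Either way, the domains in front of the separator are sites where the rule is applied, so none of them is a candidate for a blocklist.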
Okay, let me first fix this issue with the current format; I will then review all the tests with you, as those need some hotfixes. Never thought about whitelisting :joy_cat:
I know, it's Java, more complex. Took me a while to get around it but I'm getting there.
@funilrys make it simple. Everything that has || in front and ^ at the end stays.
Everything that has href in it stays (filtered, of course, to leave the domain only). The rest should be removed, as those are rules that don't really concern any of us who will use the lists in hosts format.
Yeah but if I do that, I'll invalidate AdBlock/filter list like https://github.com/MajkiIT/polish-ads-filter :smile_cat:
Only need to take some time to understand how it works properly then will clean the mess I created!
Look at this one for example. It contains only legit domains, with rules that block certain elements only.
Ok, I know it will take time, but meanwhile, for everyone who uses the lists with dnsmasq etc. and not adblockers: can you PLEASE add https://raw.githubusercontent.com/Dawsey21/Lists/master/main-blacklist.txt to be filtered properly.
Also, you can start here; it's very well explained and will help you understand the basics: https://kb.adguard.com/en/general/how-to-create-your-own-ad-filters
Ok, I know it will take time, but meanwhile, for everyone who uses the lists with dnsmasq etc. and not adblockers: can you PLEASE add https://raw.githubusercontent.com/Dawsey21/Lists/master/main-blacklist.txt to be filtered properly.
PLEASE
@dnmTX ,
PyFunceble is fixed, please look at the tests for details.
As you mentioned, there really was an issue with my way of handling adblock lists. Therefore, here is the erratum:
Please read self.expected as the list of domains extracted from the given input (self.lines).
self.lines = [
"||funilrys.github.io$script,image",
"||google.com^$script,image",
"||twitter.com^helloworld.com",
"||api.google.com/papi/action$popup",
"facebook.com###player-above-2",
"~github.com,hello.world##.wrapper",
"@@||cnn.com/*ad.xml",
"!||world.hello/*ad.xml",
"bing.com,bingo.com#@##adBanner",
"!@@||funceble.world/js",
"yahoo.com,~msn.com,api.hello.world#@#awesomeWorld",
"!funilrys.com##body",
"hello#@#badads",
"hubgit.com|oohay.com|ipa.elloh.dlorw#@#awesomeWorld",
'##[href^="https://funceble.funilrys.com/"]',
"[AdBlock Plus 2.0]",
'##div[href^="http://funilrys.com/"]',
'com##[href^="ftp://funceble.funilrys-funceble.com/"]',
"/banner/*/img^" "|github.io|",
"|github.io|",
"||api.funilrys.com/widget/$",
]
self.expected = [
"funilrys.github.io",
"google.com",
"twitter.com",
"api.google.com",
"funceble.funilrys.com",
"funilrys.com",
"github.io",
"api.funilrys.com",
]
As the tests passed without any issue (cf.), I can attest that the next release and the current development version no longer pick up these false positives.
Please let me know if there is something else.
This issue will be closed on next release!
Cheers, Nissar
@funilrys from what I can tell and understand, self.lines is an example of domains that should not be added for filtering because they are legit? Am I close?
What about anything with ##div[href^=... ? Those are usually the bad ones that need blocking?
Another thing (just to make sure). Example:
||api.funilrys.com/widget/$
This is a partial link that could be api.funilrys.com/widget/bla/bla/bla/ad.js and the adblocker will catch it. The thing is, just because that domain is hosting some ad or telemetry script (Google usually does that), that doesn't mean the actual domain is bad. The question is whether there is a certain rule for how that domain will be considered: as bad or as good?
@dnmTX self.lines contains random lines that can be found in a regular AdBlock list. The objective of the code I write/wrote is to get the list self.expected as output, which is, in more practical terms, what we are going to test (the bad ones).
So from your point of view, self.expected represents the bad ones we have to test.
About ##div[href^=...
It's there because usually you have ##[href^=... but these variants also exist:
##div[href^=...
com##div[href^=...
com##[href^=...
With my review, the domain in the href attribute is extracted and formatted (protocol and "decorators" removed) :smile_cat:
Actually, from my point of view self.expected should be considered the good ones, with the exception of everything that has href in it.
Bad ones should start with || and end with ^, including all the href variations.
Wow you lost me :joy_cat:
For clarification: those are examples of formats. Do not consider the domains themselves; we are only talking about the domain extracted from each matched format :smile_cat:
| Expected/Extracted/Tested by PyFunceble | Line (example) |
|---|---|
| funilrys.github.io | \|\|funilrys.github.io$script,image |
| google.com | \|\|google.com^$script,image |
| twitter.com | \|\|twitter.com^helloworld.com |
| api.google.com | \|\|api.google.com/papi/action$popup |
| funceble.funilrys.com | ##[href^="https://funceble.funilrys.com/"] |
| funilrys.com | ##div[href^="http://funilrys.com/"] |
| github.io | \|github.io\| |
| api.funilrys.com | \|\|api.funilrys.com/widget/$ |
Also if we match for example hello.world##ad-selector we do not extract hello.world as a bad one.
Maybe I misunderstood something :thinking:
Also if we match for example hello.world##ad-selector we do not extract hello.world as a bad one.
Ok, that's good, that's how it's supposed to be, but...
If we match ||api.google.com/papi/action$popup do we extract api.google.com as a bad one or not?
This is where the tricky part is, because in this example api.google.com is a very legit domain that hosts ad scripts and so on, but also hosts things without which the web page will be broken.
If we have ||api.google.com/papi/action$popup or for example ||api.example.org/pap/hello$popup the system will extract, test and produce results for api.google.com and api.example.org respectively.
So basically you're saying that api.google.com will be blocked?
What if you have, let's say, ||yahoo.com/papi/action$popup
so the end result will be:
0.0.0.0 yahoo.com in the ACTIVE folder.
You're thinking about what happens afterwards.
PyFunceble works with the data you provide. This means that if you decide to have api.google.com in your list, PyFunceble will test it. If you decide to have api.google.com in the hosts file to test, PyFunceble will test it. If you decide to put ||api.google.com/papi/action$popup into your adblock list, PyFunceble will extract and test it. PyFunceble is a general tool that does what it is told to do: check the availability of a given domain, IPv4 or URL.
What you do with the results and data is up to you. That's why there is a whitelist in projects like Ultimate: we all know that false positives will always be there in such a big compilation. And we are not even talking about maintainers who block, for example, google.com.
It's their list, our tests, our compilation, but we still have to deal with whitelisting because the upstream maintainer may not want to whitelist x or y even if they are legit and not harmful.
By the way if you're looking for a whitelisting script we have one at https://github.com/Ultimate-Hosts-Blacklist/dev-center/tree/whitelisting :smile_cat:
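As a rough illustration of the "test first, whitelist afterwards" workflow described above, here is a minimal sketch. It is not the Ultimate whitelisting script linked above; `apply_whitelist` is a hypothetical helper that only does exact, case-insensitive matching, whereas a real whitelisting tool may support more.

```python
from typing import Iterable, List


def apply_whitelist(domains: Iterable[str], whitelist: Iterable[str]) -> List[str]:
    """Drop every tested domain that appears in the whitelist (exact match only)."""
    allowed = {entry.strip().lower() for entry in whitelist}
    return [d for d in domains if d.strip().lower() not in allowed]


if __name__ == "__main__":
    tested = ["api.google.com", "api.example.org", "yahoo.com"]
    print(apply_whitelist(tested, ["api.google.com", "yahoo.com"]))
```

Keeping this step separate from extraction is the design being argued for here: the extractor reports what the list references, and the end user decides what is safe to block.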
Ok, I see now. You were more concerned about how to properly extract the domains; I was more worried about false positives (understandably, as I'm the end user). So that I don't have to worry about that, because I'm really trying to automate everything and not even think about it, can you PLEASE (again) add the other LISTS and let's be done with it.
P.S. I can't use Python scripts on my router. @funilrys how hard is it to add one more list? Why so stubborn?
fyi - I have a couple of things to bring up. 1st, I'm going through the process of checking both the EasyList and EasyPrivacy blocklists (which are Adblock+ formatted) and they've been running all day so it may be a little while after but I can attach the outputs of each to this to see where there might be issues.
2nd- if we're talking about using PyFunceble to process an adblockplus formatted list for use as a hosts file, you'd only want to go after a subset of the domains. I assume PyFunceble currently tries to parse out all of the domains referenced in an ABP list for validation. But this would also capture what would then be false-positives if they were to apply to a hosts file.
pfBlockerNG (a pretty awesome package for pfSense firewall) also includes a feature where it parses out the hosts from both the EasyList & EasyPrivacy lists and adds them to a traditional DNS blocklist. I don't know if it might help to see the logic behind it- even though it's PHP and all.
I think I found the part of pfBlockerNG that gives an idea of what they extract from these sorts of lists to only capture domain names:
The short of it to me was to only process lines in ABP-formatted lists that began with || and immediately ended with ^$third-party, for example:
||example.com^$third-party
Looking through an adblockplus syntax'd list, for something like PyFunceble, it seems like you'd want to ignore any lines with this junk:
starts with:
- !
- @@
or contains:
- ## (div stuff), aka element hiding
- $ options (xmlhttprequest, object, popup, fonts, redirect, popunder, generichide, etc.)
- domain= (which sounds like domains you wouldn't want to put into a hosts file)
If it's a line that is just ||something.domain.tld^$third-party, no more, no less, process those.
I'll be interested in your output @jawz101.
Meanwhile:
- If it starts with ! we already ignore it.
- If it starts with @ we already ignore it.
- If it starts with [ we already ignore it.
- If it starts with / we already ignore it.
- If href^=unilrys.github.io is present we extract unilrys.github.io for testing.
- If it is in the format ||funilrys.github.io$script,image we extract funilrys.github.io for testing.
- If it is in the format |github.io| we extract github.io for testing.
- If it is in the format ||twitter.com^helloworld.com we extract twitter.com for testing.
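The rules listed above can be sketched as a small decoder. This is not PyFunceble's actual code: the `extract` function and the regular expressions are assumptions written for illustration. Two choices are inferred rather than stated in the thread: element-hiding lines without an href are skipped (matching the earlier statement that hello.world##ad-selector is not extracted), and only http, https or scheme-less href targets are accepted, because the sample input earlier in the thread does not list the ftp:// host among the expected results.

```python
import re
from typing import Optional

# href^= target: optional quote, optional scheme, then the host part.
HREF = re.compile(
    r"href\^=[\"']?(?:(?P<scheme>[a-z][a-z0-9+.-]*)://)?(?P<host>[a-z0-9.-]+)",
    re.IGNORECASE,
)
DOUBLE_PIPE = re.compile(r"^\|\|([a-z0-9.-]+)", re.IGNORECASE)     # ||domain...
SINGLE_PIPE = re.compile(r"^\|([a-z0-9.-]+)\|$", re.IGNORECASE)    # |domain|


def extract(line: str) -> Optional[str]:
    """Return the domain to test for one adblock line, or None to ignore it."""
    line = line.strip()
    if not line or line[0] in "!@[/":
        return None  # comments, exceptions, headers, path-only rules
    href = HREF.search(line)
    if href:
        scheme = (href.group("scheme") or "").lower()
        if scheme not in ("", "http", "https"):
            return None  # assumption: other schemes are not extracted
        return href.group("host").lower()
    if "##" in line or "#@#" in line:
        return None  # element hiding on a site, not a domain to block
    for pattern in (DOUBLE_PIPE, SINGLE_PIPE):
        match = pattern.match(line)
        if match:
            return match.group(1).lower()
    return None


if __name__ == "__main__":
    for sample in ("||twitter.com^helloworld.com", "|github.io|", "!funilrys.com##body"):
        print(sample, "->", extract(sample))
```

Note that this mirrors the "aggressive" behaviour under discussion: a rule such as ||funilrys.github.io$script,image still yields funilrys.github.io, which is fine for availability testing but not for building a blocklist.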
I'm aware that it may be aggressive, but I tried hard to comply with https://adblockplus.org/filter-cheatsheet along with the needs of "our field".
Indeed, in the case of href^=unilrys.github.io, I consider that by writing such a rule we implicitly treat the referenced href as a bad one, so we extract and test it.
What's your input on this short statement? :smile_cat:
Have a good night.
Cheers, Nissar
I drove to go get some gas and I was still thinking about it and felt I didn't have a complete idea. I was today, coincidentally, trying to do by hand with regex and Notepad++ what we're talking about so I figured I'd chime in :/
... to add to the ||example.com^$third-party I'd also want to block ||example.com^
What you said makes sense but is the goal to validate all domains in ABP rules or to also process them with the end result of a blocklist?
The reason I ask is I wouldn't want to block funilrys.github.io in this example because I wouldn't want to block the whole domain if an ABP rule was just trying to block certain bits of its content.
||funilrys.github.io$script,image
just depends on if the goal is to validate domains or also take them and then, say, make a pi-hole blocklist out of them. As is, it sounds like it would have a lot of false positives if I put the output into a blocklist file.
... right now I'm going through https://easylist.to/easylist/easylist.txt by hand and finding examples of the lines I try to exclude and then see what I'm left with...
attached is easyprivacy's list (I zipped up the cached files as well in case you want to set it up on a schedule like some others.)
ran with PyFunceble --adblock --link https://easylist.to/easylist/easyprivacy.txt
The EasyList is still on the K's... it's about x10 larger list than EasyPrivacy.
If the output looks good to you, @funilrys I want to send it on to the list maintainers. The EasyList one looks pretty red so I'm curious how it will turn out.
Thanks @jawz101 will look into that when I have a bit of time.
||example.com^ is already extracted as expected in the test: https://github.com/funilrys/PyFunceble/blob/4c3683225c4d63808a456bb35443c8d0b414ecfd/tests/test_core.py#L361
I get your point; I had not thought about that little third-party option. Will implement :+1:
What about the other options @jawz101?
from https://adblockplus.org/filter-cheatsheet#options:
| Option | Description |
|---|---|
| script, ~script | Include or exclude JavaScript files |
| image, ~image | Include or exclude image files |
| stylesheet, ~stylesheet | Include or exclude stylesheets (CSS files) |
| object, ~object | Include or exclude content handled by browser plugins like Flash or Java |
| object-subrequest, ~object-subrequest | Include or exclude files loaded by browser plugins |
| subdocument, ~subdocument | Include or exclude pages loaded within pages (frames) |
| document (Exceptions) | Used to whitelist the page itself (e.g. @@\|\|example.com^$document) |
| elemhide (Exceptions) | Used to prevent element rules from applying on a page (e.g. @@\|\|example.com^$elemhide) |
| domain= (Domains) | Specify a list of domains, separated by bar lines (\|), on which a filter should be active. A filter may be prevented from being activated on a domain by preceding the domain name with a tilde (~). |
| third-party, ~third-party (Domains) | Specify whether a filter should be active on third-party or first domains |
| rewrite= (Misc) | Specify a rewrite rule for the URL to be performed before downloading. If the filter is a regular expression, use $n to insert submatches into the rewritten URL. See JavaScript own String.prototype.replace(). |
Is extracting only third-party sufficient?
Well, ||example.com^$third-party and ||example.com^ are what I ended up with, I think.
As for the rest of them, the advanced syntax comes into play when you have conditions, and that's where I see things that would cause breakage.
fancy conditions:
- ||example.com^$image : only block images from example.com
- ||example.com^$third-party,script,object : only block it if it's 3rd party, or if it's first party block its scripts and objects. Like, I might still need some of example.com's 1st party stuff. In fact, if someone tried to process a uBlock list, gorhill actually made tons of additional things to block
- ||example.com^...elemhide : make the network connection but just hide the resource (say, you may need to establish a connection to that subdomain to get some parts of the webpage but remove some of the banners and stuff it also wants to show)
- ||example.com^... domain=somesite.com : only block example.com if it's on somesite.com
less false positives:
- ||example.com^ : block example.com. Basically, use ABP rules as if it were a DNS/hostfile-styled blocker
- ||example.com^$third-party : block example.com if it is third party. Even though it is a condition, these rules don't seem to block legit sites you'd visit.
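The "less false positives" selection above amounts to accepting exactly two shapes and rejecting everything else. A minimal sketch, assuming a simplified notion of what a domain looks like; the `hosts_safe_domain` name and the pattern are hypothetical and not taken from pfBlockerNG or PyFunceble:

```python
import re
from typing import Optional

# Only "||domain^" and "||domain^$third-party", nothing before or after.
STRICT_RULE = re.compile(r"^\|\|([a-z0-9.-]+)\^(?:\$third-party)?$", re.IGNORECASE)


def hosts_safe_domain(line: str) -> Optional[str]:
    """Return the domain if the whole line is one of the two accepted shapes."""
    match = STRICT_RULE.match(line.strip())
    return match.group(1).lower() if match else None


if __name__ == "__main__":
    for rule in ("||example.com^", "||example.com^$third-party", "||example.com^$image"):
        print(rule, "->", hosts_safe_domain(rule))
```

Because the pattern is anchored at both ends, comments, exceptions, element-hiding rules and any rule with extra options fall through without needing a separate ignore list.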
Adding the EasyList output because it finished sometime last night. It's probably more valuable to you than the EasyPrivacy report because it includes a bunch of element hiding junk. Because of this, pfBlocker doesn't actually use the famous EasyList itself in its processing and instead uses its little brother, the "EasyList no elem hiding" list found on this page, since it removes a lot of the fancy conditional stuff and is more suited to strict rules that block the actual connections from occurring.
One thing I noticed with easylist is "if I was using PyFunceble to validate any domain it found in an ABP+ rule, it would be fine. if I was using this to process a list for use as a blocklist, I'd be screwed."
If you search the list of active hosts for google.com or github.com you will see that it checks those domains because they were somewhere in a rule. If I was to throw this into a blocklist it would not be great.
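Hi @jawz101, I'm writing an improvement, but we have to admit that it's impossible to avoid false positives. That's why whitelisting is more important than blocking.
Now we have to decide between 2 ways:
- We extract and test all possible domains. or
- We extract all domains which are (or may be) relevant.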
For now, I'm implementing the second way but we may think about the other way in the future or as an extra option.
funilrys : but we may think about the other way in the future or as an extra option.
Yes, it would be the most useful, especially for ad-block filter list maintainers who want to get rid of all dead domains.
funilrys : For now, I'm implementing the second way
The second way is also not bad for a start; it will still extract many domains. Basically I agree with this: https://github.com/funilrys/PyFunceble/issues/13#issuecomment-426394422. To summarize further, these are bad domains which can be converted to hosts:
- ||domain.com
- ||domain.com^
- ||domain.com$third-party
- ||domain.com^$third-party
- the href ones: https://github.com/funilrys/PyFunceble/issues/13#issuecomment-426453334
- ||domain.com^$3p (shortened version of the above: $3p = $third-party)
- ||domain.com^$all (all-in-1 combination of all options, excluding $important)
- ||domain.com^$document
- ||domain.com^$important (overrides whitelist filters)
and various combinations:
- ||domain.com^$3p,important
- ||domain.com^$important,3p
- ||domain.com^$all,important
- ||domain.com^$important,all
and variations without ^ as well.
Notice: variations without ^ are very rare, they're mostly typos, but they are still valid filters.
All other filters that have anything in addition to the above should not be extracted. Examples:
- ||twitter.com^helloworld.com from: https://github.com/funilrys/PyFunceble/issues/13#issuecomment-443501130
- ||funilrys.github.io$script,image from: https://github.com/funilrys/PyFunceble/issues/13#issuecomment-443475390
- ||domain.com^$domain=domain1.com|domain2.com
- ||domain.com^$third-party,image
As they don't block the whole domain (neither twitter.com nor funilrys.github.io nor domain1.com, as they can still be visited), which means I agree with ( https://github.com/funilrys/PyFunceble/issues/13#issuecomment-443475390 ) :
jawz101 : The reason I ask is I wouldn't want to block funilrys.github.io in this example because I wouldn't want to block the whole domain if an ABP rule was just trying to block certain bits of its content.
Of course ||domain.com^$3p / ||domain.com^$third-party domains can still be visited as well, but they are mostly just ad and tracking servers.
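The selection proposed in this comment can be sketched as follows. This is an illustration of the proposal only, not what PyFunceble does: `whole_domain`, `to_hosts` and the option set are hypothetical names built from the list above, the href case is left out, and the 0.0.0.0 prefix follows the hosts line quoted earlier in the thread.

```python
import re
from typing import Iterable, List, Optional

# Options that, per the comment above, still mean "block the whole domain".
WHOLE_DOMAIN_OPTIONS = {"third-party", "3p", "all", "document", "important"}

# "||domain", optional "^", optional "$options"; anything else does not match.
RULE = re.compile(r"^\|\|([a-z0-9.-]+)\^?(?:\$(.+))?$", re.IGNORECASE)


def whole_domain(rule: str) -> Optional[str]:
    """Return the domain only when the rule blocks it as a whole."""
    match = RULE.match(rule.strip())
    if match is None:
        return None
    domain, options = match.groups()
    if options is not None and not set(options.split(",")) <= WHOLE_DOMAIN_OPTIONS:
        return None  # e.g. $script,image or $domain=... only block parts
    return domain.lower()


def to_hosts(rules: Iterable[str]) -> List[str]:
    """Convert accepted rules to hosts-file lines, keeping first-seen order."""
    seen: List[str] = []
    for rule in rules:
        domain = whole_domain(rule)
        if domain and domain not in seen:
            seen.append(domain)
    return [f"0.0.0.0 {domain}" for domain in seen]


if __name__ == "__main__":
    print(to_hosts(["||domain.com^$3p,important", "||twitter.com^helloworld.com"]))
```

Splitting the options into a set makes the order irrelevant, so $3p,important and $important,3p are treated the same way.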
Interesting @kulfoon,
Thanks for your feedback! I still chose to extract twitter.com, funilrys.github.io and domain.com because of the use case described in #42.
I still added your examples to the tests, and the current code passes them!
Thanks again for your feedback. Stay safe and healthy!
Thanks you too, however, it's getting more and more confusing.
funilrys: https://github.com/funilrys/PyFunceble/issues/13#issuecomment-444576689
Since you decided to implement the second way, I think you should stop at extracting only the domains that are meant to be completely (or almost completely) blocked, as in my previous comment. That would also cover:
jawz101 : https://github.com/funilrys/PyFunceble/issues/13#issuecomment-443475390 : just depends on if the goal is to validate domains or also take them and then, say, make a pi-hole blocklist out of them.
By extracting anything more, like:
funilrys : I still chose to extract twitter.com, funilrys.github.io and domain.com because of the use case described in #42.
+ facebook.com##.search from https://github.com/funilrys/PyFunceble/issues/42#issuecomment-526795225
you will not cover jawz101's use case above, because it causes false positives, just as he said:
jawz101 : https://github.com/funilrys/PyFunceble/issues/13#issuecomment-443475390 : As is, it sounds like it would have a lot of false positives if I put the output into a blocklist file.
You are also going beyond what the second way should be, so you will end up with neither the first way nor the second way but a strange mix of both. So why not implement the first way separately, by simply extracting all domains, so that it covers all the extraordinary domains, instead of partially extracting extraordinary domains in the second way? What is the sense in extracting just a part of them? Alternatively, you could put all of your extraordinary domains behind the --aggressive switch https://github.com/funilrys/PyFunceble/issues/42#issuecomment-526810118 .
funilrys : I still added your examples to the tests and the current code passes it!
I appreciate it, but perhaps there is no need to add at least these: https://github.com/funilrys/PyFunceble/blob/083a4ccc601cb05799a85676901c5e2b4d8a4249/tests/test_converter_adblock.py#L120 https://github.com/funilrys/PyFunceble/blob/083a4ccc601cb05799a85676901c5e2b4d8a4249/tests/test_converter_adblock.py#L122
because such (or at least similar) examples are already present in the tests: https://github.com/funilrys/PyFunceble/blob/083a4ccc601cb05799a85676901c5e2b4d8a4249/tests/test_converter_adblock.py#L89 https://github.com/funilrys/PyFunceble/blob/083a4ccc601cb05799a85676901c5e2b4d8a4249/tests/test_converter_adblock.py#L118
Greets.
More failures:
| Test filter | Extraction result |
|---|---|
| \|\|site1.com | site1.com |
| \|\|site2.com^ | site2.com |
| \|\|site3.com$ | site3.com |
| \|\|site4.com/ | site4.com |
| \|\|site5.com* | site5.com* failure (artefact) |
| \|\|site6.com$third-party | site6.com |
| \|\|site7.com^$third-party | site7.com |
| \|\|site8.com^$3p | failure |
| \|\|site9.com^$all | site9.com |
| \|\|site10.com^$document | site10.com |
| \|\|site11.com^$important | failure |
| \|\|site12.com^$3p,important | site12.com |
| \|\|site13.com^$important,3p | failure |
| \|\|site14.com^$all,important | site14.com |
| \|\|site15.com^$important,all | site15.com |
| \|\|site16.com^$doc | failure |
| \|\|site17.com^$document | site17.com |
| \|\|site18.com^$domain=site19.com | site18.com, site19.com |
| ^adv^$domain=site20.com | failure |
| adv$domain=site21.com | failure |
| adv^$domain=site22.com | failure |
As for the last 3 failures, many such failures can be found in https://easylist-downloads.adblockplus.org/easylistpolish.txt
The list contains about 2961 domains, but only 2459 are found by Adblock Decoder (with the --aggressive option), which gives 83% efficiency.
funilrys : https://github.com/funilrys/PyFunceble/issues/13#issuecomment-623357709 : I still chose to extract twitter.com, funilrys.github.io and domain.com because of the use case described in https://github.com/funilrys/PyFunceble/issues/42#issuecomment-526795225.
funilrys : https://github.com/funilrys/PyFunceble/issues/13#issuecomment-444576689 : Now we have to decide between 2 ways:
- We extract and test all possible domains. or
- We extract all domains which are (or may be) relevant.
domain=, the domain should be removed then), I mean the case mentioned in https://github.com/funilrys/PyFunceble/issues/42#issuecomment-526795225 applies to almost all domains and filters...
I don't have time yet, @keczuppp, but let me reopen this so that I can answer you when I get a bit of time.
For others who stumble on this thread and wonder how they can solve this, there is a usage example here: https://github.com/funilrys/PyFunceble/discussions/219
I've edited my comment and added more failures.
Hmm what did you change @keczuppp ??
I didn't delete the history of changes, doesn't it work for you?
Anyway, I added last 4 rows of the table and the description + spoiler.
None the less, thanks for your reply
As reported by @dnmTX at https://github.com/Ultimate-Hosts-Blacklist/dev-center/issues/9:
are ignored.