Adblock decoder ignore some portion when decoding

funilrys commented 6 years ago

As reported by @dnmTX at https://github.com/Ultimate-Hosts-Blacklist/dev-center/issues/9:

everything with ##[href^=....

are ignored.

dnmTX commented 6 years ago

Caught another bug: this section in the original lists contains rules that only remove elements from legit sites:

After PyFunceble filters the lists, the end result in domain.list is: (screenshot: end_result) Notice how legit sites are being blocked?

funilrys commented 6 years ago

Okay, you have to explain AdBlock to me then @dnmTX :smile_cat: I'm not a big fan of it, as its syntax is confusing.

So how do I tell a legit site from a bad one in adblock? I thought that adblock was only about blocking, not whitelisting :thinking:

dnmTX commented 6 years ago

OK... I'll cover the basics only, to be clearer:

If you want to block a domain, you need to add || in front and ^ at the end (it will catch the subdomains as well).

If you want to block just an element on that website, you need to find it (Chrome dev tools help a lot with that) and add ## after the domain name. Example: open yahoo.com (I removed so many elements from that page you wouldn't recognize it). Now look at your yahoo page and compare it to mine: (screenshot: yahoo)

Much cleaner, no videos, no annoyances. Rule examples:

yahoo.com###applet_p_50000278
yahoo.com###applet_p_32209491
yahoo.com###applet_p_50000277
yahoo.com###applet_p_63802
yahoo.com###applet_p_63796
yahoo.com###sticky-lrec2-footer

funilrys commented 6 years ago

Okay, so what about these formats? Which of the following mark the domain as a bad or a good boy?

||google.com$script,image
||api.google.com/papi/action$popup
facebook.com###player-above-2
~github.com,hello.world##
@@||cnn.com/*ad.xml
!||world.hello/*ad.xml
!@@||funceble.world/js
yahoo.com,msn.com,api.hello.world#@#awesomeWorld
!funilrys.com##body
hubgit.com|oohay.com|ipa.elloh.dlorw#@#awesomeWorld

I know you will not find them in the real world, but they are part of the tests for the decoder.

dnmTX commented 6 years ago

The ##[href^=.... is different. It's embedded in the iframe, and this is how you block those domains.

funilrys commented 6 years ago

Okay I'm working on that implementation.

So in this

hubgit.com|oohay.com|ipa.elloh.dlorw#@#awesomeWorld

they are all legit right ?

dnmTX commented 6 years ago

||google.com$script,image - this rule will not allow any scripts or images to be shown or executed on that domain
||api.google.com/papi/action$popup - this rule will stop the popup coming from that link
facebook.com###player-above-2 - this one will hide an element (looks like a video player) on that page
~github.com,hello.world## - hmmmm, haven't seen this one
@@||cnn.com/*ad.xml - this rule will whitelist that link on the webpage (@@ in front is whitelisting)
!||world.hello/*ad.xml - this will block it (! in front is a comment)
!@@||funceble.world/js - this will whitelist that js script (! in front is a comment)
yahoo.com,msn.com,api.hello.world#@#awesomeWorld - don't know
!funilrys.com##body - this will block an element
hubgit.com|oohay.com|ipa.elloh.dlorw#@#awesomeWorld - don't know

dnmTX commented 6 years ago

Stay put, let me do some research on the #@# rule, because I'm using AdGuard and haven't seen such a rule there.

dnmTX commented 6 years ago

Ok... in the above example the rule #@# allows(whitelists) that particular element on the listed domains,so yes,all those domains are legit
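A minimal sketch of how the markers discussed so far could be told apart, assuming standard Adblock Plus syntax (`!` comment, `@@` exception, `##` element hiding, `#@#` element hiding exception). The function name `classify` and the returned labels are hypothetical and not part of PyFunceble:

```python
def classify(rule: str) -> str:
    """Label an adblock line by its marker (illustrative labels only)."""
    rule = rule.strip()
    if rule.startswith("!"):
        return "comment"
    if rule.startswith("["):
        return "header"
    if rule.startswith("@@"):
        return "exception"
    # Check "#@#" before "##" so that "#@##adBanner" is not mislabelled.
    if "#@#" in rule:
        return "element hiding exception"
    if "##" in rule:
        return "element hiding"
    return "blocking"


for line in ("!funilrys.com##body",
             "@@||cnn.com/*ad.xml",
             "facebook.com###player-above-2",
             "hubgit.com|oohay.com|ipa.elloh.dlorw#@#awesomeWorld",
             "||google.com$script,image"):
    print(f"{classify(line):26} {line}")
```

Only the last category names something an adblocker refuses to load; the others either hide page elements or allow things.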

funilrys commented 6 years ago

Okay, let me fix this issue first with the current format. I will then review all the tests with you, as those need some hotfixes. Never thought about whitelisting :joy_cat:

dnmTX commented 6 years ago

I know,it's Java,more complex.Took me a while to get around it but i'm getting there

dnmTX commented 6 years ago

@funilrys make it simple. Everything that has || in front and ^ at the end stays. Everything that has href in it stays (filtered, of course, to leave the domain only). The rest should be removed, as those rules don't really concern any of us who will use the lists in hosts format.
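The two-rule proposal above can be sketched as a simple predicate. This is an illustration of the suggestion only, not how PyFunceble decides; `keep_for_hosts` is a hypothetical name:

```python
def keep_for_hosts(rule: str) -> bool:
    """Keep '||...^' rules and anything carrying an href selector."""
    rule = rule.strip()
    is_domain_block = rule.startswith("||") and rule.endswith("^")
    has_href = "[href" in rule
    return is_domain_block or has_href


rules = [
    "||example.com^",
    '##[href^="http://funilrys.com/"]',
    "yahoo.com###sticky-lrec2-footer",
    "@@||cnn.com/*ad.xml",
]
print([rule for rule in rules if keep_for_hosts(rule)])
```

The kept href rules would still need their domain extracted before they can go into a hosts file.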

funilrys commented 6 years ago

Yeah but if I do that, I'll invalidate AdBlock/filter list like https://github.com/MajkiIT/polish-ads-filter :smile_cat:

funilrys commented 6 years ago

Only need to take some time to understand how it works properly then will clean the mess I created!

dnmTX commented 6 years ago

Look at this one for example. In it, all domains are legit, with rules to block certain elements only.

dnmTX commented 6 years ago

OK, I know it will take time, but meanwhile, for everyone who uses the lists with dnsmasq etc. and not adblockers: can you PLEASE add https://raw.githubusercontent.com/Dawsey21/Lists/master/main-blacklist.txt to be filtered properly?

dnmTX commented 6 years ago

Also you can start here,it's very well explained and will help you understand the basics: https://kb.adguard.com/en/general/how-to-create-your-own-ad-filters

dnmTX commented 6 years ago

> Ok,i know it will take time but meanwhile,for everyone who uses the lists with dnsmasq etc etc and not adblockers. Can you PLEASE add https://raw.githubusercontent.com/Dawsey21/Lists/master/main-blacklist.txt to be filtered properly.

PLEASE

funilrys commented 6 years ago

@dnmTX ,

PyFunceble is fixed, please look at the tests for details.

As you mentioned, there really was an issue with my way of handling adblock lists. Therefore, here is the erratum:

Read self.expected as the list of domains extracted from the given input (self.lines).

self.lines = [
    "||funilrys.github.io$script,image",
    "||google.com^$script,image",
    "||twitter.com^helloworld.com",
    "||api.google.com/papi/action$popup",
    "facebook.com###player-above-2",
    "~github.com,hello.world##.wrapper",
    "@@||cnn.com/*ad.xml",
    "!||world.hello/*ad.xml",
    "bing.com,bingo.com#@##adBanner",
    "!@@||funceble.world/js",
    "yahoo.com,~msn.com,api.hello.world#@#awesomeWorld",
    "!funilrys.com##body",
    "hello#@#badads",
    "hubgit.com|oohay.com|ipa.elloh.dlorw#@#awesomeWorld",
    '##[href^="https://funceble.funilrys.com/"]',
    "[AdBlock Plus 2.0]",
    '##div[href^="http://funilrys.com/"]',
    'com##[href^="ftp://funceble.funilrys-funceble.com/"]',
    "/banner/*/img^" "|github.io|",
    "|github.io|",
    "||api.funilrys.com/widget/$",
]

self.expected = [
    "funilrys.github.io",
    "google.com",
    "twitter.com",
    "api.google.com",
    "funceble.funilrys.com",
    "funilrys.com",
    "github.io",
    "api.funilrys.com",
]

As the tests passed without any issue (cf.), I can attest that the next release and the current development version no longer produce these false positives.

Please let me know if there is something else.

This issue will be closed on next release!

Cheers, Nissar

funilrys commented 6 years ago

Also my tests should now be compliant with https://adblockplus.org/filter-cheatsheet

dnmTX commented 6 years ago

@funilrys from what I can tell, self.lines is an example of domains that should not be added for filtering because they are legit? Am I close?

What about anything with ##div[href^=...? Those are usually bad ones that need blocking.

Another thing (just to make sure). Example: ||api.funilrys.com/widget/$ is a partial link that could be api.funilrys.com/widget/bla/bla/bla/ad.js, and the adblocker will catch it. But just because that domain hosts some ad or telemetry script (Google usually does that) doesn't mean that the domain itself is bad. The question is whether there is a rule for how that domain will be considered: as bad or as good?

funilrys commented 6 years ago

@dnmTX self.lines contains random lines that can be found in a regular AdBlock list. The objective of the code I wrote is to get the list self.expected as output, which is, in practical terms, what we are going to test (the bad ones).

So from your point of view, self.expected represents the bad ones we have to test.

About ##div[href^=...: it's there because you usually have ##[href^=..., but these variants also exist:

##div[href^=...
com##div[href^=...
com##[href^=...

With my review, the domain which is in the href attribute is extracted and formatted (remove protocol and "decorators") :smile_cat:
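A minimal sketch of that extraction step, assuming the href value is an ordinary URL prefix. It uses `urllib.parse.urlsplit` from the standard library to drop the protocol, path and port; `domain_from_href_rule` is a hypothetical helper and does not claim to match PyFunceble's handling of every variant:

```python
import re
from typing import Optional
from urllib.parse import urlsplit

# Capture whatever follows 'href^=' up to the closing quote or bracket.
HREF_VALUE = re.compile(r"""\[href\^=["']?([^"'\]]+)""")


def domain_from_href_rule(rule: str) -> Optional[str]:
    """Return the host named in an href selector, or None if there is none."""
    match = HREF_VALUE.search(rule)
    if match is None:
        return None
    value = match.group(1)
    if "://" not in value:
        # Without a scheme, urlsplit would treat the host as a path.
        value = "//" + value
    return urlsplit(value).hostname


print(domain_from_href_rule('##[href^="https://funceble.funilrys.com/"]'))
print(domain_from_href_rule('##div[href^="http://funilrys.com/"]'))
```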

dnmTX commented 6 years ago

Actually from my point of view the self.expected should be considered the good ones with exception of everything that has href in it. Bad ones should start with || and end with ^ including all the href variations.

funilrys commented 6 years ago

Wow you lost me :joy_cat:

For clarification: those are examples of formats. Do not consider the domains themselves; we are only talking about the domain extracted from each matched format :smile_cat:

| Expected/Extracted/Tested by PyFunceble | Line (example) |
|---|---|
| funilrys.github.io | `\|\|funilrys.github.io$script,image` |
| google.com | `\|\|google.com^$script,image` |
| twitter.com | `\|\|twitter.com^helloworld.com` |
| api.google.com | `\|\|api.google.com/papi/action$popup` |
| funceble.funilrys.com | `##[href^="https://funceble.funilrys.com/"]` |
| funilrys.com | `##div[href^="http://funilrys.com/"]` |
| github.io | `\|github.io\|` |
| api.funilrys.com | `\|\|api.funilrys.com/widget/$` |

Also if we match for example hello.world##ad-selector we do not extract hello.world as a bad one.

funilrys commented 6 years ago

Maybe I misunderstood something :thinking:

dnmTX commented 6 years ago

> Also if we match for example hello.world##ad-selector we do not extract hello.world as a bad one.

OK, that's good, that's how it's supposed to be, but... if we match ||api.google.com/papi/action$popup, do we extract api.google.com as a bad one or not? This is where the tricky part is, because in this example api.google.com is a very legit domain that hosts ad scripts and so on, but also hosts things without which the web page will be broken.

funilrys commented 6 years ago

If we have ||api.google.com/papi/action$popup or for example ||api.example.org/pap/hello$popup the system will extract, test and produce result respectively for api.google.com and api.example.org.

dnmTX commented 6 years ago

Basically you're saying that api.google.com will be blocked?

What if you have, let's say, ||yahoo.com/papi/action$popup? Then the end result will be 0.0.0.0 yahoo.com in the ACTIVE folder.

funilrys commented 6 years ago

You're thinking about what comes after.

PyFunceble works with the data you provide. This means that if you decide to have api.google.com in your list, PyFunceble will test it. If you decide to have api.google.com in the hosts file to test, PyFunceble will test it. If you decide to put ||api.google.com/papi/action$popup into your adblock list, PyFunceble will extract and test it. PyFunceble is a general-purpose tool which does what it is told to: check the availability of a given domain, IPv4 or URL.

What you do with the results and data is up to you. That's why there is a whitelist in projects like Ultimate: we all know that false positives will always be there in such a big compilation. And we are not even talking about maintainers who block, for example, google.com. It's their list, our tests, our compilation, but we still have to deal with whitelisting, because the upstream maintainer may not want to whitelist x or y even if they are legit and not harmful.
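A rough sketch of that "after" step, assuming the tested domains and a whitelist are plain lists of names. `build_hosts` and the `sink_ip` parameter are hypothetical, and this is not the Ultimate Hosts Blacklist whitelisting script:

```python
from typing import Iterable, List


def build_hosts(domains: Iterable[str],
                whitelist: Iterable[str],
                sink_ip: str = "0.0.0.0") -> List[str]:
    """Drop whitelisted names, then format the rest as hosts-file lines."""
    allowed = {name.lower() for name in whitelist}
    kept = sorted({name.lower() for name in domains} - allowed)
    return [f"{sink_ip} {name}" for name in kept]


tested = ["api.google.com", "yahoo.com", "api.example.org"]
for line in build_hosts(tested, whitelist=["api.google.com"]):
    print(line)
```

Extraction and testing stay unchanged; only the final list is filtered, so the false-positive decision remains with whoever owns the whitelist.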

funilrys commented 6 years ago

By the way if you're looking for a whitelisting script we have one at https://github.com/Ultimate-Hosts-Blacklist/dev-center/tree/whitelisting :smile_cat:

dnmTX commented 6 years ago

OK, I see now. You were more concerned about how to properly extract the domains; I was more worried about false positives (which is understandable, since I'm the end user). So that I don't have to worry about that, because I'm really trying to automate everything and not even think about it, can't you PLEASE (again) add the other LISTS and let's be done with it.

P.S. I can't use Python scripts on my router. @funilrys how hard is it to add one more list? Why so stubborn?

jawz101 commented 6 years ago

fyi - I have a couple of things to bring up. 1st, I'm going through the process of checking both the EasyList and EasyPrivacy blocklists (which are Adblock+ formatted) and they've been running all day so it may be a little while after but I can attach the outputs of each to this to see where there might be issues.

2nd- if we're talking about using PyFunceble to process an adblockplus formatted list for use as a hosts file, you'd only want to go after a subset of the domains. I assume PyFunceble currently tries to parse out all of the domains referenced in an ABP list for validation. But this would also capture what would then be false-positives if they were to apply to a hosts file.

pfBlockerNG (a pretty awesome package for pfSense firewall) also includes a feature where it parses out the hosts from both the EasyList & EasyPrivacy lists and adds them to a traditional DNS blocklist. I don't know if it might help to see the logic behind it- even though it's PHP and all.

I think I found the part of pfBlockerNG that gives an idea of what they extract from these sorts of lists to only capture domain names:

https://github.com/pfsense/FreeBSD-ports/blob/devel/net/pfSense-pkg-pfBlockerNG-devel/files/usr/local/pkg/pfblockerng/pfblockerng.inc#L5687

The short of it, to me, was to only process lines in ABP-formatted lists that consist of exactly ||example.com^$third-party

Looking through an Adblock Plus syntax list, for something like PyFunceble, it seems like you'd want to ignore any line with this junk.

Starts with:

commented: !
whitelisted: @@

or contains:

blocking of specific parts of pages (the ## div stuff), aka element hiding
specific web technologies: $ xmlhttprequest, object, popup, fonts, redirect, popunder, generichide, etc.
certain domains only if seen or not seen on another certain domain: domain= (which sounds like domains you wouldn't want to put into a hosts file)

if it's a line that is just

||something.domain.tld^$third-party,

no more, no less - process those.
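The "no more, no less" rule can be sketched with a full-match regular expression, assuming hostnames made of letters, digits, dots and hyphens. `strict_domains` is a hypothetical helper, not pfBlockerNG's or PyFunceble's code:

```python
import re
from typing import Iterable, List

# The whole line must be '||<host>^$third-party', nothing before or after.
STRICT_RULE = re.compile(r"^\|\|([a-z0-9.-]+)\^\$third-party$", re.IGNORECASE)


def strict_domains(lines: Iterable[str]) -> List[str]:
    """Collect hosts from exact '||host^$third-party' lines, in order."""
    domains: List[str] = []
    for line in lines:
        match = STRICT_RULE.match(line.strip())
        if match and match.group(1).lower() not in domains:
            domains.append(match.group(1).lower())
    return domains


sample = [
    "! a comment",
    "@@||example.com^$document",
    "example.com##.ad",
    "||something.domain.tld^$third-party",
    "||example.com^$third-party,script",
]
print(strict_domains(sample))
```

Comments, exceptions, element hiding rules and rules with extra options all fail the full match, so no separate ignore list is needed.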

funilrys commented 6 years ago

I'd be interested in your output @jawz101.

In the meantime:

If it starts with ! we already ignore.
If it starts with @ we already ignore.
If it starts with [ we already ignore.
If it starts with / we already ignore.
If href^=unilrys.github.io is present we extract unilrys.github.io for testing.
If it is in the format ||funilrys.github.io$script,image we extract funilrys.github.io for testing.
If it is in the format |github.io| we extract github.io for testing.
If it is in the format ||twitter.com^helloworld.com we extract twitter.com for testing.
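The rules listed above can be sketched roughly as follows. This is a simplified illustration covering only those rules; `decode` and both regular expressions are assumptions and not PyFunceble's actual decoder:

```python
import re
from typing import Optional

# Host inside an href selector, with an optional 'scheme://' prefix skipped.
HREF = re.compile(
    r"""href\^=["']?(?:[a-z][a-z0-9+.-]*://)?([a-z0-9.-]+)""", re.IGNORECASE
)
# Host right after a leading '|' or '||' anchor.
ANCHORED = re.compile(r"^\|{1,2}([a-z0-9.-]+)", re.IGNORECASE)


def decode(line: str) -> Optional[str]:
    """Return the domain to test for one adblock line, or None to ignore it."""
    line = line.strip()
    if not line or line[0] in "!@[/":
        return None
    match = HREF.search(line) or ANCHORED.match(line)
    return match.group(1).lower() if match else None


for rule in ("!funilrys.com##body",
             "||funilrys.github.io$script,image",
             "|github.io|",
             "||twitter.com^helloworld.com",
             "hello.world##ad-selector"):
    print(rule, "->", decode(rule))
```

Element hiding rules such as hello.world##ad-selector match neither pattern and are therefore not extracted.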

I'm aware that it may be aggressive, but I tried hard to comply with https://adblockplus.org/filter-cheatsheet along with the needs of "our field". Indeed, in the case of href^=unilrys.github.io, I consider that writing such a rule implicitly marks the referenced href as a bad one, so we extract and test it.

What's your input on this short statement? :smile_cat:

Have a good night.

Cheers, Nissar

jawz101 commented 6 years ago

I drove to go get some gas and I was still thinking about it and felt I didn't have a complete idea. I was today, coincidentally, trying to do by hand with regex and Notepad++ what we're talking about so I figured I'd chime in :/

... to add to the ||example.com^$third-party I'd also want to block ||example.com^

What you said makes sense but is the goal to validate all domains in ABP rules or to also process them with the end result of a blocklist?

The reason I ask is I wouldn't want to block funilrys.github.io in this example because I wouldn't want to block the whole domain if an ABP rule was just trying to block certain bits of its content.

||funilrys.github.io$script,image

just depends on if the goal is to validate domains or also take them and then, say, make a pi-hole blocklist out of them. As is, it sounds like it would have a lot of false positives if I put the output into a blocklist file.

jawz101 commented 6 years ago

... right now I'm going through https://easylist.to/easylist/easylist.txt by hand and finding examples of the lines I try to exclude and then see what I'm left with...

jawz101 commented 6 years ago

attached is easyprivacy's list (I zipped up the cached files as well in case you want to set it up on a schedule like some others.) ran with PyFunceble --adblock --link https://easylist.to/easylist/easyprivacy.txt

easyprivacy.zip

The EasyList is still on the K's... it's about x10 larger list than EasyPrivacy.

If the output looks good to you, @funilrys I want to send it on to the list maintainers. The EasyList one looks pretty red so I'm curious how it will turn out.

funilrys commented 6 years ago

Thanks @jawz101 will look into that when I have a bit of time.

||example.com^

is already extracted as expected in the test : https://github.com/funilrys/PyFunceble/blob/4c3683225c4d63808a456bb35443c8d0b414ecfd/tests/test_core.py#L361

I get your point. I did not think about that little third-party option. Will implement :+1:

funilrys commented 6 years ago

What about the other options @jawz101 ?

from https://adblockplus.org/filter-cheatsheet#options:

script, ~script | Include or exclude JavaScript files
image, ~image | Include or exclude image files
stylesheet, ~stylesheet | Include or exclude stylesheets (CSS files)
object, ~object | Include or exclude content handled by browser plugins like Flash or Java
object-subrequest, ~object-subrequest | Include or exclude files loaded by browser plugins
subdocument, ~subdocument | Include or exclude pages loaded within pages (frames)

Exceptions
document | Used to whitelist the page itself (e.g. @@||example.com^$document)
elemhide | Used to prevent element rules from applying on a page (e.g. @@||example.com^$elemhide)

Domains
domain= | Specify a list of domains, separated by bar lines (|), on which a filter should be active. A filter may be prevented from being activated on a domain by preceding the domain name with a tilde (~).
third-party, ~third-party | Specify whether a filter should be active on third-party or first domains

Misc
rewrite= | Specify a rewrite rule for the URL to be performed before downloading. If the filter is a regular expression, use $n to insert submatches into the rewritten URL. See JavaScript's own String.prototype.replace().

Is extracting third-party only sufficient?
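To reason about which options matter, it can help to separate a rule's pattern from its options first. The sketch below assumes options follow the first `$` and are comma-separated, which is a simplification (patterns that themselves contain `$` are not handled); `split_rule` is a hypothetical helper:

```python
from typing import List, Tuple


def split_rule(rule: str) -> Tuple[str, List[str], List[str]]:
    """Return (pattern, plain options, entries of the domain= option)."""
    pattern, _, raw_options = rule.strip().partition("$")
    options: List[str] = []
    domains: List[str] = []
    for option in filter(None, raw_options.split(",")):
        if option.startswith("domain="):
            # Entries are separated by '|'; a leading '~' (negation) is kept.
            domains.extend(d for d in option[len("domain="):].split("|") if d)
        else:
            options.append(option)
    return pattern, options, domains


print(split_rule("||example.com^$third-party,script,object"))
print(split_rule("||domain.com^$domain=domain1.com|~domain2.com"))
```

With the options isolated, a decoder can decide per option set whether the pattern's domain is worth extracting.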

jawz101 commented 6 years ago

Well, ||example.com^$third-party and ||example.com^ are what I ended up with I think

As for the rest of them, the advanced syntax comes into play when you have conditions. That is where I see things that would cause breakage.

fancy conditions:

||example.com^$image only block images from example.com

||example.com^$third-party,script,object only block it if it's 3rd party or if it's first party block its scripts and objects. Like, I might still need some of example.com 1st party stuff. In fact, if someone tried to process a uBlock list, gorhill actually made tons of additional things to block

||example.com^...elemhide - make the network connection but just hide the resource (say, you may need to establish a connection to that subdomain to get some parts of the webpage but remove some of the banners and stuff it also wants to show)

||example.com^... domain=somesite.com only block example.com if it's on somesite.com

less false positives:

||example.com^ block example.com. Basically, use ABP rules as if it were a DNS/hostfile-styled blocker

||example.com^$third-party block example.com if it is third party. Even though it is a condition, these rules don't seem to block legit sites you'd visit.

Adding the EasyList output because it finished sometime last night. It's probably more valuable to you than the EasyPrivacy report because it includes a bunch of element hiding junk. Because of this, pfBlocker doesn't actually use the famous EasyList itself in its processing and instead uses its little brother, the "EasyList no elem hiding" list found on this page, since it removes a lot of the fancy conditional stuff and is more suited for strict rules that block the actual connections from occurring.

easylist.zip

One thing I noticed with easylist is "if I was using PyFunceble to validate any domain it found in an ABP+ rule, it would be fine. if I was using this to process a list for use as a blocklist, I'd be screwed."

If you search the list of active hosts for google.com or github.com you will see that it checks those domains because they were somewhere in a rule. If I was to throw this into a blocklist it would not be great.

funilrys commented 5 years ago

Hi @jawz101, I'm writing an improvement, but we have to admit it's impossible to avoid false positives. That's why whitelisting is more important than blocking.

Now we have to decide between two ways:

We extract and test all possible domains. or
We extract all domains which are (or may be) relevant.

For now, I'm implementing the second way but we may think about the other way in the future or as an extra option.

kulfoon commented 4 years ago

funilrys : but we may think about the other way in the future or as an extra option.

Yes, it would be the most useful, especially for adblock filter list maintainers who want to get rid of all dead domains.

funilrys : For now, I'm implementing the second way

The second way is also not bad for a start: it will still extract many domains. Basically I agree with this: https://github.com/funilrys/PyFunceble/issues/13#issuecomment-426394422, but to summarize further, these are the bad domains which can be converted to hosts:

All adblockers:
||domain.com
||domain.com^
||domain.com$third-party
||domain.com^$third-party
the href ones: https://github.com/funilrys/PyFunceble/issues/13#issuecomment-426453334

uBO specific:
||domain.com^$3p (shortened version of the above: $3p = $third-party)
||domain.com^$all (all-in-one combination of all options, excluding $important)
||domain.com^$document
||domain.com^$important (overrides whitelist filters)
and various combinations:
||domain.com^$3p,important
||domain.com^$important,3p
||domain.com^$all,important
||domain.com^$important,all
and variations without ^ as well

AdGuard specific: I don't use AdGuard (it's still a good adblocker, I've just stuck with uBO + ND for years)

Notice: variations without ^ are very rare; they're mostly typos, but they are still valid filters

All other filters having anything in addition to the above should not be extracted, examples:

||twitter.com^helloworld.com from : https://github.com/funilrys/PyFunceble/issues/13#issuecomment-443501130,
||funilrys.github.io$script,image from: https://github.com/funilrys/PyFunceble/issues/13#issuecomment-443475390
||domain.com^$domain=domain1.com|domain2.com
||domain.com^$third-party,image

As they don't block the whole domain (neither twitter.com nor funilrys.github.io nor domain1.com , as they still can be visited), which means I agree with ( https://github.com/funilrys/PyFunceble/issues/13#issuecomment-443475390 ) :

jawz101 : The reason I ask is I wouldn't want to block funilrys.github.io in this example because I wouldn't want to block the whole domain if an ABP rule was just trying to block certain bits of its content.

Of course ||domain.com^$3p / ||domain.com^$third-party can be still visited as well, but they are mostly just ad and tracking servers.
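The selection described in this comment can be sketched as an allow-list of options. The option set is taken from the list above; `hosts_candidate` and `HOSTS_SAFE_OPTIONS` are hypothetical names, and the href rules are left out of this sketch:

```python
import re
from typing import Optional

# Options after which the whole domain is still considered blockable.
HOSTS_SAFE_OPTIONS = {"third-party", "3p", "all", "document", "important"}

# '||host', an optional '^', then optionally '$' followed by options.
RULE = re.compile(r"^\|\|([a-z0-9.-]+)\^?(?:\$(.+))?$", re.IGNORECASE)


def hosts_candidate(rule: str) -> Optional[str]:
    """Return the domain if the rule blocks it (almost) entirely, else None."""
    match = RULE.match(rule.strip())
    if match is None:
        return None
    domain, raw_options = match.groups()
    options = set(raw_options.split(",")) if raw_options else set()
    return domain.lower() if options <= HOSTS_SAFE_OPTIONS else None


for rule in ("||domain.com^$3p,important",
             "||domain.com^$third-party,image",
             "||twitter.com^helloworld.com"):
    print(rule, "->", hosts_candidate(rule))
```

Any option outside the allowed set, including `domain=`, disqualifies the rule, which matches the "anything in addition should not be extracted" guideline.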

funilrys commented 4 years ago

Interesting @kulfoon,

Thanks for your feedback! I still chose to extract twitter.com, funilrys.github.io and domain.com because of the use case described in #42.

I still added your examples to the tests and the current code passes it!

Thanks again for your feedback. Stay safe and healthy!

kulfoon commented 4 years ago

Thanks, you too. However, it's getting more and more confusing.

funilrys: https://github.com/funilrys/PyFunceble/issues/13#issuecomment-444576689

Since you decided to implement the second way, I think you should stop at extracting only the domains which should be completely (or almost completely) blocked, like in my previous comment, which would also cover:

jawz101 : https://github.com/funilrys/PyFunceble/issues/13#issuecomment-443475390 : just depends on if the goal is to validate domains or also take them and then, say, make a pi-hole blocklist out of them.

By extracting anything more, like:

I still chose to extract twitter.com, funilrys.github.io and domain.com because of the use case described in #42. + facebook.com##.search from https://github.com/funilrys/PyFunceble/issues/42#issuecomment-526795225

you will not cover jawz101's use case above, because it causes false positives, just as he said:

jawz101 : https://github.com/funilrys/PyFunceble/issues/13#issuecomment-443475390 : As is, it sounds like it would have a lot of false positives if I put the output into a blocklist file.

Also, you are going beyond what the second way should be, which will end up as neither the first way nor the second way but rather a strange mix of both. So why not implement the first way separately, by simply extracting all domains, to cover all the extraordinary domains, instead of partially extracting extraordinary domains in the second way? And what sense is there in extracting just a part of the extraordinary domains? Alternatively, you could put all of your extraordinary domains behind the --aggressive switch https://github.com/funilrys/PyFunceble/issues/42#issuecomment-526810118 .

funilrys : I still added your examples to the tests and the current code passes it!

I appreciate, but perhaps no need to add at least these: https://github.com/funilrys/PyFunceble/blob/083a4ccc601cb05799a85676901c5e2b4d8a4249/tests/test_converter_adblock.py#L120 https://github.com/funilrys/PyFunceble/blob/083a4ccc601cb05799a85676901c5e2b4d8a4249/tests/test_converter_adblock.py#L122

because such (or at least similar) examples are already present in the tests: https://github.com/funilrys/PyFunceble/blob/083a4ccc601cb05799a85676901c5e2b4d8a4249/tests/test_converter_adblock.py#L89 https://github.com/funilrys/PyFunceble/blob/083a4ccc601cb05799a85676901c5e2b4d8a4249/tests/test_converter_adblock.py#L118

Greets.

keczuppp commented 3 years ago

1

More failures:

| Test filter | Extraction result |
|---|---|
| `\|\|site1.com` | site1.com |
| `\|\|site2.com^` | site2.com |
| `\|\|site3.com$` | site3.com |
| `\|\|site4.com/` | site4.com |
| `\|\|site5.com*` | `site5.com*` failure (artefact) |
| `\|\|site6.com$third-party` | site6.com |
| `\|\|site7.com^$third-party` | site7.com |
| `\|\|site8.com^$3p` | failure |
| `\|\|site9.com^$all` | site9.com |
| `\|\|site10.com^$document` | site10.com |
| `\|\|site11.com^$important` | failure |
| `\|\|site12.com^$3p,important` | site12.com |
| `\|\|site13.com^$important,3p` | failure |
| `\|\|site14.com^$all,important` | site14.com |
| `\|\|site15.com^$important,all` | site15.com |
| `\|\|site16.com^$doc` | failure |
| `\|\|site17.com^$document` | site17.com |
| `\|\|site18.com^$domain=site19.com` | site18.com, site19.com |
| `^adv^$domain=site20.com` | failure |
| `adv$domain=site21.com` | failure |
| `adv^$domain=site22.com` | failure |

As for the last 3 failures, many of such failures can be found in https://easylist-downloads.adblockplus.org/easylistpolish.txt The list contains about 2961 domains, but only 2459 are found by Adblock Decoder (with --aggressive option), which gives 83% efficiency.
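The efficiency figure follows directly from the two counts quoted above. As a small worked check (the helper name `extraction_rate` is hypothetical):

```python
def extraction_rate(found: int, total: int) -> float:
    """Share of the domains in a list that the decoder extracted."""
    if total <= 0:
        raise ValueError("total must be positive")
    return found / total


rate = extraction_rate(2459, 2961)
# prints "83% extracted, 502 domains missed"
print(f"{rate:.0%} extracted, {2961 - 2459} domains missed")
```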

failures

2

funilrys : https://github.com/funilrys/PyFunceble/issues/13#issuecomment-623357709 : I still chose to extract twitter.com, funilrys.github.io and domain.com because of the use case described in https://github.com/funilrys/PyFunceble/issues/42#issuecomment-526795225.

funilrys : https://github.com/funilrys/PyFunceble/issues/13#issuecomment-444576689 Now we have to decide between 2 way:

We extract and test all possible domains. or

We extract all domains which are (or may be) relevant.

Could you clarify both ways more specifically? As far as I can see, almost all domains are relevant from the point of view of an adblock filter list as described in https://github.com/funilrys/PyFunceble/issues/42#issuecomment-526795225, so almost all domains should be extracted from all filters: almost any filter tied to a dead domain should be removed from an adblocker list (or, if the domain is tied to domain=, then the domain should be removed). I mean, the case mentioned in https://github.com/funilrys/PyFunceble/issues/42#issuecomment-526795225 applies to almost all domains and filters...
So it would be best to limit the 1st way of extraction to domains related to / useful for hosts files only, and the second way should extract all domains from all filters that would be related to / useful for adblocker lists, in order to clean the lists. Currently I see a weird mix of both...

funilrys commented 3 years ago

I don't have time yet, @keczuppp, but let me reopen this so that I can answer you when I get a bit of time.

spirillen commented 3 years ago

For others who stumble on this thread and wonder how they can solve this, there is a usage example here: https://github.com/funilrys/PyFunceble/discussions/219

keczuppp commented 3 years ago

I've edited my comment and added more failures.

spirillen commented 3 years ago

Hmm what did you change @keczuppp ??

keczuppp commented 3 years ago

I didn't delete the history of changes, doesn't it work for you?

spoiler

![1](https://user-images.githubusercontent.com/74409207/110595859-e0a98300-817e-11eb-9a60-e52ed5f6fab5.png)

Anyway, I added last 4 rows of the table and the description + spoiler.

spirillen commented 3 years ago

OT response to keczuppp

> I didn't delete the history of changes, doesn't it work for you?

Yes it does, however it is not obvious what was changed :wink: GH could do this better, or just do as they always have done, steal others' ideas :smirk: As an example from ![image](https://user-images.githubusercontent.com/44526987/110630241-6c360a80-81a5-11eb-9abb-86341f10dec0.png) As you can see, it highlights the changes (red for deleted)

None the less, thanks for your reply

funilrys / PyFunceble

Adblock decoder ignore some portion when decoding #13
