TechnikEmpire / DistillNET

DistillNET is a library for matching and filtering HTTP requests and HTML response content using the Adblock Plus Filter format.
Mozilla Public License 2.0

Filtering page resources #21

Closed igvk closed 5 years ago

igvk commented 5 years ago

How should I use the library with the following filter (taken from adblockplus.org filters explained)?

||ads.example.com^$domain=example.com|~foo.example.com

The page that is loaded is http://example.com or http://subdomain.example.com. The banner url is, for example, http://ads.example.com/foo.gif.

I try to do something like this (abridged):

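// request.Url is the resource being loaded, e.g. http://ads.example.com/foo.gif
if (!Uri.TryCreate(request.Url, UriKind.Absolute, out Uri url))
    return false; // not an absolute URL, nothing to match against

// Look up the filters by the host of the page, e.g. example.com
var filters = allFilters.GetFiltersForDomain(pageUrl.Host);
foreach (var filter in filters)
{
    if (filter.IsMatch(url, headers))
        return true; // match
}
return false; // no match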

But it seems that this isn't how it's intended to be. Am I wrong?

TechnikEmpire commented 5 years ago

At a quick glance I don't see why that wouldn't work. The double pipe is a domain anchor, which means that exact subdomain must be the full host of the URL for the rest of the rule to work.

Obviously we can't test this on example.com because it's not real. If you find a real example that doesn't seem to work right, please let me know. Maybe I misunderstood the question.

TechnikEmpire commented 5 years ago

The adblock plus syntax can be quite confusing. It took me a very long time of study, trial and error before I was able to write this library. Basically, the rule you've given as an example is like this:

|| <---- This means that the domain name that follows must be the EXACT value of the uri.Host property.

So, uri.Host in this case can only possibly be ads.example.com. This rule will not match any other domain.

The next aspect of your rule is ^. This means that there MUST be a separator. If the whole URI is only http://ads.example.com, this rule will not match. There must be something more, a valid separator character, so http://ads.example.com/ WILL match this rule.

The next portion is $domain=, so this rule will be stored for example.com. When you visit example.com and call GetFiltersForDomain("example.com"), this rule will be loaded. The domain option limits the rule so that it only activates on a specific domain.

The tilde ~ operator on most options INVERTS the function of the rule. If I recall correctly, ads.example.com will be matched if you load the rule with GetWhitelistFiltersForDomain("foo.example.com") and then match against it. But in this case, it's an exception, so you don't want to block that resource.
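
To make the breakdown above concrete, here is a minimal sketch that splits the example rule into the pieces just described: the || anchor, the ^ separator, and the included and excluded entries of the domain option. It is hand-rolled for illustration and is not DistillNET's parser; RuleParts and RuleAnatomy are hypothetical names, and the sketch assumes the option list starts at the last $ in the rule.

```csharp
using System;
using System.Collections.Generic;

// Illustrative only: a hand-rolled breakdown of a rule, NOT DistillNET's parser.
// RuleParts and RuleAnatomy are hypothetical names used for this sketch.
public sealed class RuleParts
{
    public bool AnchoredToHost;                          // rule started with "||"
    public bool RequiresSeparator;                       // pattern ended with "^"
    public string Pattern = "";                          // pattern without "||" and trailing "^"
    public List<string> AppliesOn = new List<string>();  // plain domain= entries
    public List<string> ExceptOn = new List<string>();   // domain= entries prefixed with "~"
}

public static class RuleAnatomy
{
    public static RuleParts Split(string rule)
    {
        var parts = new RuleParts();

        // Assumption: everything after the last '$' is the option list.
        int dollar = rule.LastIndexOf('$');
        string pattern = dollar < 0 ? rule : rule.Substring(0, dollar);
        string options = dollar < 0 ? "" : rule.Substring(dollar + 1);

        if (pattern.StartsWith("||", StringComparison.Ordinal))
        {
            parts.AnchoredToHost = true;
            pattern = pattern.Substring(2);
        }
        if (pattern.EndsWith("^", StringComparison.Ordinal))
        {
            parts.RequiresSeparator = true;
            pattern = pattern.Substring(0, pattern.Length - 1);
        }
        parts.Pattern = pattern;

        // Options are comma separated; domain= holds '|' separated entries.
        foreach (string option in options.Split(','))
        {
            if (!option.StartsWith("domain=", StringComparison.Ordinal))
                continue;

            foreach (string entry in option.Substring("domain=".Length).Split('|'))
            {
                if (entry.Length == 0) continue;
                if (entry[0] == '~') parts.ExceptOn.Add(entry.Substring(1));
                else parts.AppliesOn.Add(entry);
            }
        }
        return parts;
    }
}
```

For the rule in question, Split yields the pattern ads.example.com with both the anchor and separator flags set, example.com under AppliesOn, and foo.example.com under ExceptOn.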

I know, it can get really complex. Let me know if you have any further questions.

igvk commented 5 years ago

I tried to find some real examples from the current list.

Here is the whitelisting one:

@@||www.google.com/adsense/search/async-ads.js$domain=webcrawler.com

If I call GetWhitelistFiltersForDomain("www.google.com"), I still get this filter in the enumeration, although it is definitely relevant only for the domain webcrawler.com.

The same is true for the filter ||google.com/jsapi?autoload=*%22ads%22$script,domain=youtube.com and the call to GetFiltersForDomain("www.google.com").
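
The behaviour I expected can be sketched as a small check on the value of the domain option. This is only my reading of the Adblock Plus documentation, not code from DistillNET: DomainOption and AppliesToPage are hypothetical names, a plain entry is taken to cover the domain and its subdomains, and as a simplification a matching ~ entry always wins.

```csharp
using System;

// Illustrative only: expected semantics of "$domain=", NOT DistillNET code.
// DomainOption and AppliesToPage are hypothetical names used for this sketch.
public static class DomainOption
{
    // True when host equals domain or is a subdomain of it.
    private static bool IsSameOrSubdomain(string host, string domain)
    {
        return host.Equals(domain, StringComparison.OrdinalIgnoreCase)
            || host.EndsWith("." + domain, StringComparison.OrdinalIgnoreCase);
    }

    // optionValue is the text after "domain=", e.g. "example.com|~foo.example.com".
    // pageHost is the host of the page that requested the resource.
    public static bool AppliesToPage(string optionValue, string pageHost)
    {
        bool hasIncludes = false;
        bool included = false;

        foreach (string entry in optionValue.Split('|'))
        {
            if (entry.Length == 0) continue;

            if (entry[0] == '~')
            {
                // Simplification for this sketch: a matching exclusion always wins.
                if (IsSameOrSubdomain(pageHost, entry.Substring(1)))
                    return false;
            }
            else
            {
                hasIncludes = true;
                if (IsSameOrSubdomain(pageHost, entry))
                    included = true;
            }
        }

        // With only "~" entries, the rule applies on every other page.
        return !hasIncludes || included;
    }
}
```

Under these assumptions, a rule with domain=webcrawler.com or domain=youtube.com does not apply to a page on www.google.com, which is why I did not expect either filter to be returned for that host.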

TechnikEmpire commented 5 years ago

Ah ok. I think some of my logic got a little screwy when I added in referrers. I'll look at this.

Note that I don't completely conform to, or function like, Adblock Plus. For example, you should strip the leading www. off of domains when making rules. It's superfluous, so I intentionally coded the library to ignore it. My primary goal was speed, so adding code to work around such things was against the nature of the library.
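
As a rough sketch of that normalization on the caller's side, a helper like the one below could be run over host names before they go into rules. HostNormalizer and StripLeadingWww are hypothetical names, not part of DistillNET, and whether lookups need the same treatment is not settled here.

```csharp
using System;

// Illustrative only: caller-side normalization, NOT part of DistillNET.
// HostNormalizer and StripLeadingWww are hypothetical names used for this sketch.
public static class HostNormalizer
{
    // Removes a single leading "www." (case-insensitive) and leaves other hosts untouched.
    public static string StripLeadingWww(string host)
    {
        const string prefix = "www.";
        return host.StartsWith(prefix, StringComparison.OrdinalIgnoreCase)
            ? host.Substring(prefix.Length)
            : host;
    }
}
```

With this helper, www.google.com becomes google.com, while a host such as wwwexample.com is left unchanged because the prefix must include the dot.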

igvk commented 5 years ago

Well, I tried to give your library the original easylist (except for comment lines, which it doesn't understand). Interesting that it mostly works in the case of www.google.com. I didn't know about the incompatibilities, although I did see the www prefix being ignored in the matching code. Though I suppose the main slow part is lines like these in the calling code:

foreach (var filter in filters)
{
    if (filter.IsMatch(url, headers))
        return true; // match
}

TechnikEmpire commented 5 years ago

Look at the benchmarks in the wiki. This library is incredibly fast specifically because my priority was speed. I should document stripping the www. stuff though for sure.

igvk commented 5 years ago

Just to be clear - I am not doubting the speed of the library itself, that's great that it's fast. I was only commenting on the way that I use foreach in my code to check all the global filters for a given resource url. Looking forward to you fixing the issue with the domains. Thanks for the great work!