TechnikEmpire / DistillNET

DistillNET is a library for matching and filtering HTTP requests and HTML response content using the Adblock Plus Filter format.
Mozilla Public License 2.0
16 stars 4 forks source link

Every tested domain is always blocked #20

Closed Sergio0694 closed 5 years ago

Sergio0694 commented 5 years ago

Hello, I'm trying to use this lib to filter HTTP requests in my application, but for some reason I can't get this work as expected, as virtually any Uri I try to pass results in at least a rule blocking it. Here's the repro code I'm using:

using (var client = new HttpClient())
{
    // Get the easylist rules
    var easyList = await client.GetStringAsync("https://easylist.to/easylist/easylist.txt");
    var lines = easyList.Split('\n');
    var rules = lines.SkipWhile(line => line.StartsWith('[') || line.StartsWith('!')).ToArray();

    // Get the URL filters
    var parser = new AbpFormatRuleParser();
    var filters = rules.Select(rule => parser.ParseAbpFormattedRule(rule, 1));
    var urlFilters = filters.OfType<UrlFilter>().ToArray();

    // Test a domain
    var uri = new Uri("https://www.repubblica.it");
    var headers = new NameValueCollection(StringComparer.InvariantCultureIgnoreCase)
    {
        ["Accept"] = "text/html, application/xhtml+xml, application/xml; q=0.9, */*; q=0.8",
        ["User-Agent"] = "Mozilla/5.0 (Windows NT 10.0; WOW64; WebView/3.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/18.17763",
        ["Upgrade-Insecure-Requests"] = "1",
        ["Accept-Language"] = "it-IT, it; q=0.8, en-US; q=0.5, en; q=0.3"
    };
    if (urlFilters.FirstOrDefault(filter => filter.IsMatch(uri, headers)) is UrlFilter blocker)
    {
        Console.WriteLine($"BLOCKED: {blocker.OriginalRule}");
    }
    else Console.WriteLine("OK");
}

Running this script results in the domain being blocked by this rule:

$webrtc,domain=hdmoza.com

Now, that rule shouldn't actually match with the given URL in this case.

EDIT: noticed that this rule in particular was always returning true, because the Parts property was empty. Tried filtering out all the rules with that property being empty, but the domain was still always blocked because of some other rules always matching (incorrectly).

EDIT 2: also tried using the easylist.txt file embedded in this repo (thinking maybe the remote one I was using had been updated with some new, unsupported rules), but the issue is still there as well.

Am I missing something here? I might as well be making a very obvious mistake, but I really can't figure out what am I doing wrong.

Thank you for your help, keep up the good work! 😄 Sergio

TechnikEmpire commented 5 years ago

webrtc is unsupported. Most things that require a browser context are. Thanks for reporting though because this should not be matching. I'll look into it. Any other examples that don't use unsupported rules?

Sergio0694 commented 5 years ago

Hi @TechnikEmpire - thanks for the quick reply! I've made a quick test with the online easylist.txt file and the same URL and headers in the code I posted here, here are all the rules that (incorrectly) match with the URL:

$webrtc,domain=hdmoza.com
$webrtc,domain=itdmusic.com|itdmusic.in
$media,domain=damimage.com|imagedecode.com|imageteam.org
$websocket,domain=povw1deo.com|povwideo.net|powvideo.net
$webrtc,domain=uptobox.com
$websocket,domain=flash-x.tv|flashsx.tv|flashx.bz|flashx.cc|flashx.co|flashx.me|flashx.run|flashx.sx|flashx.to|flashx.tv|flashx.ws|flashx1.tv|flashxx.tv
$webrtc,domain=m4ufree.com|m4ufree.tv
$webrtc,domain=101greatgoals.com|123movies.net|4chan.org|9anime.to|9anime.vip|ack.net|allkpop.com|allthetests.com|alltube.tv|audioholics.com|barnstablepatriot.com|bibme.org|blacklistednews.com|boards2go.com|bolde.com|britannica.com|btdb.to|businessnewsdaily.com|buzzfil.net|cantonrep.com|capecodtimes.com|champion.gg|cheeseheadtv.com|clicknupload.link|clicknupload.org|closerweekly.com|collegehumor.com|colourlovers.com|columbiatribune.com|connectedly.com|convertcase.net|crackberry.com|csgolounge.com|ddlvalley.me|destructoid.com|diffen.com|dispatch.com|dorkly.com|dota2lounge.com|dwatchseries.to|enterprisenews.com|eztv.ag|eztv.io|fastplay.to|fayobserver.com|fbstreams.me|fhm.com|firstforwomen.com|flexonline.com|freewarefiles.com|gastongazette.com|geekzone.co.nz|genfb.com|ghacks.net|go4up.com|goerie.com|gorillavid.in|gounlimited.to|goupstate.com|grammarist.com|gsmarena.com|hdfilme.tv|hdvid.life|hdvid.tv|hdvid.xyz|healthline.com|heraldtribune.com|houmatoday.com|icefilms.info|igg-games.com|intouchweekly.com|investopedia.com|j-14.com|jpost.com|kingvid.tv|kiplinger.com|kshowonline.com|laptopmag.com|lifeandstylemag.com|lolcounter.com|m-magazine.com|mac-torrents.com|madamenoire.com|maketecheasier.com|megaup.net|mensfitness.com|merriam-webster.com|metrowestdailynews.com|mlbstream.me|movies123.xyz|muscleandfitness.com|myfeed4u.me|myfeed4u.net|nbastreams.me|netdna-storage.com|newpct.com|news-journalonline.com|newsarama.com|nflstreams.me|nhlstreams.me|nintendoeverything.com|nowvideo.club|nowvideo.pw|nwfdailynews.com|nydailynews.com|oloadcdn.net|onvid.club|onvid.xyz|openload.co|openload.pw|otakustream.tv|ouo.io|ouo.press|ourl.io|pelispedia.tv|phonearena.com|pjstar.com|playbill.com|probuilds.net|providencejournal.com|radaronline.com|rapidvideo.com|readmanga.today|recordonline.com|reservedoffers.club|salefiles.com|sj-r.com|skidrowcrack.com|skidrowreloaded.com|soapoperadigest.com|solomid.net|sourceforge.net|space.com|spanishdict.com|sportshd.me|streamfilmzzz.com|streamzzz.online|teamliquid.net|telegram.com|teslacentral.com|the123movies.org|the4thofficial.net|theberry.com|thechive.com|thepoliticalinsider.com|thevideobee.to|tmn.today|tomsguide.com|tomshardware.co.uk|tomshardware.com|topix.com|torrentfunk.com|uploading.site|uptobox.com|uticaod.com|vidtodo.com|vidzi.online|vidzi.si|vidzi.tv|vrheads.com|vvdailypress.com|watchtvserieslive.org|wikia.com|womansworld.com|xda-developers.com|xdrive.cc|yts.am|yts.gs
$websocket,domain=123movies-proxy.ru|123movies.cz|123movies.gs|123movies.is|123movies.live|123movies.net|123movies.net.ru|123movies.ru|123movies.vc|123movieshd.net|123movieshd.to|1337x.st|1337x.to|2giga.link|4archive.org|4chan.org|9anime.to|9anime.vip|allthetests.com|alltube.tv|batmanstream.com|bibme.org|blacklistednews.com|boards2go.com|boreburn.com|breakingisraelnews.com|btdb.in|celebdirtylaundry.com|celebritymozo.com|closerweekly.com|collectivelyconscious.net|colourlovers.com|connectedly.com|convertcase.net|couch-tuner.me|couchtuner.ac|couchtuner.us|crackberry.com|daclips.in|dailycaller.com|demonoid.pw|destructoid.com|diffen.com|divxatope1.com|dreamfilm.se|dumpaday.com|dwatchseries.to|episodetube.com|episodetube.net|fastpic.ru|fileone.tv|filme-streamz.com|filmlinks4u.is|firstforwomen.com|firstrowau.eu|firstrowus1.eu|flash-x.tv|flashsx.tv|flashx.co|flashx.me|flashx.run|flashx.tv|flashx1.tv|flashxx.tv|fmovies.to|fmovies.world|free-torrent.org|free-torrent.pw|free-torrents.org|free-torrents.pw|freewarefiles.com|gamenguide.com|genfb.com|gofirstrow.eu|gomovies.sc|gorillavid.in|gsmarena.com|hdvid.life|hdvid.tv|hdvid.xyz|health-weekly.net|homerun.re|i4u.com|ifirstrow.eu|ifirstrowit.eu|imagefap.com|instanonymous.com|investopedia.com|itechpost.com|izismile.com|jewsnews.co.il|keepvid.com|kino-streamz.com|kiplinger.com|kshowonline.com|lifehacklane.com|livecricketz.org|livescience.com|lolcounter.com|megaup.net|merriam-webster.com|mobilenapps.com|mobipicker.com|movies123.xyz|movies4stream.com|movpod.in|myfeed4u.me|myfeed4u.net|mylivecricket.org|natureworldnews.com|navbug.com|ncscooper.com|newpct.com|newsarama.com|newseveryday.com|newtvworld.com|nowfeed2all.eu|nowvideo.club|nowvideo.pw|okceleb.com|oloadcdn.net|olympicstreams.me|omgwhut.com|onvid.club|onvid.xyz|onwatchseries.to|openload.co|openload.pw|opensubtitles.org|otakustream.tv|parentherald.com|pcgamer.com|pcgames-download.net|playbill.com|pocketnow.com|pornhub.com|postimg.org|powvideo.net|pwinsider.com|qaafa.com|rapidvideo.com|reservedoffers.club|rinf.com|roadracerunner.com|salefiles.com|scienceworldreport.com|sgvideos.net|shorte.st|skidrowcrack.com|snoopfeed.com|stream-tv-series.net|stream-tv2.to|stream2watch.cc|streamazone.com|streamgaroo.com|strikeout.co|strikeout.me|strikeout.mobi|teamliquid.net|technobuffalo.com|the123movies.org|the4thofficial.net|thefreethoughtproject.com|thevideo.me|thinkinghumanity.com|todayshealth.buzz|tomsguide.com|tomshardware.co.uk|tomshardware.com|tomsitpro.com|toptenz.net|torrentz2.eu|tribune.com.pk|trifind.com|tune.pk|tv-series.me|uberhavoc.com|universityherald.com|vcpost.com|vidmax.com|vidoza.net|vidtodo.com|vidzi.online|vidzi.si|vidzi.tv|viewmixed.com|viid.me|vipbox.bz|vipbox.is|vipbox.sx|vipbox.tv|vipboxeu.co|vipboxoc.co|vipboxtv.me|vipleague.ch|vipleague.co|vipleague.is|vipleague.me|vipleague.mobi|vipleague.se|vipleague.sx|vipleague.tv|vipleague.ws|vipstand.is|viralands.com|vrheads.com|watchepisodes-tv.com|watchseries.li|webfirstrow.eu|wholecloud.net|whydontyoutrythis.com|wrestlinginc.com|wrestlingnews.co|x1337x.eu|x1337x.se|x1337x.ws|xda-developers.com|xilfy.com|yourtailorednews.com|yourtango.com
$csp=script-src 'self' * 'unsafe-inline',domain=fflares.com|fileflares.com|ibit.to|piratbaypirate.link|unblocktheship.org|noobnoob.rocks|indiaproxydl.org|magnetbay.eu|airproxyproxy.pw|thepirate.xyz|pietpiraat.org|ahoypirate.in|tpb.tw|proxyindia.net|thepiratebay.blue|ahoypiratebaai.eu|pirate.bet|airproxytpb.red|ikwildepiratebay.xyz|piratebay.tel|bayception.pw|piratebay.town|superbay.link|thepiratebay.kiwi|tpb.one|baypirateproxy.pw|rarbgmirrored.org|rarbgmirror.org|rarbg.to|rarbgaccess.org|rarbgmirror.com|rarbgmirror.xyz|rarbgproxy.org|rarbgprx.org|mrunlock.pro|downloadpirate.com|prox4you.xyz|123unblock.info|nocensor.icu|unlockproject.live|pirateproxy.bet|thepiratebay.vip|theproxybay.net|thepiratebay.tips|thepiratebay10.org|prox1.info|kickass.vip|torrent9.uno|torrentsearchweb.ws|pirateproxy.app|ukpass.co|theproxybay.net|thepiratebay.tips|prox.icu|proxybay.ga|pirateproxy.life|piratebae.co.uk|berhampore-gateway.ml|ikwilthepiratebay.org|thepiratebay10.org|bayfortaiwan.online|unblockthe.net|cruzing.xyz
$csp=worker-src 'none',domain=estream.to|flashx.cc|flashx.co|flashx.co|streamango.com|vidoza.net|vidto.me|vidto.se|vidtudu.com
|http://$object,domain=pornhub.com|redtube.com|youporn.com
|https://$object,domain=pornhub.com|redtube.com|youporn.com
$websocket,domain=24video.me|pornhub.com|redtube.com|redtube.com.br|tube8.com|tube8.es|tube8.fr|xhamster.com|xtube.com|xvideos.com|youporn.com|youporngay.com
.win^$other,domain=pornhub.com|redtube.com|redtube.com.br|tube8.com|tube8.es|tube8.fr|xtube.com|youporn.com|youporngay.com

Hope this helps! 😊

EDIT: as a temporary workaround, I came up with this code after loading the rules:

IReadOnlyList<string> exclusions = new[]
{
    "$webrtc",
    "$media",
    "$websocket",
    "$csp",
    "|http://$object",
    "|https://$object",
    ".win^$other"
};

var validRules = (
    from rule in urlFilters
    where !exclusions.Any(start => rule.OriginalRule.StartsWith(start)) &&
            rule.Parts.Count > 0
    select rule).ToArray();

This seems to be working for now, but of course it's not unstable, as there might very well be other unsupported/glitched rules I haven't noticed for now, that could cause issues with other websites.

TechnikEmpire commented 5 years ago

Yeah so there's probably code in the base rule that returns true by default if no nested rules or portions of the rule return false. Pretty sure my overall logic works on negation first, and defaults to a positive in the absence of anything else. It's been a while since I looked at it but when I get some time, I will address it. The code itself is not overly complex if you want to push ahead in the meantime based on this info.

TechnikEmpire commented 5 years ago

I also know that part of my intention with this project was maximum speed, so preprocessing data to ensure that you are feeding valid data was sort of intended to be outside of the scope of the project. I wanted this to be stupid fast, and pausing execution look for errors in the input is considered generally out of scope. But I am concerned with the example of $webrtc,domain=hdmoza.com because, despite webrtc being an unsupported option, the rule is still valid and should only produce a match on hdmoza.com.

Sergio0694 commented 5 years ago

Yeah, the IsMatch method works on negation first as you said. The issue here is not so related with preprocessing or checking for errors though, it just looks like those rules in particular are not being matched correctly. That is, all those in the exclusions list mentioned above. And as you said, since those are valid rules, they should be handled correctly by the lib.

This workaround I'm using seems to work fine for now though, so feel free to look into it when you have some time, there's no rush 😊 The rest of the code works great, so again, props for the good work!

Sergio0694 commented 5 years ago

Also found another issue: this URL https://vsco.co/goodvibes-vsco/media/5c2fb09c8e529d3e52b4231b?share=MTU0NjYyOTI4MA%3D%3D Is blocked by this rule %3D$popup,domain=nme.com It looks like the library in general isn't properly handled rules that aree restricted to specific domains 🤔

igvk commented 5 years ago

Here is a quick fix for the rules that contain domain: First of all, you need to know the domain of the browser page/frame (it is different from the host part of the url). So, I changed the IsMatch method parameters to include the domain: public bool IsMatch(Uri uri, string domain, NameValueCollection rawHeaders)

The part of code that handles this domain is the following:

            if (domain != null && (ApplicableDomains.Count > 0 || ExceptionDomains.Count > 0))
            {
                if (domain.StartsWithQuick("www."))
                    domain = domain.Substring(4);
                if (ApplicableDomains.Count > 0 && !ApplicableDomains.Contains(domain))
                    return false;
                if (ExceptionDomains.Count > 0 && ExceptionDomains.Contains(domain))
                    return false;
            }

There can still be things that need to be fixed, I will try to discover them in my testing.

If you wish, I may make a PR with the fixes.

TechnikEmpire commented 5 years ago

Ok guys you've got my attention, I'm going to look at this. I'm really surprised if this is a legit issue because this package has shipped in commercial software and I didn't hear any complaint. However, those people weren't the most capable either so, will let you know. Also any PR's are definitely welcome.

TechnikEmpire commented 5 years ago

Waaaaaaw. I found the issue. So a client requested a feature that we be able to match against the referer field. When I made that change, I changed the domain checks to referer checks with the intention of duplicating those checks and making both the referer and domain checks happen. Unfortunately I forgot to put the domain checks back. I'm pushing the fix and releasing a new package. Sorry about that guys.

What would be really neat is to add a bunch of unit tests that check different kinds of rules to ensure conformance. If any of you get really bored, maybe a PR for unit tests would be really awesome!

Thanks again for reporting.

TechnikEmpire commented 5 years ago

Fixed by 4e018c3

TechnikEmpire commented 5 years ago

It looks like you deleted a comment. Domains are now handled in the main filter class so should be all good. If you find something new please do open a new issue.

Sergio0694 commented 5 years ago

Hey @TechnikEmpire - thank you for the update, and I'm glad I actually helped you find an issue in the library! I've updated to the newest package and removed my workaround, everything seems to be working fine now! 😊

P.S. I have not deleted any comments, not sure what you meant by that 🤔

TechnikEmpire commented 5 years ago

@Sergio0694 Someone else commented and it only showed up in my email. Anyway thanks again for your help.