epitron / mitm-adblock

A fast adblocking proxy server (which works on HTTPS connections)

Identifying ad elements on page via selectors + selenium #12

Open DGaffney opened 2 years ago

DGaffney commented 2 years ago

I know that this codebase blocks at the network level, using regexes to filter out URLs associated with ad serving. Is there anything, in this codebase or any other, that shows how to identify the elements on a rendered page that would need to be removed in order to drop all ads from the page's DOM?

Bass-03 commented 2 years ago

That is very interesting. Adblockers that are browser extensions do that, they hide elements based on selectors.

Doing it at the proxy level might not be very effective, because ads are injected by scripts. However, some websites do have the placeholders already in the HTML. You might need to either test for every selector on lists like EasyList (https://easylist.to/), or build your own custom list and curate it.

Is this helpful? What would you like to do?

DGaffney commented 2 years ago

Super helpful - after doing more digging, I think EasyList does contain the information I'm searching for. Rules like bing.com##.productAd are the syntax for these. If I'm correct, this means I need to look for elements on bing.com that match the class selector .productAd, right? Are you aware of any parser that converts the EasyList syntax into something else, or of more thorough directions for parsing it?
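To make that reading of the rule concrete, here is a minimal sketch that splits an element-hiding rule of the form `domains##selector` into its parts. It is an illustration only, written against the basic syntax described in the thread, not a full EasyList parser: `HidingRule`, `parse_hiding_rule` and `applies_to` are hypothetical names, and exception rules (`#@#`), extended selectors (`#?#`) and all URL-blocking rules are simply skipped.

```python
from typing import NamedTuple, Optional, Tuple


class HidingRule(NamedTuple):
    include_domains: Tuple[str, ...]  # empty means "applies on every site"
    exclude_domains: Tuple[str, ...]  # domains written with a leading "~"
    selector: str                     # CSS selector of the elements to hide


def parse_hiding_rule(line: str) -> Optional[HidingRule]:
    """Parse 'bing.com##.productAd' style rules; return None for anything else."""
    line = line.strip()
    # Skip blanks, comments ("!") and the "[Adblock Plus 2.0]" style header.
    if not line or line.startswith(("!", "[")):
        return None
    domains, sep, selector = line.partition("##")
    if not sep or not selector:
        return None  # not a plain element-hiding rule
    include, exclude = [], []
    for domain in filter(None, (d.strip() for d in domains.split(","))):
        if domain.startswith("~"):
            exclude.append(domain[1:])
        else:
            include.append(domain)
    return HidingRule(tuple(include), tuple(exclude), selector)


def _host_matches(host: str, domain: str) -> bool:
    # A rule for bing.com also covers subdomains such as www.bing.com.
    return host == domain or host.endswith("." + domain)


def applies_to(rule: HidingRule, host: str) -> bool:
    """Return True if the rule should be considered for pages on this host."""
    if any(_host_matches(host, d) for d in rule.exclude_domains):
        return False
    if not rule.include_domains:
        return True
    return any(_host_matches(host, d) for d in rule.include_domains)
```

With this, `parse_hiding_rule("bing.com##.productAd")` yields the domain `bing.com` and the selector `.productAd`, and `applies_to` can be used to discard rules for other sites before any selector is tested against a page.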

Bass-03 commented 2 years ago

I am so into this! You can learn about the syntax here: https://adblockplus.org/filter-cheatsheet

And there is a Python package to parse it: https://github.com/adblockplus/python-abp. Let me know if you want any more help. My email is listed on my profile, and I have some knowledge about this adblocking stuff, so we can chat about that.

epitron commented 2 years ago

This would be great to add!

You're right, at the proxy level, you wouldn't be able to match the JS-created elements, but it's probably worth the effort for rules that do match (of which I assume there still are some). The biggest problem is sites which use JavaScript to render the entire page, and emit essentially no HTML. They're relatively common these days.

Is Beautiful Soup fast enough to do real-time HTML transforms? (I'm not really a Python person.) A streaming XML processor probably isn't necessary, since most pages' HTML is pretty tiny (10k-200k?). I guess, even if it's a bit slow, it'll still be faster than loading all the ads!
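For a sense of what a proxy-side transform could look like, here is a minimal sketch that removes elements by class name from an HTML string. It deliberately swaps in the standard library's `html.parser` instead of Beautiful Soup so that it has no dependencies, and it makes no claim about speed. `ClassStripper` and `strip_classes` are hypothetical names, only plain class selectors such as `.productAd` are handled, and the depth counting assumes reasonably well-formed markup inside the removed element.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ClassStripper(HTMLParser):
    """Re-emit HTML while dropping every element that carries a blocked class."""

    def __init__(self, blocked_classes):
        super().__init__(convert_charrefs=False)
        self.blocked = set(blocked_classes)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a removed element

    def _is_blocked(self, attrs):
        classes = (dict(attrs).get("class") or "").split()
        return bool(self.blocked.intersection(classes))

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            if tag not in VOID_TAGS:
                self.skip_depth += 1
        elif self._is_blocked(attrs):
            if tag not in VOID_TAGS:
                self.skip_depth = 1
        else:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags like <img ... /> never open a nested scope.
        if not self.skip_depth and not self._is_blocked(attrs):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag not in VOID_TAGS:
                self.skip_depth -= 1
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append(f"&#{name};")

    def handle_comment(self, data):
        if not self.skip_depth:
            self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def strip_classes(html, blocked_classes):
    parser = ClassStripper(blocked_classes)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Calling `strip_classes(page_html, {"productAd"})` returns the page with those elements and their children removed. As noted above, this only helps for ad placeholders that are already present in the served HTML.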

Unrelated (kinda): this codebase really needs a rewrite. It's an afternoon hack from 8 years ago, and it would be a lot nicer as a real Unix-style command-line tool (with a --help screen and useful options and whatnot). It should also be using the Brave adblock engine's Python module, which is probably much faster than re2 (and written in Rust!).

I dunno, do you two use this thing much? Would these changes be helpful? Would you be interested in helping? I'm sure we're all busy. Just throwing these ideas out there in case you're interested!

DGaffney commented 2 years ago

Thanks to both of you! To be perfectly honest, I'm using this library only as a means to an end for a fairly separate issue. I am using selenium to visit URLs, then comparing the URLs of the network transfers loaded downstream of the root request against the rule set, with my own custom rules.should_block usage, to mark transfers as originating from ad servers. I'm also going further than that and looking at the elements on the page that may or may not be ads. For that, I'm currently using parse_filterlist from abp.filters in https://github.com/adblockplus/python-abp.

Right now, for my proof of concept, I'm not too concerned about speed, but long term it will be an issue. There are 29k CSS matching rules that I consider for each HTML document, and for each rule I have to run a selenium driver.find_elements_by_css_selector(rule) lookup, which takes ≈12-13 minutes per site right now. I'm sure there are more clever ways, but brute force is sufficient to at least show the idea works in principle. If you have thoughts about speeding up that portion, I'm all ears, but I should show my hand here and say that my use cases for the repo you've built are tangential. That said, happy to help where it may be useful!
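One possible way to cut down the number of lookups, sketched here as an untested idea rather than something anyone in this thread has benchmarked: CSS allows several selectors to be joined with commas, so a whole group of rules can be checked in one call, and only groups that match something need to be checked rule by rule. `matching_selectors` is a hypothetical helper, and `find` stands for whatever lookup is in use (it is assumed to take a selector string, return a list of elements, and raise if the selector is invalid).

```python
from typing import Callable, Iterable, List, Sequence


def matching_selectors(
    find: Callable[[str], Sequence],
    selectors: Iterable[str],
    batch_size: int = 200,
) -> List[str]:
    """Return the selectors that match at least one element.

    Selectors are first tested in comma-joined groups; a group that matches
    nothing is discarded after a single lookup.
    """
    cleaned = [s.strip() for s in selectors if s.strip()]
    hits = []
    for start in range(0, len(cleaned), batch_size):
        group = cleaned[start:start + batch_size]
        try:
            if not find(", ".join(group)):
                continue  # nothing in this group is on the page
        except Exception:
            # One invalid selector makes the whole joined selector invalid,
            # so fall through and test this group's members individually.
            pass
        for selector in group:
            try:
                if find(selector):
                    hits.append(selector)
            except Exception:
                continue  # the browser rejects this selector; skip it
    return hits
```

With Selenium 4 the lookup could be passed as `lambda css: driver.find_elements(By.CSS_SELECTOR, css)`. Since most rules match nothing on a given page, most groups should be dismissed with one call each; filtering the rule list down to the rules that apply to the page's domain beforehand would shrink the work further.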

Bass-03 commented 2 years ago

Hey @epitron, I was looking into creating something like this a while back. Then I found this project and sort of stopped.

I have some insights on adblocking, I might be able to help.