TechnikEmpire / DistillNET

DistillNET is a library for matching and filtering HTTP requests and HTML response content using the Adblock Plus Filter format.
Mozilla Public License 2.0

Future Goals: CSS selectors and CEFSharp #6

Open Esowteric opened 6 years ago

Esowteric commented 6 years ago

Great project, thanks! Have got URL blocking working fine in a CEFSharp browser project's BrowserRequestHandler class. It stops a torrent of ads on sites like www.tvguide.co.uk, and the code (called in OnBeforeBrowse() and OnBeforeResourceLoad()) is fast.

Looking forward to being able to hide content using CSS.

In the case of CEFSharp, I already examine the DOM in .NET using HtmlAgilityPack. To get the DOM, I can either use frame.GetSourceAsync() or browser.GetSourceAsync() in the browser's FrameLoadEnd() handler. Or I can get it from JavaScript, using a .NET BoundObject class (that allows communication between .NET and JavaScript).

Not sure how to apply the CSS to the loaded content. In OnFrameLoadEnd(), I can call ExecuteJavaScriptAsync(), and perhaps that could be used to set the CSS?
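For what it's worth, the script side of that idea might look something like the sketch below. It only builds a stylesheet that hides matching elements and appends it to the page. The function names buildHideCss and injectHideCss are made up for illustration, and the selector list is assumed to come from wherever the hiding rules end up being parsed; none of this is part of DistillNET or CEFSharp.

```javascript
// Hypothetical helper: turn a list of CSS selectors into hiding rules.
// `selectors` is assumed to be an array of selector strings.
function buildHideCss(selectors) {
  return selectors
    .map((s) => s.trim())
    .filter((s) => s.length > 0)
    .map((s) => `${s} { display: none !important; }`)
    .join("\n");
}

// Hypothetical helper: append the stylesheet to the page.
// `doc` is the page's `document` object.
function injectHideCss(doc, selectors) {
  const style = doc.createElement("style");
  style.textContent = buildHideCss(selectors);
  // Fall back to the root element if the page has no <head> yet.
  (doc.head || doc.documentElement).appendChild(style);
  return style;
}
```

The text handed to ExecuteJavaScriptAsync() would then be these two functions followed by a call such as injectHideCss(document, [".ad-banner", "#sponsored"]); (the two selectors are placeholders). A stylesheet has the advantage that it also applies to elements added after the frame has finished loading.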

Sorry: I'm just thinking aloud here. Many thanks again.

TechnikEmpire commented 6 years ago

@Esowteric Thanks for the info, and it's nice to see someone getting use out of your work. Yes, the filter engine here is stupidly fast, and AFAIK pretty much the full Adblock Plus syntax is supported, minus rarely used things like ping etc., so you should be able to load lists like EasyList straight in, although I have not tried this.
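For anyone wondering what the CSS side of those lists looks like: in the Adblock Plus format, element hiding rules put an optional comma-separated domain list before "##" and a CSS selector after it, with "#@#" marking an exception. The sketch below splits such a line into its parts. It is an illustration only, not DistillNET's parser; the function name parseElementHidingRule and the shape of the returned object are assumptions, and extended forms such as "#?#" are deliberately not handled.

```javascript
// Hypothetical parser for plain element hiding rules ("##" and "#@#").
// Returns null for comments, blank lines, URL filters and unsupported forms.
function parseElementHidingRule(line) {
  const text = line.trim();
  if (text === "" || text.startsWith("!")) return null; // "!" starts a comment

  // <domains>#<optional @>#<selector>
  const match = /^([^#]*)#(@)?#(.+)$/.exec(text);
  if (!match) return null;

  const [, domainPart, exceptionFlag, selector] = match;
  const domains = domainPart
    .split(",")
    .map((d) => d.trim())
    .filter((d) => d.length > 0);

  return {
    // Domains the rule applies to; an empty list means every site.
    include: domains.filter((d) => !d.startsWith("~")),
    // Domains prefixed with "~" are excluded from the rule.
    exclude: domains.filter((d) => d.startsWith("~")).map((d) => d.slice(1)),
    isException: exceptionFlag === "@",
    selector,
  };
}
```

With this, a line like "example.com,~shop.example.com##div.promo" yields the selector "div.promo", one included domain and one excluded domain, while a URL filter such as "||ads.example.com^$script" returns null and would go to the request-matching side instead.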

I wrote all of this functionality in another project in C++ once before. I had CSS filtering done courtesy of GQ, which relies on the Gumbo parser and is very fast. Only my tree-building code in that project needed some work, as it is the current bottleneck. I had planned to use MyHTML to re-implement this functionality for max speed, but that's wandering into C++/CLI land, and I'd ideally like to keep this project portable so it can be used in Xamarin for iPhone/Android etc.

Short version: I'm still planning how to tackle this but it's already been done before, it's only a matter of time before this is added.

TechnikEmpire commented 6 years ago

Regarding applying the changes in CEFSharp, I'm not sure. I'm just about to have a peek at CEFSharp to see if/where I can modify images before they load.

Esowteric commented 6 years ago

Many thanks for the response and the pointers, @TechnikEmpire.

Btw, and you probably know this already: the reason for going round the island in CEFSharp is that, unlike the MS WebBrowser control, CEFSharp doesn't directly expose the DOM, and certainly not for writing. At least, I think I've got that right.

TechnikEmpire commented 6 years ago

@Esowteric Yeah, it looks like you're right. You can do a little hack where you specify a filter for a specific MIME type and then use a custom request handler to inject those filters. This way you can dynamically collect and parse/inspect/modify the content on the fly, but it's super finicky: you're basically made the middleman between two streams, the output stream is of a fixed size, and exceeding that limit causes failures. This feature is a nightmare and rather unstable, but it appears to be all you've got.

Alternatively, you can spawn these filters and have them write the data through unchanged while collecting a copy. Then, in OnResourceLoadComplete, look up the request's filter by the request ID, grab the data, inspect it, and if you find something bad, force a navigation change or load different HTML. I dunno. The whole project seems useless, basically just the ability to re-skin Chrome.

Esowteric commented 6 years ago

Thanks again, Jessie @TechnikEmpire. I was using GeckoFX, but opted for CEFSharp because the project is still active, with a decent Q&A base.

Blocking URLs is fine: that's the good news.

The CSS really is a head-banger, though, because I want to grab the source HTML in FrameLoadEnd(), and that's async, so I'm trying various hacks to get that to wait. That's followed by running the HTML past CsQuery (for now) and then injecting JavaScript to modify the actual content. And that JS call is also async (and doesn't want to work if I put it inside the earlier HTML-grabbing task).

So, no surprise that (a) it's a major bottleneck, (b) I have to invoke all over the place, and (c) I'm getting exceptions like "The current SynchronizationContext may not be used as a TaskScheduler."
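One thing that might be worth trying, purely as a sketch, is to let the page do the selector matching itself, so there is a single script call instead of an HTML grab followed by a second call. The function below hides whatever matches and returns a small JSON report. The name hideMatches and the report format are invented here, and the selectors are assumed to be already known on the .NET side.

```javascript
// Hypothetical in-page helper: hide every element matching the selectors
// and report how many elements each selector touched.
// `doc` is the page's `document`; `selectors` is an array of strings.
function hideMatches(doc, selectors) {
  const report = [];
  for (const selector of selectors) {
    let nodes;
    try {
      nodes = doc.querySelectorAll(selector);
    } catch (err) {
      // querySelectorAll throws on selectors the browser cannot parse,
      // so record the failure and move on to the next selector.
      report.push({ selector, hidden: 0, error: String(err.message || err) });
      continue;
    }
    let hidden = 0;
    for (const node of nodes) {
      node.style.setProperty("display", "none", "important");
      hidden += 1;
    }
    report.push({ selector, hidden });
  }
  // A string is easy to hand back to the host if it can read script results.
  return JSON.stringify(report);
}
```

Unlike an injected stylesheet, this only affects elements present at the moment it runs, so it would need to be re-run for content that arrives later.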

Still love the challenge, though. :)

With good wishes, Eric T.

TechnikEmpire commented 6 years ago

I think you can create your own synchronization context or look up the UI context and use it. I had to do this when I developed a multi-window WPF app once. The only problem I ran into doing that was having shared UI resources. You cannot share such resources across contexts when they're UI threads, IIRC, and so I had to do per-window resource libraries instead of putting them in app.xaml. That was a couple of years ago, though, so I might not be recalling all that correctly. Also, you may want to look into AngleSharp instead of CsQuery, as CsQuery is dead and abandoned and, if the benchmarks are to be believed, AngleSharp performs better.

Another, slightly more difficult approach you may want to try is to use an off-screen browser. I did this once for another project with CEFSharp. I had two browser objects, one that the user saw, and another that was off-screen doing all the actual work. I too did something similar, where I would execute JS against the pages post-load, let it do its work, then pull the results out and load them into the visible browser. See this.

Best. Will report back when I get time to look into the CSS selector functionality.

Esowteric commented 6 years ago

Some great pointers there. All the best with your projects.

TechnikEmpire commented 6 years ago

I think this will get some attention soon; I'd just like to find an HTML parser to build upon that isn't garbage. I can't un-see the benchmarks of MyHTML.

antoniskoin commented 5 years ago

Is it possible to show the way you implemented it on CefSharp? I wasn't able to achieve this yet.