apify / crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev
Apache License 2.0

Integrate adblocker functionality #456

Open jakubbalada opened 4 years ago

jakubbalada commented 4 years ago

Interesting tip from HN (for Dashblock):

Maybe you already do it, but I think integrating adblocker functionality when loading JS sites would be desirable to reduce load time. And if ads are what the API user is interested in, perhaps add a flag for whether or not one wants ads to load.

Recommendation: https://github.com/cliqz-oss/adblocker
It should be the fastest adblocker library (used by Ghostery, Cliqz and Brave).

mtrunkat commented 4 years ago

This could be integrated into the Apify.launchPuppeteer() function as a useAdBlock: true option.

https://sdk.apify.com/docs/api/apify#module_Apify.launchPuppeteer
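
A minimal sketch of how such a flag might be handled, assuming it is opt-in and simply stripped from the options before they reach the launcher. `useAdBlock` is only the name proposed here, and `splitLaunchOptions`, `launchWithOptionalAdBlock`, `launch` and `enableBlocking` are hypothetical names for illustration, not existing SDK functions.

```javascript
// Hypothetical sketch: `useAdBlock` is only the option name proposed in this
// thread, not an existing Apify SDK option.

/** Separate the proposed flag from the options forwarded to the launcher. */
function splitLaunchOptions(options = {}) {
    // Assumption: blocking is off unless the caller explicitly asks for it.
    const { useAdBlock = false, ...launchOptions } = options;
    return { useAdBlock: Boolean(useAdBlock), launchOptions };
}

/**
 * `launch` and `enableBlocking` are injected stand-ins: for example a
 * browser-launching function and a function that attaches an ad-blocking
 * library to the browser's pages.
 */
async function launchWithOptionalAdBlock(launch, enableBlocking, options) {
    const { useAdBlock, launchOptions } = splitLaunchOptions(options);
    const browser = await launch(launchOptions);
    if (useAdBlock) {
        // Only runs when the caller opted in.
        await enableBlocking(browser);
    }
    return browser;
}
```

Keeping the flag out of the forwarded options means the launcher sees exactly the same input as before when the flag is absent.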

Darking360 commented 4 years ago

Greetings. So the thing would be to implement an ad blocker to increase the speed of the scrape/crawl? I could work on this 🙏

mtrunkat commented 4 years ago

Yes, exactly. It could boost the speed, especially for websites that are heavy on ads (news sites). But it would be great to test this assumption first. Would you also be interested in trying this out, using the Apify SDK to run a scraper with and without an ad blocker against some websites?
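
One rough way to structure that comparison is to time several runs in each mode and compare the medians. This is only a sketch: `runScraper` is a hypothetical function that crawls a fixed list of URLs and accepts the proposed `useAdBlock` flag, and the helper names are made up for illustration.

```javascript
/** Median of a non-empty array of numbers. */
function median(values) {
    const sorted = [...values].sort((a, b) => a - b);
    const mid = Math.floor(sorted.length / 2);
    return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

/** Compare median run durations (in ms) without and with ad blocking. */
function compareRuns(withoutAdBlockMs, withAdBlockMs) {
    const baseline = median(withoutAdBlockMs);
    const blocked = median(withAdBlockMs);
    // Positive savedFraction means the ad-blocked runs were faster.
    return { baseline, blocked, savedFraction: (baseline - blocked) / baseline };
}

/**
 * Time a single run. `runScraper` is a hypothetical async function that
 * crawls the same URLs every time and honours the `useAdBlock` flag.
 */
async function timeRun(runScraper, useAdBlock) {
    const start = Date.now();
    await runScraper({ useAdBlock });
    return Date.now() - start;
}
```

Medians are used rather than means so that a single slow run (a network hiccup, a cold cache) does not dominate the result.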

Darking360 commented 4 years ago

Sure! I can set up a test with some timing output to check this first. I'll create it, run it, and attach the results here for you to see. Thank you 🚀

pocesar commented 4 years ago

Interesting. I manually block all the common ad networks using blockRequests; this would offload the task to the extension.
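
For illustration, the manual approach boils down to checking each request URL against a hand-maintained list of patterns. The sketch below shows that matching logic on its own, without a browser; the pattern list is an illustrative assumption (not the actual list used), and `isAdRequest` and `partitionRequests` are hypothetical helper names.

```javascript
// Illustrative, incomplete list of ad-network URL fragments.
// This is an assumption for the example, not a list taken from this thread.
const AD_URL_PATTERNS = ['doubleclick.net', 'googlesyndication.com', 'adservice.google.com'];

/** True when the request URL contains any of the given fragments. */
function isAdRequest(url, patterns = AD_URL_PATTERNS) {
    return patterns.some((pattern) => url.includes(pattern));
}

/** Split URLs into those that would be blocked and those allowed through. */
function partitionRequests(urls, patterns = AD_URL_PATTERNS) {
    const blocked = [];
    const allowed = [];
    for (const url of urls) {
        (isAdRequest(url, patterns) ? blocked : allowed).push(url);
    }
    return { blocked, allowed };
}
```

A list like this has to be kept up to date by hand, which is the maintenance work a dedicated ad-blocking library would take over.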

deleted-user-1 commented 4 years ago

Makes sense for a lot of users, I guess, but FYI it's an explicit anti-feature with a use-case-killing effect for me. I'd need this off with zero side effects on current behavior.

remusao commented 4 years ago

> Makes sense for a lot of users, I guess, but FYI it's an explicit anti-feature with a use-case-killing effect for me. I'd need this off with zero side effects on current behavior.

In the small POC I proposed a while ago (https://github.com/apify/apify-js/pull/600), the feature is completely disabled by default and only does any work when the user enables blocking.

mnmkng commented 4 years ago

Yeah, sorry @remusao. We still have not figured out whether the performance will improve or not. I apologize.

remusao commented 4 years ago

> Yeah, sorry @remusao. We still have not figured out if the performance will improve or not. I apologize.

Of course, no worries at all, I just wanted to make clear to @matjaeck that there should be a way to integrate such a feature without any overhead when it's disabled.