apify / crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev
Apache License 2.0
14.95k stars · 625 forks

Add PuppeteerPool.retire(browser) method #122

Closed mtrunkat closed 6 years ago

mtrunkat commented 6 years ago

Use case:

I am scraping a website with concurrency higher than one, with multiple tabs open in each browser. In handlePageFunction or gotoFunction I can detect that I was blocked by anti-scraping protection, so I need to retry the URL with a different IP address.

Workaround:

The easiest way is to kill the browser using browser.close() and throw an error so that the request gets retried in a new browser. The problem is that this kills all open tabs, even though they may be processing successfully opened pages.
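The workaround flow can be sketched as follows. This is a minimal, runnable simulation with mocked browser and page objects (the real ones would come from Puppeteer and the pool); the `blocked` flag stands in for whatever anti-scraping detection the handler performs:

```javascript
// Mocked browser: tracks only whether close() was called.
const makeBrowser = () => {
    const browser = { closed: false, close: async () => { browser.closed = true; } };
    return browser;
};

// Sketch of the workaround: on a detected block, close the whole browser
// and rethrow so the crawler retries the request in a new browser.
const handlePage = async ({ browser, blocked }) => {
    if (blocked) {
        // Drawback: this also kills every other tab the browser has open.
        await browser.close();
        throw new Error('Request was blocked!');
    }
    return 'ok';
};

const demo = async () => {
    const browser = makeBrowser();
    let retried = false;
    try {
        await handlePage({ browser, blocked: true });
    } catch (e) {
        retried = true; // the crawler would re-enqueue the request here
    }
    return { closed: browser.closed, retried };
};
```

The drawback shown in the comment is exactly the motivation for the issue: closing the browser is indiscriminate, so healthy tabs die along with the blocked one.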

Proper implementation:

Add a puppeteerPool.retire(browser) or browser.retire() method. In the case mentioned above, one would call:

handlePageFunction: async ({ browser, puppeteerPool, request }) => {
    // ...
    // Mark the browser as retired so no new tabs are opened in it,
    // then throw so this request is retried in a fresh browser.
    puppeteerPool.retire(browser);
    throw new Error('Request was blocked!');
}

metalwarrior665 commented 6 years ago

This means the browser would wait for other tabs to finish their functions and then shut down?

mtrunkat commented 6 years ago

Yes, a retired browser waits for all of its tabs to be closed before shutting down (with a 5-minute timeout).
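The retire semantics described here can be sketched as a small state machine: a retired browser accepts no new tabs and is closed once its open-tab count reaches zero. This is a simplified illustration, not the actual pool implementation; a real pool would also enforce the 5-minute hard timeout:

```javascript
// Simplified sketch of retire() semantics: no new tabs after retirement,
// graceful shutdown once the last existing tab closes.
class PooledBrowser {
    constructor() {
        this.openTabs = 0;
        this.retired = false;
        this.closed = false;
    }
    openTab() {
        // The pool stops assigning new pages to a retired browser.
        if (this.retired) throw new Error('Cannot open tab in a retired browser');
        this.openTabs += 1;
    }
    closeTab() {
        this.openTabs -= 1;
        // Last tab gone and browser retired -> shut the browser down.
        if (this.retired && this.openTabs === 0) this.closed = true;
    }
    retire() {
        this.retired = true;
        if (this.openTabs === 0) this.closed = true;
    }
}
```

For example, with two tabs open, calling retire() leaves both tabs running; the browser only closes after the second closeTab() call.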

mtrunkat commented 6 years ago

Fixed.