apify / crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev
Apache License 2.0
14.95k stars · 625 forks

Add PuppeteerPool.retire(browser) method #122

Closed mtrunkat closed 6 years ago

mtrunkat commented 6 years ago

Use case:

I am scraping a website with concurrency higher than one, with multiple tabs open in each browser. In handlePageFunction or gotoFunction I can detect that I was blocked by anti-scraping protection, so I need to retry the URL with a different IP address.

Workaround:

The easiest way is to kill the browser using browser.close() and throw an error so that the request gets retried in a new browser. The problem is that this kills all open tabs, even though they may be processing successfully opened pages.
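The workaround flow can be sketched as follows. This is a minimal, runnable simulation with mocked browser and page objects (the real ones would come from Puppeteer and the pool); the `blocked` flag stands in for whatever anti-scraping detection the handler performs:

```javascript
// Mocked browser: tracks only whether close() was called.
const makeBrowser = () => {
    const browser = { closed: false, close: async () => { browser.closed = true; } };
    return browser;
};

// Sketch of the workaround: on a detected block, close the whole browser
// and rethrow so the crawler retries the request in a new browser.
const handlePage = async ({ browser, blocked }) => {
    if (blocked) {
        // Drawback: this also kills every other tab the browser has open.
        await browser.close();
        throw new Error('Request was blocked!');
    }
    return 'ok';
};

const demo = async () => {
    const browser = makeBrowser();
    let retried = false;
    try {
        await handlePage({ browser, blocked: true });
    } catch (e) {
        retried = true; // the crawler would re-enqueue the request here
    }
    return { closed: browser.closed, retried };
};
```

The drawback shown in the comment is exactly the motivation for the issue: closing the browser is indiscriminate, so healthy tabs die along with the blocked one.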

Proper implementation:

Add a puppeteerPool.retire(browser) or browser.retire() method. In the case mentioned above, one would call:

handlePageFunction: async ({ browser, puppeteerPool, request }) => {
    // ...
    // Mark the browser as retired so no new tabs are opened in it,
    // then throw so this request is retried in a fresh browser.
    puppeteerPool.retire(browser);
    throw new Error('Request was blocked!');
}

metalwarrior665 commented 6 years ago

This means the browser would wait for other tabs to finish their functions and then shut down?

mtrunkat commented 6 years ago

Yes, a retired browser waits for all of its tabs to be closed before shutting down (with a 5-minute timeout).
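The retire semantics described here can be sketched as a small state machine: a retired browser accepts no new tabs and is closed once its open-tab count reaches zero. This is a simplified illustration, not the actual pool implementation; a real pool would also enforce the 5-minute hard timeout:

```javascript
// Simplified sketch of retire() semantics: no new tabs after retirement,
// graceful shutdown once the last existing tab closes.
class PooledBrowser {
    constructor() {
        this.openTabs = 0;
        this.retired = false;
        this.closed = false;
    }
    openTab() {
        // The pool stops assigning new pages to a retired browser.
        if (this.retired) throw new Error('Cannot open tab in a retired browser');
        this.openTabs += 1;
    }
    closeTab() {
        this.openTabs -= 1;
        // Last tab gone and browser retired -> shut the browser down.
        if (this.retired && this.openTabs === 0) this.closed = true;
    }
    retire() {
        this.retired = true;
        if (this.openTabs === 0) this.closed = true;
    }
}
```

For example, with two tabs open, calling retire() leaves both tabs running; the browser only closes after the second closeTab() call.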

mtrunkat commented 6 years ago

Fixed.