Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
Use case:
I am scraping a website with concurrency higher than one and have multiple tabs open in each browser. I can detect in `handlePageFunction` or `gotoFunction` that I was blocked by anti-scraping protection, so I need to retry the URL with a different IP address.
Workaround:
The easiest way is to kill the browser with `browser.close()` and throw an error so that the request gets retried in a new browser. The problem is that this kills all of the browser's open tabs, even though they may be processing successfully opened pages.
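To make the downside concrete, here is a minimal, self-contained sketch (plain JavaScript, no Puppeteer; all class and method names are illustrative, not Crawlee's actual API) showing how `browser.close()` takes down an innocent tab along with the blocked one:

```javascript
// Workaround sketch only: simplified stand-ins for a browser and its tabs.
// The point: close() on the shared browser kills EVERY tab it owns.
class FakeTab {
  constructor(browser) { this.browser = browser; this.closed = false; }
}
class FakeBrowser {
  constructor() { this.tabs = []; }
  newPage() { const t = new FakeTab(this); this.tabs.push(t); return t; }
  close() { this.tabs.forEach((t) => { t.closed = true; }); }
}

// Two tabs share one browser; only one of them hits a block.
const browser = new FakeBrowser();
const healthyTab = browser.newPage(); // mid-processing a good page
const blockedTab = browser.newPage(); // anti-scraping block detected here

try {
  // The workaround: kill the whole browser and throw so the request retries.
  blockedTab.browser.close();
  throw new Error('Request was blocked!');
} catch (e) {
  // The crawler's retry machinery would re-enqueue the request here.
}

console.log(healthyTab.closed); // true: the innocent tab died too
```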
Proper implementation:
Add a `puppeteerPool.retire(browser)` or `browser.retire()` method, and in the case mentioned above call:
```js
handlePageFunction({ browser, puppeteerPool, request }) {
    // ...
    puppeteerPool.retire(browser);
    throw new Error('Request was blocked!');
    // ...
}
```
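To illustrate the semantics being proposed, here is a minimal, self-contained sketch (plain JavaScript, no Puppeteer; the pool and browser classes are illustrative stand-ins, not Crawlee's implementation) of a pool where `retire()` stops a browser from receiving new pages while its already-open tabs keep running until they finish:

```javascript
// Illustrative sketch only: models the proposed retire() behavior.
// A retired browser accepts no new pages, but open tabs stay alive.
class FakeBrowser {
  constructor(id) {
    this.id = id;
    this.retired = false;
    this.openPages = 0;
  }
  newPage() {
    if (this.retired) throw new Error(`Browser ${this.id} is retired`);
    this.openPages += 1;
    return { close: () => { this.openPages -= 1; } };
  }
}

class FakePuppeteerPool {
  constructor() {
    this.browsers = [];
    this.nextId = 0;
  }
  getBrowser() {
    // Reuse the first non-retired browser, or launch a fresh one.
    let browser = this.browsers.find((b) => !b.retired);
    if (!browser) {
      browser = new FakeBrowser(this.nextId++);
      this.browsers.push(browser);
    }
    return browser;
  }
  retire(browser) {
    // Unlike browser.close(), this does not kill the open tabs.
    browser.retired = true;
  }
}

const pool = new FakePuppeteerPool();
const first = pool.getBrowser();
const tab = first.newPage();  // opened before the block was detected

pool.retire(first);           // blocked: give this browser no new work
const second = pool.getBrowser();

console.log(second.id !== first.id); // true: retries land in a fresh browser
console.log(first.openPages);        // 1: the old tab is still processing
tab.close();
console.log(first.openPages);        // 0: it finished gracefully
```

The design point is that retirement is a soft shutdown: the retried request is guaranteed a different browser (and thus a chance at a different proxy/IP), while work already in flight in the old browser is not lost.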