The use case: a regular crawler crawls homepage -> categories -> detail pages.
If the homepage fails, the whole crawl fails. If a category page fails, a lot of detail pages are missing; if a detail page fails, only one page is missing. Ideally, if you know the pressure on your proxies is high, you would want to assign different maximum retry counts to differently important page types.
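For concreteness, here is roughly what such a crawl looks like with label-based routing - a minimal sketch using a `CheerioCrawler`, where the selectors and URL are placeholders:

```ts
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // One retry limit shared by every page type - the limitation discussed here.
    maxRequestRetries: 3,
    async requestHandler({ request, enqueueLinks }) {
        switch (request.label) {
            case undefined: // homepage: if it fails, the whole crawl fails
                await enqueueLinks({ selector: 'a.category', label: 'CATEGORY' });
                break;
            case 'CATEGORY': // if one fails, many detail pages go missing
                await enqueueLinks({ selector: 'a.product', label: 'DETAIL' });
                break;
            case 'DETAIL': // if one fails, only a single page goes missing
                // extract the data here
                break;
        }
    },
});

await crawler.run(['https://example.com']);
```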
Currently, there are two workarounds:

- Update the `retryCount` of the request on the fly - dirty, and it produces misleading logs (see the sketch after this list).
- Spawn multiple crawlers, one for each page type - even dirtier.
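The first workaround could look something like this - a rough sketch, assuming the `errorHandler` hook (invoked when a request fails, before any retry) and the `retryCount`/`noRetry` fields on the request; the per-label budgets are made up for illustration:

```ts
import { CheerioCrawler } from 'crawlee';

// Hypothetical per-label retry budgets, chosen only for illustration.
const RETRY_BUDGET: Record<string, number> = { CATEGORY: 5, DETAIL: 1 };

const crawler = new CheerioCrawler({
    // Must be high enough for the most important page type (the homepage).
    maxRequestRetries: 10,
    errorHandler({ request }) {
        const budget = RETRY_BUDGET[request.label ?? ''];
        if (budget !== undefined && request.retryCount >= budget) {
            // Give up early for less important labels. Bumping request.retryCount
            // up to the crawler-level maximum has the same effect, but that is
            // exactly what makes the retry numbers in the logs wrong.
            request.noRetry = true;
        }
    },
    async requestHandler() {
        /* ... */
    },
});
```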
I guess `maxRetryCount` on the `Request` could override the `maxRetryCount` on the crawler level; if not present, it would default to the crawler-level value. It adds another `if` to the code, but a solution along these lines would be useful. Maybe there is something cleaner.
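From user code, the proposal might look like the sketch below. The per-request `maxRetryCount` option is hypothetical - it is the API being requested here, not something that exists:

```ts
import { CheerioCrawler, Request } from 'crawlee';

const crawler = new CheerioCrawler({
    maxRequestRetries: 10, // crawler-level default
    async requestHandler() {
        /* ... */
    },
});

await crawler.run([
    'https://example.com', // homepage: falls back to the crawler-level limit of 10
    // Hypothetical per-request override proposed above, not an existing option.
    new Request({ url: 'https://example.com/detail/1', label: 'DETAIL', maxRetryCount: 1 }),
]);

// Inside the crawler, the extra `if` would resolve the effective limit as:
//   const limit = request.maxRetryCount ?? this.maxRequestRetries;
```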