The use case: a regular crawler crawls homepage -> categories -> detail pages.
If the homepage fails, the whole crawl fails. If a category page fails, a lot of detail pages are missing; if a detail page fails, only one page is missing. Ideally, if you know the pressure on your proxies is high, you would want to assign different maximum retry counts to differently important page types.
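For concreteness, here is roughly what such a crawl looks like with label-based routing - a minimal sketch using a `CheerioCrawler`, where the selectors and URL are placeholders:

```ts
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // One retry limit shared by every page type - the limitation discussed here.
    maxRequestRetries: 3,
    async requestHandler({ request, enqueueLinks }) {
        switch (request.label) {
            case undefined: // homepage: if it fails, the whole crawl fails
                await enqueueLinks({ selector: 'a.category', label: 'CATEGORY' });
                break;
            case 'CATEGORY': // if one fails, many detail pages go missing
                await enqueueLinks({ selector: 'a.product', label: 'DETAIL' });
                break;
            case 'DETAIL': // if one fails, only a single page goes missing
                // extract the data here
                break;
        }
    },
});

await crawler.run(['https://example.com']);
```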
Currently, there are two workarounds:

- Update the `retryCount` of the request on the fly - dirty, and it produces misleading logs (see the sketch after this list).
- Spawn multiple crawlers, one for each page type - even dirtier.
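The first workaround could look something like this - a rough sketch, assuming the `errorHandler` hook (invoked when a request fails, before any retry) and the `retryCount`/`noRetry` fields on the request; the per-label budgets are made up for illustration:

```ts
import { CheerioCrawler } from 'crawlee';

// Hypothetical per-label retry budgets, chosen only for illustration.
const RETRY_BUDGET: Record<string, number> = { CATEGORY: 5, DETAIL: 1 };

const crawler = new CheerioCrawler({
    // Must be high enough for the most important page type (the homepage).
    maxRequestRetries: 10,
    errorHandler({ request }) {
        const budget = RETRY_BUDGET[request.label ?? ''];
        if (budget !== undefined && request.retryCount >= budget) {
            // Give up early for less important labels. Bumping request.retryCount
            // up to the crawler-level maximum has the same effect, but that is
            // exactly what makes the retry numbers in the logs wrong.
            request.noRetry = true;
        }
    },
    async requestHandler() {
        /* ... */
    },
});
```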
I guess `maxRetryCount` on the `Request` could override the `maxRetryCount` on the crawler level; if not present, it would default to the crawler-level value. It adds another `if` to the code, but a solution along these lines would be useful. Maybe there is something cleaner.
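From user code, the proposal might look like the sketch below. The per-request `maxRetryCount` option is hypothetical - it is the API being requested here, not something that exists:

```ts
import { CheerioCrawler, Request } from 'crawlee';

const crawler = new CheerioCrawler({
    maxRequestRetries: 10, // crawler-level default
    async requestHandler() {
        /* ... */
    },
});

await crawler.run([
    'https://example.com', // homepage: falls back to the crawler-level limit of 10
    // Hypothetical per-request override proposed above, not an existing option.
    new Request({ url: 'https://example.com/detail/1', label: 'DETAIL', maxRetryCount: 1 }),
]);

// Inside the crawler, the extra `if` would resolve the effective limit as:
//   const limit = request.maxRetryCount ?? this.maxRequestRetries;
```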