apify / crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev
Apache License 2.0

maxRequestRetries for Apify.Request class #510

Closed: metalwarrior665 closed this issue 1 year ago

metalwarrior665 commented 5 years ago

The use case is a regular crawler that crawls homepage -> categories -> detail pages.

If the homepage fails, the whole crawl fails. If a category fails, a lot of pages are missing; if a detail page fails, only one page is missing. Ideally, if you know the pressure on proxies is high, you would want to assign different max retries to page types of different importance.

Currently, there are 2 workarounds:

I guess a maxRetryCount on the Request could override the maxRetryCount on the crawler level; if not present, it would default to the crawler-level value. It adds another if to the code, but a solution along these lines would be useful. Maybe there is something cleaner. See the sketch below.
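
In code, the proposed precedence amounts to a one-line fallback. A minimal sketch, where the names mirror the wording above and are hypothetical, not an actual Crawlee API at the time this issue was opened:

```ts
// Hypothetical sketch of the proposed precedence: a per-request value,
// when set, wins over the crawler-wide maxRequestRetries default.
function effectiveMaxRetries(
    requestMaxRetryCount: number | undefined,
    crawlerMaxRequestRetries: number,
): number {
    return requestMaxRetryCount ?? crawlerMaxRequestRetries;
}
```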

B4nan commented 1 year ago

Implemented via #1925 in v3.3.3
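
A minimal usage sketch, assuming the option landed as `maxRetries` on the `Request` options (per the linked PR) and using `CheerioCrawler` purely as an example:

```ts
import { CheerioCrawler, Request } from 'crawlee';

const crawler = new CheerioCrawler({
    // Crawler-wide default, used by requests that do not override it.
    maxRequestRetries: 2,
    async requestHandler({ request, log }) {
        log.info(`Processing ${request.url} (label: ${request.label})`);
        // ... extract data / enqueue further pages here
    },
});

await crawler.addRequests([
    // Important entry points get more retries than the crawler-wide default.
    new Request({ url: 'https://example.com/', label: 'HOMEPAGE', maxRetries: 8 }),
    new Request({ url: 'https://example.com/category/1', label: 'CATEGORY', maxRetries: 5 }),
    // Detail pages simply fall back to maxRequestRetries = 2.
]);

await crawler.run();
```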

metalwarrior665 commented 1 year ago

This made me nostalgic :D