apify / crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev
Apache License 2.0

Don't enqueue over `maxRequestsPerCrawl` #2728

Open barjin opened 1 month ago

barjin commented 1 month ago

Dynamic crawlers using a RequestQueue often enqueue URLs that never get processed because of the `maxRequestsPerCrawl` limit. This causes unnecessary RQ writes, which can be expensive, both computationally and, in the case of cloud RQ providers, financially.

Calls to `enqueueLinks` or `addRequests` on the crawler instance could become no-ops as soon as the related RequestQueue's length reaches `maxRequestsPerCrawl`; see the sketch below.
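For illustration, here is a minimal user-land version of that guard, a sketch rather than the proposed built-in fix. It assumes the crawler's public `requestQueue` property and the `totalRequestCount` field returned by `RequestQueue.getInfo()`; the mirrored `MAX_REQUESTS_PER_CRAWL` constant is ours, and a built-in guard would read the option internally instead:

```ts
import { CheerioCrawler } from 'crawlee';

// Mirrors the crawler option below; a built-in guard would read it internally.
const MAX_REQUESTS_PER_CRAWL = 20;

const crawler = new CheerioCrawler({
    maxRequestsPerCrawl: MAX_REQUESTS_PER_CRAWL,
    async requestHandler({ enqueueLinks }) {
        const info = await crawler.requestQueue?.getInfo();
        // Once the queue already holds as many requests as the crawl will ever
        // process, every further enqueue is a wasted (and possibly paid) RQ write.
        if (info && info.totalRequestCount >= MAX_REQUESTS_PER_CRAWL) return;
        await enqueueLinks();
    },
});

await crawler.run(['https://crawlee.dev']);
```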

Possible issues & considerations

janbuchar commented 1 month ago

Another possible problem is that with a high failure rate, you could get far fewer than `maxRequestsPerCrawl` results if you cut off the request queue too early (e.g., with `maxRequestsPerCrawl: 20` and 5 requests failing permanently, you end up with only 15 results).

barjin commented 1 month ago

Afaik that's expected with `maxRequestsPerCrawl` - with e.g. `maxRequestsPerCrawl: 20`, only 20 Request objects will be processed (each possibly retried up to `maxRequestRetries` times on errors), regardless of their success / failure state.

If I understand the current codebase correctly, any requests in the RQ beyond the `maxRequestsPerCrawl` limit will never be touched; the configuration sketch below makes these semantics concrete.
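A short configuration sketch of those semantics as described in this thread (the option values are arbitrary, and the comments restate the behavior claimed above rather than documented guarantees):

```ts
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Only the first 20 unique Request objects are ever processed...
    maxRequestsPerCrawl: 20,
    // ...and each of them may be retried up to 3 times on error. Success
    // or failure, a processed request consumes the 20-request budget.
    maxRequestRetries: 3,
    async requestHandler({ enqueueLinks }) {
        // Anything enqueued past the budget is written to the RQ but never
        // picked up - exactly the wasted writes this issue is about.
        await enqueueLinks();
    },
});

await crawler.run(['https://crawlee.dev']);
```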