apify / crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev
Apache License 2.0

Don't enqueue over `maxRequestsPerCrawl` #2728

Open barjin opened 1 month ago

barjin commented 1 month ago

Dynamic crawlers using a RequestQueue often enqueue URLs that never get processed because of the `maxRequestsPerCrawl` limit. This causes unnecessary RQ writes, which can be expensive, both computationally and, in the case of cloud RQ providers, financially.

Calls to `enqueueLinks` or `addRequests` on the crawler instance could become no-ops as soon as the related RequestQueue's length reaches `maxRequestsPerCrawl`; see the sketch below.
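For illustration, here is a minimal user-land version of that guard, a sketch rather than the proposed built-in fix. It assumes the crawler's public `requestQueue` property and the `totalRequestCount` field returned by `RequestQueue.getInfo()`; the mirrored `MAX_REQUESTS_PER_CRAWL` constant is ours, and a built-in guard would read the option internally instead:

```ts
import { CheerioCrawler } from 'crawlee';

// Mirrors the crawler option below; a built-in guard would read it internally.
const MAX_REQUESTS_PER_CRAWL = 20;

const crawler = new CheerioCrawler({
    maxRequestsPerCrawl: MAX_REQUESTS_PER_CRAWL,
    async requestHandler({ enqueueLinks }) {
        const info = await crawler.requestQueue?.getInfo();
        // Once the queue already holds as many requests as the crawl will ever
        // process, every further enqueue is a wasted (and possibly paid) RQ write.
        if (info && info.totalRequestCount >= MAX_REQUESTS_PER_CRAWL) return;
        await enqueueLinks();
    },
});

await crawler.run(['https://crawlee.dev']);
```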

Possible issues & considerations

janbuchar commented 1 month ago

Another possible problem is that with a high failure rate, you could get far fewer than `maxRequestsPerCrawl` results if you cut off the request queue too early (e.g., with `maxRequestsPerCrawl: 20` and 5 requests failing permanently, you end up with only 15 results).

barjin commented 1 month ago

Afaik that's expected with `maxRequestsPerCrawl` - with e.g. `maxRequestsPerCrawl: 20`, only 20 Request objects will be processed (each possibly retried up to `maxRequestRetries` times on errors), regardless of their success / failure state.

If I understand the current codebase correctly, any requests in the RQ beyond the `maxRequestsPerCrawl` limit will never be touched; the configuration sketch below makes these semantics concrete.
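A short configuration sketch of those semantics as described in this thread (the option values are arbitrary, and the comments restate the behavior claimed above rather than documented guarantees):

```ts
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Only the first 20 unique Request objects are ever processed...
    maxRequestsPerCrawl: 20,
    // ...and each of them may be retried up to 3 times on error. Success
    // or failure, a processed request consumes the 20-request budget.
    maxRequestRetries: 3,
    async requestHandler({ enqueueLinks }) {
        // Anything enqueued past the budget is written to the RQ but never
        // picked up - exactly the wasted writes this issue is about.
        await enqueueLinks();
    },
});

await crawler.run(['https://crawlee.dev']);
```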