barjin opened this issue 1 month ago
Another possible problem is that if there's a high failure rate, you could get far fewer than `maxRequestsPerCrawl` results if you cut off the request queue too early.
Afaik that's expected with `maxRequestsPerCrawl` - with e.g. `maxRequestsPerCrawl: 20`, only 20 `Request` objects will be processed (and possibly retried up to `maxRequestRetries` times on errors), regardless of their success / failure state. If I understand the current codebase correctly, the requests in the RQ beyond `maxRequestsPerCrawl` will never be touched.
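For illustration, a minimal sketch of that behaviour (the crawler class, URLs and retry count are just placeholders; `maxRequestsPerCrawl` and `maxRequestRetries` are the existing crawler options):

```ts
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    maxRequestsPerCrawl: 20, // at most 20 Request objects get processed...
    maxRequestRetries: 3,    // ...each possibly retried up to 3 times on error
    async requestHandler({ enqueueLinks }) {
        // the handler may enqueue far more than 20 links, but anything
        // beyond the first 20 requests in the queue is never picked up
        await enqueueLinks();
    },
});

await crawler.run(['https://example.com']);
```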
Dynamic crawlers with `RequestQueue` often enqueue URLs that never get processed because of the `maxRequestsPerCrawl` limit. This causes unnecessary RQ writes, which can be expensive - both computationally and financially in the case of RQ cloud providers. The calls to `enqueueLinks` or `addRequests` on the crawler instance could turn into a no-op as soon as the related `RequestQueue`'s length reaches `maxRequestsPerCrawl`.
**Possible issues & considerations**
`RQ.addRequests` must still work as before (`maxRequestsPerCrawl` is a crawler option).
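One possible shape of the proposed guard is a crawler-level wrapper rather than a change to the RQ itself. The helper name and the locally tracked counter below are illustrative only; a real implementation inside the crawler would compare against the queue's actual request count instead:

```ts
import { CheerioCrawler } from 'crawlee';

// Illustrative sketch only: make crawler-level enqueues a no-op once the
// limit is reached, while RQ.addRequests keeps its current semantics.
function boundedAddRequests(crawler: CheerioCrawler, maxRequestsPerCrawl: number) {
    // Simplification: the count is tracked locally here; a real guard inside
    // the crawler would read the RequestQueue's actual length.
    let enqueued = 0;
    return async (urls: string[]) => {
        const remaining = maxRequestsPerCrawl - enqueued;
        if (remaining <= 0) return; // no-op: these requests would never be processed anyway
        const accepted = urls.slice(0, remaining);
        enqueued += accepted.length;
        await crawler.addRequests(accepted); // the underlying RQ write is unchanged
    };
}
```

Keeping the check in a wrapper like this leaves `RQ.addRequests` untouched, since the limit is a property of the crawler, not of the queue.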