Open prakashgp opened 1 week ago
Thank you for your report! It indeed seems that the `forefront` option does not work as expected with the MemoryStorage implementation of `RequestQueue`.

As far as I figured, this is caused by two different problems:

1. The `forefront` option is encoded by a negative `orderNo` value. However, we do not consider this in `listHead` and similar methods. This IMO causes the main issue with `forefront` requests not being handled differently from regular requests.
2. `RequestQueue` batching behavior - the `RequestQueue` reads requests in batches of 25 (and then processes those). Once the current batch is done, the `RequestQueue` reads another batch. This works well for the append-only idea of a queue but breaks once we add "priority" requests - those aren't added to the current batch but to the "cold" queue (actually, due to 1., they are just appended to the end of the queue).

I remember @vladfrangu did most of the work on RQv2 - are my assumptions correct here, or did I miss something? To be honest, I'm not sure how to solve this issue, really 🤷🏽 Maybe just add a special `forefront` store that keeps the `forefront`-enqueued requests aside from the actual queue? And add a performance warning to the option's documentation?
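The "special `forefront` store" idea above could be sketched roughly like this (a hypothetical model, not the actual MemoryStorage code; all names here are made up for illustration):

```typescript
// Sketch: keep forefront requests in their own list so listHead can serve
// them before the regular FIFO tail, instead of appending them to the end.
type QueuedRequest = { id: string; url: string };

class SketchRequestQueue {
  private forefront: QueuedRequest[] = []; // "priority" side store
  private regular: QueuedRequest[] = [];   // normal append-only FIFO

  addRequest(req: QueuedRequest, options?: { forefront?: boolean }): void {
    if (options?.forefront) {
      // The most recently enqueued forefront request goes first.
      this.forefront.unshift(req);
    } else {
      this.regular.push(req);
    }
  }

  // listHead drains the forefront store before touching the regular queue.
  listHead(limit: number): QueuedRequest[] {
    return [...this.forefront, ...this.regular].slice(0, limit);
  }
}
```

With this shape, a forefront request enqueued after 1000 regular ones still shows up at index 0 of the next `listHead` batch.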
Oof, nice catch. We'll either have to collect all requests, sort them, then list (which sounds super inefficient), store requests sorted in-memory (with something like insertion sort), or split forefront into its own map and go from there
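The "store requests sorted" option could look something like this (again a hypothetical sketch, assuming the convention mentioned above that forefront requests carry a negative `orderNo` and regular ones a positive one):

```typescript
// Sketch: keep the in-memory queue ordered by orderNo at insert time
// (binary search for the slot), so forefront requests (negative orderNo)
// naturally sort before regular ones and listHead can read from index 0.
type StoredRequest = { id: string; orderNo: number };

function insertSorted(queue: StoredRequest[], req: StoredRequest): void {
  let lo = 0;
  let hi = queue.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (queue[mid].orderNo <= req.orderNo) lo = mid + 1;
    else hi = mid;
  }
  queue.splice(lo, 0, req); // O(n) element shift, but listHead stays trivial
}
```

For example, enqueueing regular requests with `orderNo: Date.now()` and forefront ones with `orderNo: -Date.now()` keeps every forefront request ahead of every regular one, with newer forefront requests first.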
@prakashgp your code does not contain `Actor.init()` and `Actor.exit()` calls, so it will use the default memory storage even on the platform. If you add those, you will use the API instead of memory storage, where the `forefront` option works just fine.
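A minimal sketch of that setup, assuming the Apify v3 SDK (`Actor.init()`/`Actor.exit()` are the SDK's entry points; the crawler body and URLs are illustrative only):

```typescript
import { Actor } from 'apify';
import { CheerioCrawler } from 'crawlee';

// On the Apify platform, Actor.init() switches storage from the in-memory
// default to the platform API, where forefront behaves as documented.
await Actor.init();

const crawler = new CheerioCrawler({
    async requestHandler({ request, crawler }) {
        // Illustrative 2nd-level request pushed to the front of the queue.
        await crawler.addRequests(
            [{ url: 'https://example.com/detail' }],
            { forefront: true },
        );
    },
});

await crawler.run(['https://example.com']);

// Flushes state and tears down the platform integration.
await Actor.exit();
```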
Which package is this bug report for? If unsure which one to select, leave blank
None
Issue description
1. Add lots of initial urls to the request queue.
2. Once a url is scraped, add a 2nd-level scraping request with `forefront` = true.

Records are pushed to the dataset by the 2nd-level scraping requests, which are derived from the results of the 1st-level requests. But due to the bug and the large number of initial requests, even though we push the 2nd-level requests to the front of the queue, they still get scraped at the end. Because of this, the scraper runs for a long time without adding any records to the dataset.

Code sample
Package version
^3.11.3
Node.js version
v20.13.0
Operating system
Mac OS
Apify platform
I have tested this on the `next` release

Error: Detected incompatible Crawlee version used by the SDK. User installed 3.11.4-beta.0 but the SDK uses 3.11.3
Other context
No response