barjin opened this issue 1 week ago
Thanks for opening this! We also talked about how we need to remember the IDs of non-default unnamed queues between migrations.

My first idea would be an API like `await RequestQueue.openTemporary("some-name")`. On memory storage, we'd simply map this to `storage/request_queues/__tmp_some-name`, for example, and we'd remove it on start just like we do with `default`. On Apify, we'd have to keep a mapping of storage name => storage id in the key-value store to preserve the storages. Apart from that, it should make no difference.

The same could apply to all three storage types, not just request queues.
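To make the memory-storage half of that concrete, here's a rough sketch of what the purge-on-start behaviour could look like (the `__tmp_` prefix, the function, and everything else here are hypothetical, just mirroring the proposal above):

```ts
import { readdir, rm } from 'node:fs/promises';
import { join } from 'node:path';

// Hypothetical: temporary queues live under a reserved prefix inside
// storage/request_queues, e.g. openTemporary("some-name") would map to
// storage/request_queues/__tmp_some-name.
const TMP_PREFIX = '__tmp_';

// Purge every temporary queue on startup, the same way the `default`
// queue is purged today.
async function purgeTemporaryQueues(storageDir: string): Promise<void> {
    const queuesDir = join(storageDir, 'request_queues');
    const entries = await readdir(queuesDir, { withFileTypes: true });
    await Promise.all(
        entries
            .filter((entry) => entry.isDirectory() && entry.name.startsWith(TMP_PREFIX))
            .map((entry) => rm(join(queuesDir, entry.name), { recursive: true, force: true })),
    );
}
```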
Thanks for the points!

> `await RequestQueue.openTemporary("some-name")`

I like it, apart from the fact that this would make `RequestQueue.openTemporary("default")` and `RequestQueue.open()` (and `RequestQueue.open(null)`) equal... I'm not sure if it's a bad thing; right now, I think it might be confusing.

> On Apify, we'd have to keep a mapping of storage name => storage id in the KVS

Yeah, that's the one part I don't really like (to open an unnamed KVS, you'd first need to open the default unnamed KVS), but I'm afraid there is no way around it.
> > `await RequestQueue.openTemporary("some-name")`
>
> I like it, apart from the fact that this would make `RequestQueue.openTemporary("default")` and `RequestQueue.open()` (and `RequestQueue.open(null)`) equal... I'm not sure if it's a bad thing; right now, I think it might be confusing.

Yeah, the storage thing is pretty confusing as a whole. To boot, the name "default" has special meaning in memory storage (which is filesystem-backed, obviously), but not on Apify. So, maybe we could just disable `RequestQueue.openTemporary("default")` with `throw new Error("You don't want this, trust me bro")`.

Also, I'm not married to the name `openTemporary`; I'm sure we could come up with something better.
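Just to illustrate the guard, a hypothetical wrapper could look like the sketch below (none of this is an existing Crawlee API, and on the platform the real implementation would still need the name => id mapping discussed above):

```ts
import { RequestQueue } from 'crawlee';

// Hypothetical wrapper sketching the proposed behaviour: reject the reserved
// "default" name to avoid the ambiguity with RequestQueue.open() /
// RequestQueue.open(null); otherwise open a queue under the proposed
// __tmp_ prefix (local-storage naming only).
async function openTemporary(name: string): Promise<RequestQueue> {
    if (name === 'default') {
        throw new Error('openTemporary("default") is not supported; use RequestQueue.open() instead.');
    }
    return RequestQueue.open(`__tmp_${name}`);
}
```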
> > On Apify, we'd have to keep a mapping of storage name => storage id in the KVS
>
> Yeah, that's the one part I don't really like (to open an unnamed KVS, you'd first need to open the default unnamed KVS), but I'm afraid there is no way around it.
I mean, we already persist the state of multiple random components into the default key-value store, so I have no issue with that.
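For illustration, the mapping could be a single record in the default KVS along these lines (the record name `TEMPORARY_STORAGES`, the helper, and the shape are made up here, not an existing convention):

```ts
import { Actor } from 'apify';

// Hypothetical record in the default key-value store that maps temporary
// storage names to the IDs of the unnamed platform storages backing them,
// so they survive migrations. Record name and shape are made up.
interface TemporaryStorageMapping {
    requestQueues: Record<string, string>;
    keyValueStores: Record<string, string>;
    datasets: Record<string, string>;
}

async function rememberTemporaryQueue(name: string, id: string): Promise<void> {
    const defaultStore = await Actor.openKeyValueStore();
    const mapping = (await defaultStore.getValue<TemporaryStorageMapping>('TEMPORARY_STORAGES'))
        ?? { requestQueues: {}, keyValueStores: {}, datasets: {} };
    mapping.requestQueues[name] = id;
    await defaultStore.setValue('TEMPORARY_STORAGES', mapping);
}
```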
There are often reasons to make multiple separate RQs in one Crawlee project (e.g., having a `CheerioCrawler` for processing most of the pages and a separate keep-alive `PlaywrightCrawler` instance for processing some specific pages the first crawler finds). Supporting this use case without both crawlers reading the same queue is currently possible only with named queues (e.g., `RequestQueue.open('playwrightQueue')`).

The named queues, however, don't get purged with a new script run, so in any subsequent run, the `PlaywrightCrawler` might skip some requests (due to the implicit RQ request deduplication). This forces users to run the script with `rm -rf ./storage && npm start` or similar "hacks"; otherwise, repeated runs yield different results.
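For context, a minimal sketch of the setup described above (the URL, selector, and queue name are placeholders, and the two crawlers run sequentially here instead of the keep-alive arrangement, just to keep the example short):

```ts
import { CheerioCrawler, PlaywrightCrawler, RequestQueue } from 'crawlee';

// A separate, named queue for the Playwright crawler: currently the only way
// to keep the two crawlers from reading the same queue. Because the queue is
// named, it is NOT purged between runs.
const playwrightQueue = await RequestQueue.open('playwrightQueue');

const cheerioCrawler = new CheerioCrawler({
    async requestHandler({ enqueueLinks }) {
        // Process most pages here; hand the "special" ones (placeholder
        // selector) over to the Playwright queue.
        await enqueueLinks({ selector: 'a.needs-browser', requestQueue: playwrightQueue });
    },
});

const playwrightCrawler = new PlaywrightCrawler({
    requestQueue: playwrightQueue,
    async requestHandler({ page }) {
        // Process the special pages in a real browser.
        console.log(await page.title());
    },
});

await cheerioCrawler.run(['https://example.com']);
// On a second run of the script, the requests enqueued above may already exist
// in the un-purged named queue and get deduplicated, so this crawler never
// sees them again.
await playwrightCrawler.run();
```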
Moreover, the Apify API supports creating multiple unnamed queues. The named-queue solution is even more problematic on the Apify Platform, since named queues created by Apify Actors are stored indefinitely on the user's account, causing users to (often unknowingly) spend credits on storage.