apify / crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev
Apache License 2.0

feat: creating multiple unnamed queues #2752

Open barjin opened 1 week ago

barjin commented 1 week ago

There are often reasons to create several separate request queues (RQs) in one Crawlee project, e.g., a CheerioCrawler that processes most of the pages plus a separate keep-alive PlaywrightCrawler instance that processes some specific pages the first crawler finds.

Supporting this use case without both crawlers reading the same queue is currently possible only with named queues (e.g., RequestQueue.open('playwrightQueue')).

Named queues, however, are not purged when the script is run again, so in any subsequent run the PlaywrightCrawler might skip some requests (because the RQ implicitly deduplicates requests it has already seen). This forces users to run the script with rm -rf ./storage && npm start, or similar "hacks".

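import { CheerioCrawler, RequestQueue } from 'crawlee';

// open a secondary (named) queue; it persists in ./storage between runs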
const secondaryRQ = await RequestQueue.open('Bqueue');

const crawlerA = new CheerioCrawler({
  // use the default queue with crawlerA, and add requests to the secondary queue
  requestHandler: async ({ request }) => {
    console.log(`[A] ${request.url}`);

    await secondaryRQ.addRequest({ url: request.url });
  }
});

const crawlerB = new CheerioCrawler({
  // consume the secondary queue
  requestQueue: secondaryRQ,
  requestHandler: ({ request }) => {
    console.log(`[B] ${request.url}`);
  },
});

await crawlerA.run(['http://example.com']);
await crawlerB.run();

Repeated runs yield different results:

$ npx tsx ./a.ts 

[A] http://example.com
INFO  CheerioCrawler: Finished! Total 1 requests: 1 succeeded, 0 failed. {"terminal":true}
[B] http://example.com
INFO  CheerioCrawler: Finished! Total 1 requests: 1 succeeded, 0 failed. {"terminal":true}

$ npx tsx ./a.ts 

[A] http://example.com
INFO  CheerioCrawler: Finished! Total 1 requests: 1 succeeded, 0 failed. {"terminal":true}
INFO  CheerioCrawler: Finished! Total 0 requests: 0 succeeded, 0 failed. {"terminal":true}
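The difference between the two runs can be reproduced with a toy model. This is only a sketch and not Crawlee's implementation: `ToyQueue`, `disk`, and `runScript` are hypothetical names, the `disk` map stands in for the ./storage directory that survives between runs, and the URL itself is assumed to be the deduplication key.

```typescript
// Toy model of a persistent queue; NOT Crawlee's implementation.
// `disk` stands in for the ./storage directory that survives between runs.
const disk = new Map<string, Set<string>>();

class ToyQueue {
  private readonly seen: Set<string>;

  constructor(name: string | null) {
    if (name === null) {
      // Unnamed (default) queue: purged on start, so it begins empty each run.
      this.seen = new Set<string>();
    } else {
      // Named queue: reuse whatever keys earlier runs left behind.
      if (!disk.has(name)) disk.set(name, new Set<string>());
      this.seen = disk.get(name)!;
    }
  }

  // Returns true if the request was enqueued, false if it was deduplicated.
  addRequest(url: string): boolean {
    if (this.seen.has(url)) return false;
    this.seen.add(url);
    return true;
  }
}

// One "script run": open the named queue and try to enqueue the same URL.
function runScript(): boolean {
  const secondary = new ToyQueue('Bqueue');
  return secondary.addRequest('http://example.com');
}

const firstRun = runScript();  // the URL is new, so it is enqueued
const secondRun = runScript(); // the key is still on "disk", so it is skipped
console.log({ firstRun, secondRun });
```

In this model the second run has nothing to process for the named queue, which matches the "Total 0 requests" line in the second output above.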

Moreover, the Apify API already supports creating multiple unnamed queues. The named-queue workaround is even more problematic on the Apify Platform, because named queues created by Apify Actors are stored indefinitely on the user's account, so users often spend credits on storage without knowing it.

janbuchar commented 1 week ago

Thanks for opening this! We also talked about how we need to remember IDs of non-default unnamed queues between migrations.

My first idea would be an API like await RequestQueue.openTemporary("some-name"). With memory storage, we'd simply map this to storage/request_queues/__tmp_some-name, for example, and we'd remove it on start just like we do with the default storage. On Apify, we'd have to keep a mapping of storage name => storage id in the key-value store to preserve the storages. Apart from that, it should make no difference.

The same could apply to all three storage types, not just request queues.
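The memory-storage half of this idea could look roughly like the sketch below. It is not an existing Crawlee API: `temporaryQueueDir`, `purgeTemporaryQueues`, and `TMP_PREFIX` are hypothetical names, the `__tmp_` prefix is taken from the example path above, and the demo runs against a throwaway directory instead of a real ./storage folder.

```typescript
import * as fs from 'node:fs';
import * as os from 'node:os';
import * as path from 'node:path';

// Hypothetical prefix, taken from the example path in the comment above.
const TMP_PREFIX = '__tmp_';

// Map a temporary queue name to its directory: "some-name" -> <root>/__tmp_some-name.
function temporaryQueueDir(queuesRoot: string, name: string): string {
  return path.join(queuesRoot, `${TMP_PREFIX}${name}`);
}

// On start, delete every temporary queue directory and leave named queues alone.
// Returns the names of the directories that were removed.
function purgeTemporaryQueues(queuesRoot: string): string[] {
  if (!fs.existsSync(queuesRoot)) return [];
  const purged: string[] = [];
  for (const entry of fs.readdirSync(queuesRoot, { withFileTypes: true })) {
    if (entry.isDirectory() && entry.name.startsWith(TMP_PREFIX)) {
      fs.rmSync(path.join(queuesRoot, entry.name), { recursive: true, force: true });
      purged.push(entry.name);
    }
  }
  return purged;
}

// Demo: a throwaway root with one temporary queue and one regular named queue.
const queuesRoot = fs.mkdtempSync(path.join(os.tmpdir(), 'request_queues-'));
fs.mkdirSync(temporaryQueueDir(queuesRoot, 'some-name'));
fs.mkdirSync(path.join(queuesRoot, 'Bqueue'));

const removedDirs = purgeTemporaryQueues(queuesRoot);
console.log(removedDirs); // only the __tmp_ directory is listed
```

Keeping the prefix check in one place means a named queue such as Bqueue is never touched by the purge, which is the property the named-queue workaround lacks today.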

barjin commented 1 week ago

Thanks for the points!

> await RequestQueue.openTemporary("some-name")

I like it, apart from the fact that this would make RequestQueue.openTemporary("default") and RequestQueue.open() (and RequestQueue.open(null)) equal... I'm not sure if it's a bad thing; right now, I think it might be confusing.

> On Apify, we'd have to keep a mapping of storage name => storage id in the KVS

Yeah, that's the one part I don't really like (to open an unnamed KVS, you'd first need to open the default unnamed KVS), but I'm afraid there is no way around it.

janbuchar commented 1 week ago

> > await RequestQueue.openTemporary("some-name")
>
> I like it, apart from the fact that this would make RequestQueue.openTemporary("default") and RequestQueue.open() (and RequestQueue.open(null)) equal... I'm not sure if it's a bad thing; right now, I think it might be confusing.

Yeah, the storage thing is pretty confusing as a whole. To boot, the name "default" has special meaning in memory storage (which is filesystem-backed, obviously), but not on Apify. So, maybe we could just disable RequestQueue.openTemporary("default") with throw new Error("You don't want this, trust me bro").

Also, I'm not married to the name openTemporary, I'm sure we could come up with something better.

> > On Apify, we'd have to keep a mapping of storage name => storage id in the KVS
>
> Yeah, that's the one part I don't really like (to open an unnamed KVS, you'd first need to open the default unnamed KVS), but I'm afraid there is no way around it.

I mean, we already persist the state of multiple random components into the default key-value store, so I have no issue with that.
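The name => ID mapping discussed in this thread could be sketched as follows. Everything here is an assumption made for illustration: `KeyValueStoreLike`, `InMemoryKvs`, `createUnnamedQueue`, `openTemporary`, and the `TEMPORARY_QUEUE_IDS` key are hypothetical and are not real Apify or Crawlee APIs. The calls are synchronous for brevity, whereas real storage calls would be asynchronous.

```typescript
// Minimal stand-ins for the platform; none of these are real Apify or Crawlee APIs.
type Entries = [string, string][]; // [temporary name, storage ID] pairs

interface KeyValueStoreLike {
  getValue(key: string): Entries | undefined;
  setValue(key: string, value: Entries): void;
}

class InMemoryKvs implements KeyValueStoreLike {
  private readonly data = new Map<string, Entries>();
  getValue(key: string): Entries | undefined {
    return this.data.get(key);
  }
  setValue(key: string, value: Entries): void {
    this.data.set(key, value.slice());
  }
}

// Hypothetical record key under which the mapping is persisted.
const MAPPING_KEY = 'TEMPORARY_QUEUE_IDS';

let nextId = 0;
// Pretends to create an unnamed queue on the platform and returns its ID.
function createUnnamedQueue(): string {
  nextId += 1;
  return `queue-${nextId}`;
}

// Returns the storage ID for a temporary name, creating the queue on first use.
function openTemporary(kvs: KeyValueStoreLike, name: string): string {
  if (name === 'default') {
    // Avoids the ambiguity with RequestQueue.open() raised earlier in the thread.
    throw new Error('"default" is reserved for the default storage');
  }
  const mapping = new Map<string, string>(kvs.getValue(MAPPING_KEY) ?? []);
  const existing = mapping.get(name);
  if (existing !== undefined) return existing;

  const id = createUnnamedQueue();
  mapping.set(name, id);
  kvs.setValue(MAPPING_KEY, Array.from(mapping.entries()));
  return id;
}

// Demo: the same name resolves to the same ID, e.g. before and after a migration.
const kvs = new InMemoryKvs();
const beforeMigration = openTemporary(kvs, 'playwright');
const afterMigration = openTemporary(kvs, 'playwright');
const otherQueue = openTemporary(kvs, 'other');
console.log(beforeMigration === afterMigration, otherQueue);
```

Because the mapping lives in the key-value store rather than in process memory, a restarted run can find the unnamed queue it created earlier instead of creating a new one.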