apify / crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev
Apache License 2.0

Share PuppeteerPool between multiple PuppeteerCrawlers #275

Closed: mponizil closed this issue 5 years ago

mponizil commented 5 years ago

Hi there,

Having a lot of fun learning about the Apify architecture! I have a use case where I'm scraping ~20 different sites and want to manage the concurrency to each of them separately (e.g. there should always be 20 requests in flight at once, but never more than concurrencyFor(site) to the same site).

It seems like it should be feasible to create my own PuppeteerPool and have PuppeteerCrawler use that instead of always creating a new one? I suppose if I were using Apify Cloud I could just have 20 crawlers, but I'm trying to run this myself.

I may be thinking about things wrong, so any suggestions appreciated. Happy to try a PR if it makes sense. Thanks!

mponizil commented 5 years ago

This seems to work, though suspect there are other things to consider: https://github.com/mponizil/apify-js/commit/4fb9df417346cdbd35986218d854e40fcdf83efe

mnmkng commented 5 years ago

Hello there,

that's an interesting use case. Correct me if I'm wrong, but the task is to crawl 20 different sites at the same time, while being able to manage concurrency of each of the sites separately and also limit the total concurrency of all sites combined.

The important thing to realize is that concurrency is managed by the AutoscaledPool class, while PuppeteerPool only manages browsers and their tabs.

It goes like this: The crawler wants to run a new request and asks the AutoscaledPool if there are slots open. If there is a slot open, AutoscaledPool runs a new task to process the crawler's request. While processing this request, a new browser tab may or may not be needed, and a new browser may or may not be needed, depending on the configuration of PuppeteerPool. PuppeteerPool does not manage the task runs; it just provides browser resources when asked to.
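
To make that split concrete, here is a minimal sketch of a crawler setup; the concurrency options are passed through to the AutoscaledPool, while the browser options are passed to the PuppeteerPool (option names such as maxConcurrency and launchPuppeteerOptions reflect the SDK as of this writing and may change):

const Apify = require('apify');

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();

    const crawler = new Apify.PuppeteerCrawler({
        requestQueue,
        // These limits end up in the AutoscaledPool, which decides
        // how many requests run at once.
        minConcurrency: 5,
        maxConcurrency: 20,
        // This ends up in the PuppeteerPool, which only manages
        // browsers and their tabs.
        launchPuppeteerOptions: { headless: true },
        handlePageFunction: async ({ request, page }) => {
            // per-request scraping logic goes here
        },
    });

    await crawler.run();
});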

Brainstorming your use case now.

You should not run more than one AutoscaledPool per process, because it manages concurrency based on system resource readings, and taking those readings 20 times instead of once would degrade performance too much. So you'll need one AutoscaledPool for the global concurrency limit.

You'll also need just one PuppeteerPool, because having 20 pools would spawn at least 20 browsers, which doesn't make sense from a performance point of view either.

Thus, with one AutoscaledPool and one PuppeteerPool, it makes sense to use just one PuppeteerCrawler. There are other reasons that I won't list, but for example: PuppeteerCrawler cleans up after itself by destroying its PuppeteerPool, so if you managed to connect one pool to multiple crawlers, the first one to finish would destroy your pool and the others would fail.

So, to conclude, the best way to go about this would be to run all the 20 sites using one PuppeteerCrawler and implement some custom magic to manage the individual site requests using the options.gotoFunction. Are you using RequestList or RequestQueue? How does the scraping process look?

I know this is not ideal, because the handlePageFunction is gonna be one ugly 20-item switch statement, but the way the SDK is wired up now, there's no way to reliably run 20 parallel crawlers in a single environment while sharing a browser pool and individually managing concurrency.

@mtrunkat @jancurn Ideas?

mponizil commented 5 years ago

Ah I see, thanks @mnmkng. That does validate a lot of the brainstorming I've been doing. I was also wondering about the implications of 20 AutoscaledPools, so it's good to know that should be avoided. And I had also considered the 20-item switch statement, so it's helpful to know I wasn't crazy for ending up there. Seems easy enough to address readability with some standardized page-type info in userData and a sort of router using a map.
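
Something like this is what I have in mind, just as a sketch (the labels and handlers are placeholders):

// Hypothetical per-page-type handlers keyed by a label stored in userData.
const routes = {
    LIST: async ({ request, page }) => { /* enqueue detail page URLs */ },
    DETAIL: async ({ request, page }) => { /* extract item data */ },
};

const handlePageFunction = async (context) => {
    const { label } = context.request.userData;
    const handler = routes[label];
    if (!handler) throw new Error(`No route for label: ${label}`);
    return handler(context);
};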

I am using a RequestQueue, and the scraping flow is generally: visit 1-20 list pages, enqueue detail page URLs, then grab contents from all detail pages. Seems like I have two options... do some logic when popping from the RequestQueue to decide whether the URL should be processed now or requeued for later, or have some logic that stores Requests elsewhere and only adds them to the RequestQueue when there's capacity for that domain.

Appreciate the quick response and suggestions!

mponizil commented 5 years ago

The other thing I'm wondering is whether I could extend RequestQueue so that fetchNextRequest takes care of my per-site concurrency logic?

mnmkng commented 5 years ago

Yes, my thoughts exactly @mponizil. I was thinking about a simple manager class with one queue per site, where you'd drop all the collected URLs instead of enqueuing them directly. Then you would wait until you have at least one URL in each queue and enqueue the first batch of 20 URLs, one for each site, or some such.

This should net you a reasonable concurrency spread all by itself, unless you want to load some pages more than others, but that's just a minor change in the above algorithm.
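
Roughly something like this, just as a sketch (the class and its methods are made up; requestQueue.addRequest() is the actual SDK call you'd use):

// Hypothetical buffer that holds collected URLs per site and only
// pushes them into the shared RequestQueue in balanced batches.
class SiteBuffer {
    constructor(requestQueue, sites) {
        this.requestQueue = requestQueue;
        this.buffers = new Map(sites.map((site) => [site, []]));
    }

    add(site, url) {
        this.buffers.get(site).push(url);
    }

    // Enqueue one URL per site, but only once every site has at least one URL buffered.
    async flushBatch() {
        const ready = [...this.buffers.values()].every((urls) => urls.length > 0);
        if (!ready) return;
        for (const [site, urls] of this.buffers) {
            const url = urls.shift();
            await this.requestQueue.addRequest({ url, userData: { site } });
        }
    }
}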

Then you'd have to keep track of running Requests per site by keeping a count, and whenever you received a request for a site that is already at its concurrency limit, you'd just throw an Error saying that the concurrency for the given site is too high; the request would then automatically be enqueued at the end of the queue again for retrying.

There's just one little hack: make sure to decrement request.retryCount while throwing those Errors, so you don't use up your maxRequestRetries.
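
In the gotoFunction, the check could look roughly like this (getSite() and the counters are your own helpers; the limit here is just an example):

// Hypothetical in-memory counters; getSite() would map a URL to its site key.
const runningPerSite = {};
const MAX_PER_SITE = 3; // assumed per-site limit

const gotoFunction = async ({ request, page }) => {
    const site = getSite(request.url);
    if ((runningPerSite[site] || 0) >= MAX_PER_SITE) {
        // Give the retry back so this doesn't count against maxRequestRetries.
        request.retryCount--;
        throw new Error(`Concurrency for ${site} is too high, requeuing.`);
    }
    runningPerSite[site] = (runningPerSite[site] || 0) + 1;
    return page.goto(request.url);
};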

And to answer your question: normally you wouldn't be able to do that, because fetchNextRequest is simply a wrapper over the Request Queue API with some additional stuff thrown in, but since you'll be running it locally, you could always change the filesystem writes to your liking. I would not suggest that as a reasonable option, though. It's a queue; it's not built for being searched for specific items.

One last thing comes to mind. If some sites are significantly slower than others, you might end up with only one or two sites left at the end of the queue. If you want to limit the concurrency at that point, you can use the autoscaledPool.setMaxConcurrency() function to change the global limit whenever you need to.
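
For example, assuming you keep a reference to the crawler (its autoscaledPool property only exists once the crawler is running, and the variables here are placeholders):

// Tighten the global limit once only a couple of slow sites remain.
if (remainingSites <= 2) {
    crawler.autoscaledPool.setMaxConcurrency(remainingSites * MAX_PER_SITE);
}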

mponizil commented 5 years ago

Think I've nearly got it, but a few things I'm getting stuck on. Here's the code I've stubbed out thus far: https://gist.github.com/mponizil/5f932fd1a9520980ce27c7f9bf45c8a4

  1. Do you think it's necessary to use RequestQueues for each site, or a simple array is fine?
  2. You mention just adding 20 requests to kick things off. How would additional ones get added? In my code, I just add all that have been accumulated, though I suspect doing some sort of zipping would be useful to avoid a ton of unnecessary error throws and re-queueing. Did I miss something here?
  3. I'm struggling to figure out how to track running Requests per site. I increment a counter in gotoFunction, but when can I decrement it?

Really appreciate you taking the time to talk me through this. Hopefully it can be useful for somebody else as well!

mponizil commented 5 years ago

Also, perhaps you could help me understand what I'd be missing out on if I didn't use a Crawler at all? Is there more that I'm missing in the following gist besides timeouts, error handling, and auto-retries?

https://gist.github.com/mponizil/1d6cc01d95d6b22d3901dd34d871ce8d

To be clear: I suspect the things this is missing can get fairly complex and are something I'd rather lean on Apify to handle for me; I'm just wondering whether the right trade-off for my use case is to do a rudimentary rendition of that stuff myself instead.

mnmkng commented 5 years ago

The trouble with crawling at scale is that it fails all the time, so timeouts, error handling, and auto-retries are probably the most important things you could have. Sure, you can stitch it up yourself, but why bother when we've already done it for you and have quite a lot of tests to support it. You'd also be missing out on logging. Most of the logic is in BasicCrawler, so feel free to check out the source code; PuppeteerCrawler just builds on top of that.

And now to your questions:

  1. I'd just use one RequestQueue and an array per site. The arrays are only a buffer for enqueuing, so no reason to make it overly complex.
  2. You'll add 20/40/60... more each time there's at least 1/2/3... items in each of the 20 arrays. Or some tweaked version of this, if you don't expect to get the same number of links for each site.
  3. That's actually a fair point. There's no obvious way to do this. You can wrap and replace the requestQueue.markRequestHandled() AND requestQueue.reclaimRequest() functions with your own handlers. Not very clean, but it will get the job done. Something along the lines of:
// Wrap the original method so every successfully handled request also
// releases its per-site slot. decrementSiteCount() is your own helper;
// the Request being handled is available as args[0] if you need the site.
const originalFn = requestQueue.markRequestHandled;
requestQueue.markRequestHandled = async (...args) => {
    decrementSiteCount();
    return originalFn.apply(requestQueue, args);
};
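
And presumably the same wrapping for reclaimRequest, so that failed or requeued requests release their slot too:

const originalReclaim = requestQueue.reclaimRequest;
requestQueue.reclaimRequest = async (...args) => {
    decrementSiteCount();
    return originalReclaim.apply(requestQueue, args);
};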
mnmkng commented 5 years ago

Closing this since it's not an issue. Feel free to continue the discussion here.

havardox commented 6 months ago

Has the SDK changed in any way so that you no longer need a single Crawler to run multiple sites? Or is it the same?

havardox commented 6 months ago

Also @mponizil, do you mind sharing the final version of MultiSiteManager, if you still have it? I'm dealing with the exact same problem myself.