apify / crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev
Apache License 2.0

Set desiredConcurrency based on type of crawler and available memory #1788

Open metalwarrior665 opened 1 year ago

metalwarrior665 commented 1 year ago

Which package is the feature request for? If unsure which one to select, leave blank

None

Feature

Unless you are running with very low memory, you usually want the scraper to start with some concurrency already, to speed up the initial part of the scrape. This is even more important if the scrape is very short; in that case the crawler might not even have a chance to scale up.

We can make a conservative mapping of the desiredConcurrency based on available memory.

Motivation

As above

Ideal solution or implementation, and any additional constraints

Example solution here: https://github.com/apify-projects/store-crawler-google-places/blob/master/src/utils/misc-utils.ts#L485

We probably don't want to silently reduce maxConcurrency like the example does.
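
To make the idea concrete, here is a minimal sketch of such a mapping (the helper name, the per-request memory estimates, and the option shape are assumptions for illustration, not an existing Crawlee API). It derives a conservative starting concurrency from memory and, instead of silently reducing maxConcurrency, only warns when the estimate would exceed it:

```ts
import os from 'node:os';

// Hypothetical helper: derive a conservative starting concurrency from memory.
// The per-request memory estimates below are guesses for illustration,
// not values taken from Crawlee.
interface ConcurrencyEstimateOptions {
    crawlerType: 'http' | 'browser';
    maxConcurrency?: number;
    availableMemoryBytes?: number; // override for containers with a known limit
}

export function estimateDesiredConcurrency(options: ConcurrencyEstimateOptions): number {
    const { crawlerType, maxConcurrency, availableMemoryBytes } = options;

    // os.totalmem() reports the host memory; inside a container, a configured
    // limit should be passed in instead, since cgroup limits are not reflected here.
    const memoryBytes = availableMemoryBytes ?? os.totalmem();
    const memoryGb = memoryBytes / 1024 ** 3;

    // Rough per-request memory estimates: browser pages are far heavier than plain HTTP.
    const estimatedGbPerRequest = crawlerType === 'browser' ? 1 : 0.1;
    const estimate = Math.max(1, Math.floor(memoryGb / estimatedGbPerRequest));

    // Do not silently reduce maxConcurrency; just never start above it.
    if (maxConcurrency !== undefined && estimate > maxConcurrency) {
        console.warn(
            `Estimated desiredConcurrency (${estimate}) exceeds maxConcurrency (${maxConcurrency}); starting at maxConcurrency.`,
        );
        return maxConcurrency;
    }
    return estimate;
}
```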

Alternative solutions or implementations

No response

Other context

No response

B4nan commented 1 year ago

We already do that based on the crawler type; cheerio sets the default to 10 (#1428).

Note that the solution you proposed seems quite tied to the Apify platform; we should be careful with that. Crawlee needs to work out of the box in other environments too, and I am not sure how common our memory/CPU ratio ("4 GB = 1 CPU") is elsewhere.

But maybe it will work just fine; we just need to test it carefully. It looks like the proposal sets the concurrency to half the memory in GB, which is much less aggressive than the static 10 we have now for Cheerio, but on the other hand it could be too aggressive for a browser crawler.
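
For illustration, a value computed that way could already be plugged into the existing crawler options, since desiredConcurrency is an AutoscaledPool option that can be passed through autoscaledPoolOptions. A sketch, with estimateDesiredConcurrency being the hypothetical helper from above and the maxConcurrency values picked arbitrarily:

```ts
import { CheerioCrawler, PlaywrightCrawler } from 'crawlee';
// estimateDesiredConcurrency is the hypothetical helper sketched earlier in this issue.
import { estimateDesiredConcurrency } from './estimate-concurrency.js';

// HTTP crawler: the same memory allows a higher starting concurrency.
const cheerioCrawler = new CheerioCrawler({
    maxConcurrency: 100,
    autoscaledPoolOptions: {
        desiredConcurrency: estimateDesiredConcurrency({ crawlerType: 'http', maxConcurrency: 100 }),
    },
    async requestHandler({ request, $ }) {
        console.log(`${request.url}: ${$('title').text()}`);
    },
});

// Browser crawler: the same memory maps to a much lower starting concurrency.
const playwrightCrawler = new PlaywrightCrawler({
    maxConcurrency: 20,
    autoscaledPoolOptions: {
        desiredConcurrency: estimateDesiredConcurrency({ crawlerType: 'browser', maxConcurrency: 20 }),
    },
    async requestHandler({ request, page }) {
        console.log(`${request.url}: ${await page.title()}`);
    },
});

await cheerioCrawler.run(['https://crawlee.dev']);
```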

metalwarrior665 commented 1 year ago

The code is more of an example than a proposal, since it was tailored to Google Maps, which is quite heavy. We can make it scale a bit higher for Cheerio and a bit lower for the browser to be on the safer side.

I guess we are not able to measure CPU allocation (from inside a local Docker container and such), so memory would serve as an approximation. On most cloud platforms, the two should scale up more or less together.
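
As a small illustration of that limitation, the values Node exposes through the os module typically describe the host rather than the container's cgroup limits, so they cannot be trusted as the allocated CPU quota (sketch only):

```ts
import os from 'node:os';

// Inside a Docker container these typically report the host machine rather
// than the cgroup limits, so the CPU count in particular does not reflect
// the allocated quota; an explicitly configured memory limit is a more
// usable signal for sizing concurrency.
console.log(`CPUs visible to Node: ${os.cpus().length}`);
console.log(`Memory visible to Node: ${(os.totalmem() / 1024 ** 3).toFixed(1)} GB`);
```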

@mnmkng can probably chime in since he was setting up the initial auto-scaling.