apify / crawlee

Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev
Apache License 2.0
15.84k stars 683 forks

Deprecate minConcurrency in Crawler options, replace it with desiredConcurrency #1746

Open metalwarrior665 opened 1 year ago

metalwarrior665 commented 1 year ago

Which package is this bug report for? If unsure which one to select, leave blank

@crawlee/basic (BasicCrawler)

Issue description

minConcurrency is a dangerous option, and desiredConcurrency is almost always the better choice. The problem is that minConcurrency is an option at the crawler level while desiredConcurrency is not, so users naturally tend to reach for the former.

So I would deprecate minConcurrency at the crawler level and leave it only in autoscaledPool, while exposing desiredConcurrency at the crawler level, since it is useful.
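To make the concern concrete, here is a toy model of a pool that steps its concurrency down under load. It is a sketch under stated assumptions, not Crawlee's actual AutoscaledPool code: the `ToyPoolOptions` type, the `nextConcurrency` and `simulateOverload` helpers, and the step size of 1 are all made up for illustration. The only behavior it assumes is that the pool never scales below minConcurrency.

```typescript
// Toy model only: NOT Crawlee's AutoscaledPool implementation.
interface ToyPoolOptions {
    minConcurrency: number;
    maxConcurrency: number;
    // Starting point for the pool (the value this issue suggests exposing on the crawler).
    desiredConcurrency: number;
}

// One scaling tick: step down by 1 when overloaded, up by 1 otherwise,
// then clamp the result to [minConcurrency, maxConcurrency].
function nextConcurrency(current: number, overloaded: boolean, opts: ToyPoolOptions): number {
    const proposed = overloaded ? current - 1 : current + 1;
    return Math.min(opts.maxConcurrency, Math.max(opts.minConcurrency, proposed));
}

// Run a fixed number of ticks under constant overload and return the final concurrency.
function simulateOverload(opts: ToyPoolOptions, ticks: number): number {
    let current = opts.desiredConcurrency;
    for (let i = 0; i < ticks; i++) {
        current = nextConcurrency(current, true, opts);
    }
    return current;
}

// Same starting point (20), but only the second pool is allowed to back off.
const pinned = simulateOverload({ minConcurrency: 20, maxConcurrency: 50, desiredConcurrency: 20 }, 10);
const flexible = simulateOverload({ minConcurrency: 1, maxConcurrency: 50, desiredConcurrency: 20 }, 10);

console.log({ pinned, flexible });
```

In this model the pool started via a high minConcurrency cannot back off when the system is overloaded, while the pool started via desiredConcurrency begins at the same level and is still free to scale down.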

One side note: desiredConcurrency is not a great name, as it only affects where the concurrency starts. Something like initialConcurrency would be better, so we could do a rename with deprecation as well.
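A rename with deprecation could look roughly like the sketch below. Everything here is hypothetical: initialConcurrency is only a proposed name, and the `ConcurrencyOptions` type and `resolveInitialConcurrency` helper are invented for illustration and are not part of Crawlee.

```typescript
// Hypothetical option shape: initialConcurrency is only a proposed name in this issue.
interface ConcurrencyOptions {
    initialConcurrency?: number;
    /** @deprecated Hypothetical deprecation: use initialConcurrency instead. */
    desiredConcurrency?: number;
}

// Resolve the starting concurrency, accepting the old name for backward compatibility.
// The warn callback is injectable so callers (and tests) can capture the message.
function resolveInitialConcurrency(
    options: ConcurrencyOptions,
    warn: (message: string) => void = console.warn,
): number | undefined {
    if (options.desiredConcurrency !== undefined) {
        warn('desiredConcurrency is deprecated, use initialConcurrency instead.');
    }
    // The new name wins when both are provided.
    return options.initialConcurrency ?? options.desiredConcurrency;
}
```

Existing code that passes desiredConcurrency would keep working and only see a warning, which leaves room to remove the old name in a later major version.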

Code sample

No response

Package version

any

Node.js version

any

Operating system

No response

Apify platform

I have tested this on the next release

No response

Other context

No response

B4nan commented 1 year ago

I like the proposal to rename it to initialConcurrency. We could rename it right away while keeping the old name for backward compatibility, and clean this up in the next major version. Btw, minConcurrency is taken into account only when downscaling; that's the confusing part, right? The example in the docs also seems to be incorrect (the comment above minConcurrency).

https://crawlee.dev/docs/guides/scaling-crawlers#minconcurrency-and-maxconcurrency