apify / crawlee-python

Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev/python/
Apache License 2.0
4.73k stars 322 forks source link

autoscaled_pool.desired_concurrency_ratio is useless #759

Open Pijukatel opened 5 days ago

Pijukatel commented 5 days ago

It seems to me that autoscaled_pool.desired_concurrency_ratio is currently completely useless.

It is in init here: https://github.com/apify/crawlee-python/blob/master/src/crawlee/_autoscaling/autoscaled_pool.py#L57

It is checked for bounds 0> desired_concurrency_ratio >=1 https://github.com/apify/crawlee-python/blob/master/src/crawlee/_autoscaling/autoscaled_pool.py#L97

It is used in only one place: https://github.com/apify/crawlee-python/blob/master/src/crawlee/_autoscaling/autoscaled_pool.py#L198

And from there in condition here: https://github.com/apify/crawlee-python/blob/master/src/crawlee/_autoscaling/autoscaled_pool.py#L202

That condition will always be true for currently runtime enforced values 0> desired_concurrency_ratio >=1:

self.current_concurrency >= math.floor(self._desired_concurrency_ratio * self.current_concurrency)

Did I miss something?

janbuchar commented 4 days ago

Before we look deeper into this, did you cross-check this against the JS version? Does it do the same thing there?

Pijukatel commented 4 days ago

I think it is exactly the same

Same bounds: https://github.com/apify/crawlee/blob/master/packages/core/src/autoscaling/autoscaled_pool.ts#L226

Same usage: https://github.com/apify/crawlee/blob/30d0bf807f110b2bacf74d8099e9565edb509df7/packages/core/src/autoscaling/autoscaled_pool.ts#L598

vikyw89 commented 1 day ago

In js concurrency setting works, in here it doesn't

Pijukatel commented 14 hours ago

Yes, in JS it is different. I read it incorrectly earlier. So it seems like there was a mistake when porting the code to Python.

JS: const minCurrentConcurrency = Math.floor(this._desiredConcurrency * this.desiredConcurrencyRatio);

Python: min_current_concurrency = math.floor(self._desired_concurrency_ratio * self.current_concurrency)

janbuchar commented 12 hours ago

Hm, then I might have made a mistake :slightly_smiling_face: Seems like an easy fix, right?

Pijukatel commented 9 hours ago

Yes, I will align the Python version with the JS version.