apify / crawlee-python

Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.
https://crawlee.dev/python/
Apache License 2.0

Make the AutoscaledPool log understandable #705

Open Pijukatel opened 1 week ago

Pijukatel commented 1 week ago

AutoscaledPool periodically logs system load information in this function: AutoscaledPool._log_system_status

For example, the output looks like this:

2024-11-06T15:11:50.471Z [crawlee._autoscaling.autoscaled_pool] INFO current_concurrency = 1; desired_concurrency = 1; cpu = 0.581; mem = 0.0; event_loop = 0.227; client_info = 0.0

It shows values that are used internally by the desired_concurrency controller, but those values are hard for humans to interpret and thus not very useful to show in the log. Make this log understandable.

On the other hand, the logged values should stay connected to the values used by the controller. If the log becomes readable but detached from the controller, it is again not very useful. So there is a risk that making the log more readable would require changing the controller itself.
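
To illustrate one possible direction, here is a minimal sketch of a formatter that prints each controller input next to the limit it is compared against, so the log stays tied to what the controller sees. Everything in it is an assumption for illustration: LoadRatio and format_system_status are hypothetical names and not part of Crawlee, the limits used in the example are made up, and it assumes each logged value is a ratio in the range 0 to 1 that the controller compares against a threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoadRatio:
    """Hypothetical stand-in for one controller input (not Crawlee's actual type)."""

    name: str
    actual_ratio: float  # value the controller reads, assumed to be in [0, 1]
    limit_ratio: float  # assumed threshold above which the resource counts as overloaded

    @property
    def is_overloaded(self) -> bool:
        return self.actual_ratio > self.limit_ratio


def format_system_status(current: int, desired: int, loads: list[LoadRatio]) -> str:
    """Build one log line that shows each ratio as a percentage next to its limit."""
    parts = [f'concurrency: {current} running, {desired} desired']
    for load in loads:
        state = 'OVERLOADED' if load.is_overloaded else 'ok'
        parts.append(
            f'{load.name}: {load.actual_ratio:.1%} (limit {load.limit_ratio:.0%}, {state})'
        )
    return '; '.join(parts)


if __name__ == '__main__':
    # Ratios taken from the log line quoted in this issue; the limits are invented.
    print(
        format_system_status(
            current=1,
            desired=1,
            loads=[
                LoadRatio('cpu', 0.581, 0.4),
                LoadRatio('mem', 0.0, 0.2),
                LoadRatio('event_loop', 0.227, 0.6),
                LoadRatio('client', 0.0, 0.3),
            ],
        )
    )
```

Because the formatter only reads the same ratio and limit the controller would use, the line becomes easier to scan without introducing a second, detached set of numbers. Whether the real controller exposes its inputs in this shape would need to be checked against the code.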

See full discussion in: https://github.com/apify/crawlee-python/issues/662