Jourdelune opened 4 months ago
It also uses a lot of RAM (after 4-6 hours of crawling):
[crawlee.autoscaling.autoscaled_pool] INFO current_concurrency = 183; desired_concurrency = 173; cpu = 0.0; mem = 0.0; event_loop = 0.252; client_info = 0.0
[crawlee.autoscaling.snapshotter] WARN Memory is critically overloaded. Using 7.04 GB of 7.81 GB (90%). Consider increasing available memory.
[crawlee.statistics.statistics] INFO crawlee.beautifulsoup_crawler.beautifulsoup_crawler request statistics {
  "requests_finished": 30381,
  "requests_failed": 7,
  "retry_histogram": [30374, 7, 7],
  "request_avg_failed_duration": 1.340926,
  "request_avg_finished_duration": 120.59418,
  "requests_finished_per_minute": 87,
  "requests_failed_per_minute": 0,
  "request_total_duration": 3663781.171706,
  "requests_total": 30388,
  "crawler_runtime": 20939.883378
}
Is it up to the user to limit the number of URLs added to the queue, or does the library manage that (hard limit, etc.)?
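(For reference: Crawlee crawlers expose a max_requests_per_crawl option that caps how many requests a single run will process. A minimal sketch, assuming that option is available in the installed version; the start URL and the limit value below are placeholders.)

```python
import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    # Cap the total number of requests processed in a single run; once the
    # limit is reached, remaining queued requests are left unprocessed.
    crawler = BeautifulSoupCrawler(max_requests_per_crawl=10_000)

    @crawler.router.default_handler
    async def handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')
        # Discovered links still go into the queue, but the cap above bounds
        # how many of them this run will actually fetch.
        await context.enqueue_links()

    await crawler.run(['https://example.com'])


if __name__ == '__main__':
    asyncio.run(main())
```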
Interesting. What is your total available memory?
32 GB is available on my system.
I exported my storage to Google Drive so you can test it: https://drive.google.com/file/d/1P8AgbgbVLmujiceYRtMIKK91zn9GVjen/view?usp=sharing
CRAWLEE_PURGE_ON_START=0 python test.py
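(Side note: CRAWLEE_PURGE_ON_START=0 keeps the existing storage between runs instead of wiping it at startup. A sketch of the programmatic equivalent, assuming the Configuration model exposes a purge_on_start field and the crawler accepts a configuration argument:)

```python
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
from crawlee.configuration import Configuration

# Assumed equivalent of CRAWLEE_PURGE_ON_START=0: reuse the request queue
# and datasets from the previous run instead of purging them at startup.
config = Configuration(purge_on_start=False)
crawler = BeautifulSoupCrawler(configuration=config)
```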
When there are a lot of pending requests, Crawlee is very slow.
I'm seeing slow scraping too, about 200 requests per minute, even though I self-host the webpage being scraped. There are numerous times when the scraper does nothing at all and just waits for something.
@marisancans Would you mind sharing your scraper code as well? It might help us debug.
I also have the warning "Consider increasing available memory." Is there any way for the user to control the memory allocation?
Unless you're limiting the memory usage knowingly, no, there isn't, at least not without digging deep into Crawlee's internals. Of course, if you're working with a cloud platform such as Apify, you can configure the available memory there.
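(For readers landing here: "limiting the memory usage knowingly" would look roughly like the sketch below. It assumes the CRAWLEE_MEMORY_MBYTES environment variable, documented for Crawlee's JS version, is also honored by the Python port; 4096 is an arbitrary example value.)

```python
import os

# Assumption: Crawlee reads CRAWLEE_MEMORY_MBYTES when its Configuration is
# created, so set it before constructing the crawler. The value is in MB.
os.environ['CRAWLEE_MEMORY_MBYTES'] = '4096'

# Imported after setting the env var on purpose, so the crawler's
# configuration picks the limit up.
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler

crawler = BeautifulSoupCrawler()
```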
Thank you for your response :D
Hello, I'm experiencing performance issues with my web crawler after approximately 1.5 to 2 hours of runtime. The crawling speed significantly decreases to about one site per minute or less, and I'm encountering numerous timeout errors.
Questions:
Here is the code I use:
The logs and errors: