Large crawlers fill up the queue and run out of memory when running in sync mode

alephdata / memorious

Lightweight web scraping toolkit for documents and structured data.

https://docs.alephdata.org/developers/memorious

MIT License


Large crawlers fill up the queue and run out of memory when running in sync mode #165

Closed: sunu closed this issue 3 years ago

sunu commented 3 years ago

In sync mode, memorious executes tasks linearly in a single thread. A crawler stage that loops through a large number of entries puts one task per entry on the queue, and it must finish that loop before execution can move on to the next stage and start pulling tasks out of the queue.

As a result, large crawlers often run out of memory when running in sync mode.

One solution to this problem is to execute tasks depth-first in sync mode: run each task as soon as it is emitted instead of putting it on the task queue.
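To illustrate the difference, here is a minimal sketch in plain Python. It does not use the memorious API; `run_queued`, `run_depth_first`, and the `"fetch"` stage name are hypothetical and only model the behaviour described above. It compares how large the queue grows when every entry is queued first with what happens when each task is run as soon as it is emitted.

```python
from collections import deque


def run_queued(n_entries):
    """Model of the problem: the first stage queues every entry
    before the single thread gets to consume any of them."""
    queue = deque()
    peak = 0
    processed = []
    # Stage 1: loop over all entries, one queued task per entry.
    for i in range(n_entries):
        queue.append(("fetch", i))
        peak = max(peak, len(queue))
    # Stage 2: only now are tasks pulled out of the queue.
    while queue:
        _stage, i = queue.popleft()
        processed.append(i)
    return processed, peak


def run_depth_first(n_entries):
    """Model of the proposed fix: each emitted task is executed
    immediately, so nothing accumulates on a queue."""
    processed = []

    def fetch(i):
        processed.append(i)

    for i in range(n_entries):
        fetch(i)  # run the next stage right away instead of queueing it
    return processed, 0  # no queue is used, so its peak size is 0
```

In this model the queued variant holds as many pending tasks as there are entries, while the depth-first variant holds none. Both process the entries in the same order.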

sunu commented 3 years ago

Sync mode is now the default mode of execution and supports multiple threads. If a large crawler is still run in single-threaded mode, please make sure to use tail call recursion in the pipeline, so that a stage queues its own next step rather than looping over every entry at once, to avoid queue OOM issues.
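A rough sketch of what the tail call pattern can look like, again in plain Python rather than the memorious API. The `run_tail_call` function and the `"sequence"` and `"fetch"` stage names are hypothetical. The idea is that the generating stage handles one entry, emits it to the next stage, and then re-queues itself for the following entry as its last action, so that only a constant number of tasks are pending at any time.

```python
from collections import deque


def run_tail_call(n_entries):
    """Single-threaded worker loop where the generating stage
    re-queues itself instead of looping over all entries."""
    queue = deque([("sequence", 0)])
    peak = len(queue)
    processed = []
    while queue:
        stage, i = queue.popleft()
        if stage == "sequence":
            if i >= n_entries:
                continue  # end of the sequence: nothing more to emit
            queue.append(("fetch", i))         # emit one entry downstream
            queue.append(("sequence", i + 1))  # tail call: schedule the next step
        else:
            processed.append(i)  # the downstream stage does its work
        peak = max(peak, len(queue))
    return processed, peak
```

In this model the queue never holds more than two tasks, regardless of how many entries the crawler walks through.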