fhamborg / news-please

news-please - an integrated web crawler and information extractor for news that just works
Apache License 2.0

Configure options to optimize the crawling and extraction process #232

Closed: kvasilopoulos closed this issue 1 month ago

kvasilopoulos commented 1 year ago

Hi, I would like to know how I can configure news-please to optimize the crawling and extraction process. For example, suppose we have a machine with 4 CPUs (2 threads per CPU) and 20 websites to crawl. What are the optimal values for number_of_parallel_daemons, number_of_parallel_crawlers, and CONCURRENT_REQUESTS_PER_DOMAIN?
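To make the question concrete, here is a minimal sketch of the three options being asked about, parsed from a config excerpt with Python's standard `configparser`. The section names (`[Crawler]`, `[Scrapy]`) and all numeric values are assumptions used only for illustration; they are not recommended settings, and the helper `read_concurrency_options` is hypothetical rather than part of news-please.

```python
import configparser

# Hypothetical excerpt in the style of a news-please config.cfg.
# Section names and values are assumptions, not tuned recommendations.
SAMPLE_CFG = """
[Crawler]
number_of_parallel_crawlers = 5
number_of_parallel_daemons = 10

[Scrapy]
CONCURRENT_REQUESTS_PER_DOMAIN = 4
"""


def read_concurrency_options(cfg_text):
    """Return the three concurrency-related options as integers."""
    parser = configparser.ConfigParser()
    # Preserve key case so the upper-case Scrapy setting is found as written.
    parser.optionxform = str
    parser.read_string(cfg_text)
    return {
        "number_of_parallel_crawlers": parser.getint(
            "Crawler", "number_of_parallel_crawlers"
        ),
        "number_of_parallel_daemons": parser.getint(
            "Crawler", "number_of_parallel_daemons"
        ),
        "CONCURRENT_REQUESTS_PER_DOMAIN": parser.getint(
            "Scrapy", "CONCURRENT_REQUESTS_PER_DOMAIN"
        ),
    }


if __name__ == "__main__":
    for name, value in read_concurrency_options(SAMPLE_CFG).items():
        print(f"{name} = {value}")
```

Which values suit 8 hardware threads and 20 sites is exactly what this issue asks, so the numbers above should be read as placeholders only.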
