Closed · andreawwenyi closed this issue 4 years ago
This is good. I even think we should change our update routine from running updates for all articles in one spider to running multiple spiders, each responsible for updating one site, like what we have in the discover routine. The advantage:

We set `CONCURRENT_REQUESTS=1` project-wide. Since we run a distinct spider for each site in the discover routine, this limits our concurrent requests to one request per site, while all sites are still crawled in parallel. That is ideal for the discover routine. However, it means our update routine is considerably slower than it could be, because there is no site-wise parallelism.

Update:
- Updating one article is implemented in `article.py` and can be executed as `$ python article.py update {article_id}`.
- Updating the articles from one site is implemented in `site.py` and can be executed as `$ python site.py update {site_id}`.

(#57)
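The concurrency model the proposal is after — requests to one site handled one at a time, while different sites proceed in parallel — can be sketched in plain Python. This is an illustration, not the actual Scrapy setup; `fetch_snapshot`, `update_site`, and `update_all` are made-up names:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_snapshot(site_id, article_id):
    # Placeholder for the real HTTP fetch of one article.
    return f"{site_id}/{article_id}"

def update_site(site_id, article_ids):
    # Within one site, articles are fetched one at a time --
    # the per-site equivalent of CONCURRENT_REQUESTS=1.
    return [fetch_snapshot(site_id, a) for a in article_ids]

def update_all(sites):
    # One worker per site, so different sites are updated in parallel.
    with ThreadPoolExecutor(max_workers=len(sites)) as pool:
        futures = {s: pool.submit(update_site, s, ids)
                   for s, ids in sites.items()}
        return {s: f.result() for s, f in futures.items()}
```

With one update spider per site, total wall-clock time is bounded by the slowest single site rather than the sum of all sites, which is the speedup the proposal describes.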
Currently, the `update_content_spider` will search for all articles that need another snapshot. Some enhancements would be nice:

- update a single article by `article_id`
- update the articles from a single site by `site_id`
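A minimal sketch of what such filtering could look like, assuming articles carry a last-snapshot timestamp; the function, field names, and interval are illustrative assumptions, not the real spider's query:

```python
from datetime import datetime, timedelta

def articles_to_update(articles, now, interval=timedelta(days=1),
                       article_id=None, site_id=None):
    # Articles whose last snapshot is older than `interval` are due.
    due = [a for a in articles if now - a["last_snapshot"] >= interval]
    # Optional narrowing, matching the enhancements above.
    if article_id is not None:
        due = [a for a in due if a["id"] == article_id]
    if site_id is not None:
        due = [a for a in due if a["site_id"] == site_id]
    return due
```

With both filters optional, the same selection logic can back the default "update everything" run as well as the `article.py update` and `site.py update` commands.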