Closed · andreawwenyi closed this issue 4 years ago
This is good. I even think we should change our update routine from running updates for all articles in one spider to running multiple spiders, each responsible for updating one site, like what we have in the discover routine. The advantage:

We set `CONCURRENT_REQUESTS=1` project-wide. Since we run a distinct spider for each site in the discover routine, this limits our concurrent requests to one request per site, while all sites are still crawled in parallel. That is ideal for the discover routine. However, it means our update routine is considerably slower than it could be, because there is no site-wise parallelism.

Update:
- Updating one article is implemented in `article.py` and can be executed as `$ python article.py update {article_id}`.
- Updating the articles from one site is implemented in `site.py` and can be executed as `$ python site.py update {site_id}`.

(#57)
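The concurrency model the proposal is after — requests to one site handled one at a time, while different sites proceed in parallel — can be sketched in plain Python. This is an illustration, not the actual Scrapy setup; `fetch_snapshot`, `update_site`, and `update_all` are made-up names:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_snapshot(site_id, article_id):
    # Placeholder for the real HTTP fetch of one article.
    return f"{site_id}/{article_id}"

def update_site(site_id, article_ids):
    # Within one site, articles are fetched one at a time --
    # the per-site equivalent of CONCURRENT_REQUESTS=1.
    return [fetch_snapshot(site_id, a) for a in article_ids]

def update_all(sites):
    # One worker per site, so different sites are updated in parallel.
    with ThreadPoolExecutor(max_workers=len(sites)) as pool:
        futures = {s: pool.submit(update_site, s, ids)
                   for s, ids in sites.items()}
        return {s: f.result() for s, f in futures.items()}
```

With one update spider per site, total wall-clock time is bounded by the slowest single site rather than the sum of all sites, which is the speedup the proposal describes.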
Currently, the `update_content_spider` will search for all articles that need another snapshot. Some enhancements would be nice:

- update a single article by `article_id`
- update the articles from a single site by `site_id`
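A minimal sketch of what such filtering could look like, assuming articles carry a last-snapshot timestamp; the function, field names, and interval are illustrative assumptions, not the real spider's query:

```python
from datetime import datetime, timedelta

def articles_to_update(articles, now, interval=timedelta(days=1),
                       article_id=None, site_id=None):
    # Articles whose last snapshot is older than `interval` are due.
    due = [a for a in articles if now - a["last_snapshot"] >= interval]
    # Optional narrowing, matching the enhancements above.
    if article_id is not None:
        due = [a for a in due if a["id"] == article_id]
    if site_id is not None:
        due = [a for a in due if a["site_id"] == site_id]
    return due
```

With both filters optional, the same selection logic can back the default "update everything" run as well as the `article.py update` and `site.py update` commands.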