disinfoRG / ZeroScraper

Web scraper made by 0archive.
https://0archive.tw
MIT License

In-memory filter to exclude duplicate articles #79

Closed: pm5 closed this 4 years ago

pm5 commented 4 years ago

This implements the first solution from #41 for articles and Dcard posts. I believe we can trade some memory for CPU time. Pre-seeding the deduper with the 200 most recent articles seems reasonable; this number can be changed where recent_articles is fetched in newsSpiders.runner.discover.
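A minimal sketch of the dedup idea, assuming hypothetical names (`ArticleDeduper`, `is_duplicate`, `seed_urls`); the class actually added in this PR may look different:

```python
# Sketch only; class and method names are hypothetical, not the
# identifiers used in this PR.

class ArticleDeduper:
    """In-memory filter that tracks article URLs it has already seen."""

    def __init__(self, seed_urls=()):
        # Pre-seed with recently stored articles so a fresh process
        # does not re-insert items the db already holds.
        self.seen = set(seed_urls)

    def is_duplicate(self, url):
        """Return True if url was seen before; otherwise record it."""
        if url in self.seen:
            return True
        self.seen.add(url)
        return False


# Seed with the N most recent article URLs from the db (200 at first,
# later raised to 500 in this PR), then filter discovered links.
deduper = ArticleDeduper(seed_urls=["https://example.com/a1"])
links = ["https://example.com/a1", "https://example.com/a2"]
print([u for u in links if not deduper.is_duplicate(u)])
# -> ['https://example.com/a2']
```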

Commits 2b42cbf through 6fb894e move all db queries out of the update spiders. This should speed up update crawling a bit, since our db operations are blocking. Db queries in the discover spiders were already moved out in #36. This part is not strictly needed for the dedup mechanism.
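The pattern is sketched below under the assumption of a Scrapy-style spider; names like `UpdateSpider` and `load_articles_to_update` are illustrative, not the repo's actual API. The runner performs the blocking db read once, up front, and hands plain records to the spider, so no blocking I/O happens inside crawl callbacks:

```python
# Illustrative refactor pattern: blocking query in the runner,
# plain data passed into the spider.
import scrapy


def load_articles_to_update(db):
    # Hypothetical helper: a blocking query executed once, before the
    # asynchronous crawl starts.
    return [{"url": row[0]} for row in db.execute("SELECT url FROM Article")]


class UpdateSpider(scrapy.Spider):
    name = "update"

    def __init__(self, articles=None, **kwargs):
        super().__init__(**kwargs)
        # The spider receives plain data instead of a db handle, so its
        # callbacks never block on the database.
        self.articles = articles or []

    def start_requests(self):
        for article in self.articles:
            yield scrapy.Request(article["url"], callback=self.parse)

    def parse(self, response):
        yield {"url": response.url, "raw": response.text}
```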

pm5 commented 4 years ago

After a few rounds of testing, I raised the number to 500 articles because the memory footprint is quite small.

andreawwenyi commented 4 years ago

@pm5 +++ I'm wondering if we could remove lines 49-55 in runner/update.py and have get_site_to_update.sql include `Article_type = Dcard` (or drop the article_type filter entirely, since we don't have any fb posts anymore). The logic in runner/update.py feels a bit complicated right now: for Dcard we open only one spider, while every other site gets one spider per site.
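A rough sketch of what that simplification could look like, with made-up data and names standing in for the repo's db layer and scheduler:

```python
# Made-up data in place of the real db query; the point is the shape
# of the loop. Once get_site_to_update.sql filters article types
# itself, the runner no longer needs a special single-spider branch
# for Dcard.

SITES_TO_UPDATE = [
    {"site_id": 1, "type": "news"},
    {"site_id": 2, "type": "dcard"},
]


def schedule_update_spiders(sites):
    # One update spider per site, Dcard included.
    for site in sites:
        print(f"open update spider for site {site['site_id']} ({site['type']})")


schedule_update_spiders(SITES_TO_UPDATE)
```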

pm5 commented 4 years ago

You're right, we can do just that.