disinfoRG / ZeroScraper

Web scraper made by 0archive.
MIT License

Seems to not be crawling enough articles from news websites #88

Closed andreawwenyi closed 4 years ago

andreawwenyi commented 4 years ago
As of 3/14/2020, 雪花新聞 accounts for the largest share of articles in our db, and the article counts for news websites seem too low.

| site_id | site_name | article_count |
|---|---|---|
| 119 | 雪花新聞 | 338484 |
| 789 | 雪花新聞 | 287540 |
| 98 | Ptt 八卦版 | 240767 |
| 105 | 聯合新聞網 | 115265 |
| 100 | 中時電子報 | 98845 |
| 753 | 中國評論新聞網 | 89869 |
| 99 | Ptt 政黑版 | 88659 |
| 106 | 自由時報 | 55353 |
| 104 | 蘋果即時 | 53201 |
| 108 | 芋傳媒 | 51748 |
| 732 | 文匯報 | 42307 |
| 114 | 觀察者 | 41650 |
| 107 | 中央社 | 36482 |
| 102 | ETtoday 新聞雲 | 34801 |
| 1 | 中国台湾网 | 25548 |

These records were obtained with the following SQL:

```sql
SELECT Article.site_id, Site.name, COUNT(Article.article_id) AS article_count
FROM Article
INNER JOIN Site ON Site.site_id = Article.site_id
GROUP BY Article.site_id
ORDER BY article_count DESC
LIMIT 15;
```
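
Because 雪花新聞 appears under two site_ids (119 and 789), it may help to merge rows by site name before comparing sources. The following is only a rough sketch in Python, using the 15 counts listed above; the function names `totals_by_name` and `share_of_listed` are made up for illustration, and the share it computes is relative to these 15 rows only, not to the whole db.

```python
from collections import defaultdict

# (site_id, site_name, article_count), copied from the query result above.
ROWS = [
    (119, "雪花新聞", 338484),
    (789, "雪花新聞", 287540),
    (98, "Ptt 八卦版", 240767),
    (105, "聯合新聞網", 115265),
    (100, "中時電子報", 98845),
    (753, "中國評論新聞網", 89869),
    (99, "Ptt 政黑版", 88659),
    (106, "自由時報", 55353),
    (104, "蘋果即時", 53201),
    (108, "芋傳媒", 51748),
    (732, "文匯報", 42307),
    (114, "觀察者", 41650),
    (107, "中央社", 36482),
    (102, "ETtoday 新聞雲", 34801),
    (1, "中国台湾网", 25548),
]


def totals_by_name(rows):
    """Sum article counts across site_ids that share the same site name."""
    totals = defaultdict(int)
    for _site_id, name, count in rows:
        totals[name] += count
    return dict(totals)


def share_of_listed(rows, name):
    """Fraction of the listed articles (these rows only) that belong to `name`."""
    totals = totals_by_name(rows)
    return totals[name] / sum(totals.values())


if __name__ == "__main__":
    merged = totals_by_name(ROWS)
    # Print sources from largest to smallest merged count.
    for name, count in sorted(merged.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name}\t{count}")
    print(f"雪花新聞 share of listed rows: {share_of_listed(ROWS, '雪花新聞'):.1%}")
```

With the two site_ids merged, 雪花新聞 totals 626024 articles, roughly 39% of the articles in these 15 rows.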
andreawwenyi commented 4 years ago

After some research, the news websites turn out to be okay. Here's a notebook with the analysis: https://g0v.hackmd.io/ktWzAepfTB-uJ3zGDt-l9A