alan-turing-institute / misinformation-crawler

Web crawler to collect snapshots of articles to web archive
MIT License

Switch washingtontimes to sitemap crawl and add extra date format #356

Closed edwardchalstrey1 closed 5 years ago

edwardchalstrey1 commented 5 years ago

Closes #198

The only potential downside of this change is that we are no longer restricted to the politics category. We were previously crawling only ~60 articles, but the crawler now runs indefinitely. I stopped it at 485 pages:

2019-08-06 11:04:29     INFO: Processed 485 pages in 0:02:33.039399 => 3.17 Hz
2019-08-06 11:04:29     INFO: Found articles in 485/485 pages => 100.00%
2019-08-06 11:04:29     INFO: ... of these 0/485 had no date => 0.00%
2019-08-06 11:04:29     INFO: ... of these 290/485 had no byline => 59.79%
2019-08-06 11:04:29     INFO: ... of these 0/485 had no title => 0.00%
2019-08-06 11:04:29     INFO: Including skipped pages, there are articles in 485/485 pages => 100.00%

I've checked, and many articles really do have no byline.
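As a sanity check on the summary lines, the rate and the byline percentage can be recomputed from the raw counts. This is only a minimal sketch: `crawl_stats` is a hypothetical helper written for this comment, not a function from the crawler, and the numbers are copied from the log output above.

```python
from datetime import timedelta


def crawl_stats(pages, elapsed, no_byline):
    """Return (pages per second, percentage of pages with no byline)."""
    rate_hz = pages / elapsed.total_seconds()
    no_byline_pct = 100.0 * no_byline / pages
    return rate_hz, no_byline_pct


# Figures copied from the log: 485 pages in 0:02:33.039399, 290 without a byline.
elapsed = timedelta(minutes=2, seconds=33, microseconds=39399)
rate_hz, no_byline_pct = crawl_stats(485, elapsed, 290)
print(f"{rate_hz:.2f} Hz, {no_byline_pct:.2f}% without byline")
```

Both values agree with what the crawler logged (3.17 Hz and 59.79%).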

edwardchalstrey1 commented 5 years ago

Decided not to merge this, as we want to keep crawling the politics category only.
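For reference, one way a sitemap crawl could be kept to a single category is to filter the sitemap's `<loc>` entries by a URL pattern before crawling. The sketch below is an illustration only: `filter_sitemap_urls`, the `/politics/` pattern, and the `example.com` URLs are all hypothetical, and it assumes the category appears in the article URL, which may not hold for washingtontimes. It is not how this crawler is configured.

```python
import re
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def filter_sitemap_urls(sitemap_xml, pattern):
    """Return the sitemap URLs whose text matches the given regex pattern."""
    root = ET.fromstring(sitemap_xml)
    regex = re.compile(pattern)
    urls = [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]
    return [url for url in urls if regex.search(url)]


# Hypothetical sitemap content, used only to exercise the function.
EXAMPLE_SITEMAP = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/politics/story-a</loc></url>
  <url><loc>https://example.com/sport/story-b</loc></url>
</urlset>
"""

politics_urls = filter_sitemap_urls(EXAMPLE_SITEMAP, r"/politics/")
print(politics_urls)
```

If the category is not encoded in the URL, a filter like this cannot work, and the category would have to be read from each article page instead.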