alan-turing-institute / misinformation-crawler

Web crawler to collect snapshots of articles to web archive
MIT License

Update dailykos match rules #343

Closed: edwardchalstrey1 closed 5 years ago

edwardchalstrey1 commented 5 years ago

Closes #326

I updated the byline match rule, but the test I made also required updating the content to "group". Although this meant editing test 1, only the <div> wrapping has changed; the article content is the same.

Tested with the crawler on up to 300 pages:

2019-07-31 12:19:04     INFO: Found articles in 300/300 pages => 100.00%
2019-07-31 12:19:04     INFO: ... of these 0/300 had no date => 0.00%
2019-07-31 12:19:04     INFO: ... of these 0/300 had no byline => 0.00%
2019-07-31 12:19:04     INFO: ... of these 0/300 had no title => 0.00%
2019-07-31 12:19:04     INFO: Including skipped pages, there are articles in 300/300 pages => 100.00%
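For anyone reading the summary lines above, here is a minimal sketch of how percentages like these could be derived from per-page extraction results. It is not the crawler's actual code: the `summarise` and `pct` functions and the shape of the `pages` list (one dict of extracted fields per page, or `None` where no article was found) are assumptions made for illustration.

```python
def pct(part, whole):
    # Guard against division by zero when nothing was crawled.
    return 100.0 * part / whole if whole else 0.0


def summarise(pages):
    # `pages` is a hypothetical structure: one dict of extracted fields
    # per crawled page, or None where no article was found.
    articles = [p for p in pages if p is not None]
    lines = [
        f"Found articles in {len(articles)}/{len(pages)} pages"
        f" => {pct(len(articles), len(pages)):.2f}%"
    ]
    for field in ("date", "byline", "title"):
        # A field counts as missing if it is absent or empty.
        missing = sum(1 for a in articles if not a.get(field))
        lines.append(
            f"... of these {missing}/{len(articles)} had no {field}"
            f" => {pct(missing, len(articles)):.2f}%"
        )
    return lines


if __name__ == "__main__":
    # Illustrative data only: 300 pages, every field present.
    sample = [{"date": "2019-07-31", "byline": "A. Writer", "title": "T"}] * 300
    for line in summarise(sample):
        print(line)
```

A missing-field rate of 0.00% across all three fields, as in the run above, would mean every page where an article was found also matched the date, byline and title rules.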