alan-turing-institute / misinformation-crawler

Web crawler to collect snapshots of articles to web archive
MIT License

add us news and world news categories realnewsrightnow #348

Closed by edwardchalstrey1 5 years ago

edwardchalstrey1 commented 5 years ago

By adding these categories we go from 84 to 319 articles:

2019-08-05 10:28:21     INFO: Processed 319 pages in 0:01:32.819360 => 3.47 Hz
2019-08-05 10:28:21     INFO: Found articles in 319/319 pages => 100.00%
2019-08-05 10:28:21     INFO: ... of these 0/319 had no date => 0.00%
2019-08-05 10:28:21     INFO: ... of these 0/319 had no byline => 0.00%
2019-08-05 10:28:21     INFO: ... of these 0/319 had no title => 0.00%
2019-08-05 10:28:21     INFO: Including skipped pages, there are articles in 319/319 pages => 100.00%

edwardchalstrey1 commented 5 years ago

I think there's a better way to specify this in XPath syntax that lets you do something like:

contains(@class, "category-(politics | world-news | u-s-news)")

It's not exactly this, but can you take a look?

Yes, managed to simplify it 👍
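
For reference, XPath 1.0 string functions have no alternation syntax, so the `(a | b | c)` form suggested above cannot go inside a `contains()` string literal. One common way to get the same effect is to join one `contains()` test per category with `or`. The sketch below only builds that expression as a string; it is not the change made in this PR. The helper name `class_contains_any` and the `//article` element are hypothetical, chosen for illustration, and the class names are taken from the suggestion above.

```python
def class_contains_any(class_fragments):
    """Build an XPath 1.0 predicate that is true when @class contains any fragment.

    Hypothetical helper for illustration; not part of the crawler's code.
    """
    if not class_fragments:
        raise ValueError("at least one class fragment is required")
    # One contains() test per fragment, combined with the XPath `or` operator.
    clauses = ['contains(@class, "{}")'.format(fragment) for fragment in class_fragments]
    return " or ".join(clauses)


# Class names as written in the suggestion above.
CATEGORY_CLASSES = ["category-politics", "category-world-news", "category-u-s-news"]

predicate = class_contains_any(CATEGORY_CLASSES)

# Assumption: the matching element is an <article>; the real site config may differ.
article_xpath = "//article[{}]".format(predicate)
print(article_xpath)
```

The resulting string can be passed to any XPath 1.0 engine. Note that `contains()` is a substring match, so a fragment such as `category-politics` would also match a longer class name that merely starts with it.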