alan-turing-institute / misinformation-crawler

Web crawler to collect snapshots of articles to web archive
MIT License

more articles from apnews with sitemap #352

Closed edwardchalstrey1 closed 5 years ago

edwardchalstrey1 commented 5 years ago

Crawling this site, I only got 49 articles, fewer than the 544 suggested by #347.

Tested with a scattergun crawl and now get 132 articles, some of which lack bylines on the site itself:

2019-08-05 11:55:23     INFO: Finished processing 132/132: https://apnews.com/503d7890ff8344dea884616e6cb20dd1
2019-08-05 11:55:23     INFO: Processed 132 pages in 0:00:13.862714 => 10.15 Hz
2019-08-05 11:55:23     INFO: Found articles in 132/132 pages => 100.00%
2019-08-05 11:55:23     INFO: ... of these 0/132 had no date => 0.00%
2019-08-05 11:55:23     INFO: ... of these 28/132 had no byline => 21.21%
2019-08-05 11:55:23     INFO: ... of these 0/132 had no title => 0.00%
2019-08-05 11:55:23     INFO: Including skipped pages, there are articles in 132/132 pages => 100.00%

@jemrobinson what do you think, is this a change worth making? Is the 544 you have in the database a result of multiple crawls?

edwardchalstrey1 commented 5 years ago

Yes, this was from multiple crawls. You could have a look for a sitemap (usually a better bet than a scattergun crawl) but if there's not one, then let's stick with what we've got.

The sitemap looks to be the best bet! A fairly low percentage of the links are articles, but we do go well beyond 132 if we leave the crawler running. I stopped the crawler manually at 1295 pages:

2019-08-05 16:29:52     INFO: Processed 1295 pages in 0:01:07.647171 => 19.33 Hz
2019-08-05 16:29:52     INFO: Found articles in 257/1295 pages => 19.85%
2019-08-05 16:29:52     INFO: ... of these 0/257 had no date => 0.00%
2019-08-05 16:29:52     INFO: ... of these 102/257 had no byline => 39.69%
2019-08-05 16:29:52     INFO: ... of these 0/257 had no title => 0.00%
2019-08-05 16:29:52     INFO: Including skipped pages, there are articles in 257/1295 pages => 19.85%
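
For anyone wanting to check a sitemap by hand before running a full crawl, here is a rough sketch of pulling the `<loc>` entries out of a sitemap document and reproducing the article percentage reported in the log. It uses only the Python standard library and is not the crawler's own code; the function names and the sample sitemap entries (other than the one article URL from the log) are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_locations(xml_text):
    """Return every <loc> URL found in a sitemap (or sitemap index) document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.iterfind(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


def article_percentage(n_articles, n_pages):
    """Percentage of crawled pages that turned out to be articles."""
    return 100.0 * n_articles / n_pages if n_pages else 0.0


# Hypothetical sitemap content: the second URL is invented for the example.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://apnews.com/503d7890ff8344dea884616e6cb20dd1</loc></url>
  <url><loc>https://apnews.com/some-section-page</loc></url>
</urlset>"""

if __name__ == "__main__":
    for url in sitemap_locations(SAMPLE_SITEMAP):
        print(url)
    # Counts taken from the log above: 257 articles in 1295 pages.
    print(f"{article_percentage(257, 1295):.2f}%")
```

The low article percentage is expected with this approach, since a sitemap lists section and index pages alongside the articles.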
edwardchalstrey1 commented 5 years ago

LGTM (by the way, their article IDs are just 32 alphanumeric characters?)

Yes, e.g. https://www.apnews.com/45309e99d09e438a8b5f329f73ac7850
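
If it helps, a minimal sketch of how that ID pattern could be used to pick article URLs out of the crawled links. This assumes an article URL is just the host followed by a single 32-character alphanumeric path segment, which is all this thread establishes (both examples here happen to be lowercase hex). The function name and the non-article URL in the usage lines are hypothetical, and this is not the crawler's actual configuration.

```python
import re
from urllib.parse import urlparse

# Assumption from this thread: an article ID is exactly 32 alphanumeric characters.
ARTICLE_ID = re.compile(r"[0-9A-Za-z]{32}")

# Both hostnames appear in the URLs quoted in this issue.
APNEWS_HOSTS = {"apnews.com", "www.apnews.com"}


def apnews_article_id(url):
    """Return the article ID if the URL looks like an apnews article, else None."""
    parsed = urlparse(url)
    if parsed.netloc not in APNEWS_HOSTS:
        return None
    segment = parsed.path.strip("/")
    # fullmatch rejects longer paths such as section or tag pages.
    return segment if ARTICLE_ID.fullmatch(segment) else None


if __name__ == "__main__":
    print(apnews_article_id("https://www.apnews.com/45309e99d09e438a8b5f329f73ac7850"))
    print(apnews_article_id("https://apnews.com/tag/hypothetical-section"))
```

Tightening the character class to `[0-9a-f]` would be stricter, but that goes beyond what has been confirmed here.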