Closed by edwardchalstrey1 5 years ago
Yes, this was from multiple crawls. You could have a look for a sitemap (usually a better bet than a scattergun crawl) but if there's not one, then let's stick with what we've got.
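For reference, a minimal sketch of pulling page URLs out of a sitemap, assuming the site publishes one in the standard sitemaps.org XML format. The `urls_from_sitemap` helper and the `SAMPLE` document are hypothetical and only for illustration; this is not how our crawler does it, and the second sample URL is made up.

```python
import xml.etree.ElementTree as ET

# Namespace used by the standard sitemaps.org schema.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content, standing in for a downloaded sitemap file.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.apnews.com/45309e99d09e438a8b5f329f73ac7850</loc></url>
  <url><loc>https://www.apnews.com/some-hub-page</loc></url>
</urlset>"""


def urls_from_sitemap(xml_text):
    """Return every <loc> URL listed in a sitemap XML string."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


urls = urls_from_sitemap(SAMPLE)
```

The resulting list could then be used as the crawler's start URLs instead of following links at random.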
The sitemap looks to be the best bet! A fairly low percentage of links are articles, but we do go well beyond 132 if we leave the crawler running. I stopped the crawler manually at 1295 pages:
2019-08-05 16:29:52 INFO: Processed 1295 pages in 0:01:07.647171 => 19.33 Hz
2019-08-05 16:29:52 INFO: Found articles in 257/1295 pages => 19.85%
2019-08-05 16:29:52 INFO: ... of these 0/257 had no date => 0.00%
2019-08-05 16:29:52 INFO: ... of these 102/257 had no byline => 39.69%
2019-08-05 16:29:52 INFO: ... of these 0/257 had no title => 0.00%
2019-08-05 16:29:52 INFO: Including skipped pages, there are articles in 257/1295 pages => 19.85%
LGTM (by the way, their article IDs are just 32 alphanumeric characters?)
yes e.g. https://www.apnews.com/45309e99d09e438a8b5f329f73ac7850
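If it helps, here is a rough sketch of a URL filter for that ID format. It assumes the IDs are always 32 lowercase hexadecimal characters, which is only inferred from the single example above; `ARTICLE_URL_RE` and `extract_article_id` are hypothetical names, not part of the existing crawler.

```python
import re

# Assumption: an article URL is the site root followed by a 32-character
# lowercase hex ID, as in the example URL above.
ARTICLE_URL_RE = re.compile(r"^https?://(?:www\.)?apnews\.com/([0-9a-f]{32})/?$")


def extract_article_id(url):
    """Return the 32-character article ID, or None if the URL does not match."""
    match = ARTICLE_URL_RE.match(url)
    return match.group(1) if match else None


article_id = extract_article_id(
    "https://www.apnews.com/45309e99d09e438a8b5f329f73ac7850"
)
```

If some IDs turn out to contain uppercase letters or characters outside `0-9a-f`, the character class would need widening.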
Crawling this site, I only got 49 articles, fewer than the 544 suggested by #347
Tested with scattergun and now get 132, with some lacking bylines on the site:
@jemrobinson what do you think, is this a change worth making? Is the 544 you have in the database a result of multiple crawls?