commoncrawl / news-crawl

News crawling with StormCrawler - stores content as WARC
Apache License 2.0

Endless refetch of URLs due to changing domain names #28

Closed sebastian-nagel closed 5 years ago

sebastian-nagel commented 5 years ago

The news crawler uses the domain name to manage fetch queues; the domain name is also used to route URLs to Elasticsearch shards. When a URL is re-fetched, the existing routing key is not reused; instead, the domain name is freshly extracted from the host name and used as the routing key. This makes the routing unstable, because domain name extraction is based on the continuously updated public suffix list. If the routing changes, the status record is not updated; instead, a second record with the same key is created. Because the nextFetchDate of the original record is still in the past and is never updated, the URL is scheduled for re-fetch again and again.
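The failure mode above can be simulated in a few lines. This is an illustrative sketch only, not StormCrawler code: the two public suffix lists, the `registered_domain` helper, and the `(url, routing_key)` index key are assumptions made to mimic the behaviour described (a PSL update changes the extracted domain, so a re-fetch writes a second record instead of updating the first).

```python
# Hypothetical public suffix lists before and after an update.
# In the "old" list, "kp.ru" is wrongly treated as a public suffix,
# so every subdomain of kp.ru looks like its own registered domain.
PSL_OLD = {"ru", "kp.ru"}
PSL_NEW = {"ru"}

def registered_domain(host, psl):
    """Illustrative extraction: shortest tail that is suffix + one label."""
    labels = host.split(".")
    for i in range(len(labels) - 1):
        if ".".join(labels[i + 1:]) in psl:
            return ".".join(labels[i:])
    return host

# Status index keyed by (url, routing_key): when the routing key
# changes, a second record is created instead of updating the first.
index = {}

def upsert(url, psl):
    host = url.split("/")[2]
    index[(url, registered_domain(host, psl))] = {"fetched": True}

url = "https://saratov.kp.ru/sitemap.xml"
upsert(url, PSL_OLD)   # routed under "saratov.kp.ru"
upsert(url, PSL_NEW)   # routed under "kp.ru" -> duplicate record
print(len(index))      # 2: the stale record keeps its past nextFetchDate
```

The second record shadows the first; since the original record's nextFetchDate stays in the past, the scheduler keeps selecting the URL for re-fetch.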

Two examples (the domain name in the updated version is the correct one):

sebastian-nagel commented 5 years ago

Opened DigitalPebble/storm-crawler#684. For now, all "hanging" items have been removed from the status index. 350 domains (or hosts previously identified as separate domains) are affected. Luckily, the issue only affects re-fetching of feeds and sitemaps, plus a few re-fetches of URLs first fetched shortly before the last update, which included a significant improvement in domain name extraction (crawler-commons/crawler-commons#183). The biggest group is the domain kp.ru, which was erroneously split into 50 "domains": "www.kp.ru", "saratov.kp.ru", etc.
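The idea behind the fix can be sketched as follows. This is a hedged illustration, not the actual storm-crawler#684 patch: the `routing_key` helper and the record field name are hypothetical. The point is to persist the routing key with the status record on first fetch and reuse it on re-fetch, rather than re-deriving it from the host name via the (changing) public suffix list.

```python
def routing_key(record, host, extract_domain):
    """Return the stored routing key if present; derive it only once."""
    if "routing_key" not in record:
        record["routing_key"] = extract_domain(host)
    return record["routing_key"]

record = {}
# First fetch: the extractor wrongly yields "saratov.kp.ru".
first = routing_key(record, "saratov.kp.ru", lambda h: "saratov.kp.ru")
# Later the extractor improves, but the stored key stays stable,
# so the status record is updated in place instead of duplicated.
second = routing_key(record, "saratov.kp.ru", lambda h: "kp.ru")
print(first == second)  # True: routing is stable across PSL updates
```

With a stable key, a re-fetch updates the existing record and its nextFetchDate, so the endless rescheduling stops.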

sebastian-nagel commented 5 years ago

Fix deployed and status index reindexed, together with the upgrade to StormCrawler 1.14 and Elasticsearch 7.0.