commoncrawl / news-crawl

News crawling with StormCrawler - stores content as WARC
Apache License 2.0

Endless refetch of URLs due to changing domain names #28

Closed sebastian-nagel closed 5 years ago

sebastian-nagel commented 5 years ago

The news crawler uses the domain name to manage fetch queues; the domain name is also used to route URLs to Elasticsearch shards. When a URL is re-fetched, the existing routing key is not reused; instead, the domain name is freshly extracted from the host name and used as the routing key. This makes the routing unstable, because domain name extraction is based on the continuously updated public suffix list. If the routing changes, the status record is not updated; instead, a second record with the same key is created. Because the nextFetchDate of the original record is still in the past and is never updated, the URL is scheduled for re-fetch again and again.
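The failure mode above can be simulated in a few lines. This is an illustrative sketch only, not StormCrawler code: the two public suffix lists, the `registered_domain` helper, and the `(url, routing_key)` index key are assumptions made to mimic the behaviour described (a PSL update changes the extracted domain, so a re-fetch writes a second record instead of updating the first).

```python
# Hypothetical public suffix lists before and after an update.
# In the "old" list, "kp.ru" is wrongly treated as a public suffix,
# so every subdomain of kp.ru looks like its own registered domain.
PSL_OLD = {"ru", "kp.ru"}
PSL_NEW = {"ru"}

def registered_domain(host, psl):
    """Illustrative extraction: shortest tail that is suffix + one label."""
    labels = host.split(".")
    for i in range(len(labels) - 1):
        if ".".join(labels[i + 1:]) in psl:
            return ".".join(labels[i:])
    return host

# Status index keyed by (url, routing_key): when the routing key
# changes, a second record is created instead of updating the first.
index = {}

def upsert(url, psl):
    host = url.split("/")[2]
    index[(url, registered_domain(host, psl))] = {"fetched": True}

url = "https://saratov.kp.ru/sitemap.xml"
upsert(url, PSL_OLD)   # routed under "saratov.kp.ru"
upsert(url, PSL_NEW)   # routed under "kp.ru" -> duplicate record
print(len(index))      # 2: the stale record keeps its past nextFetchDate
```

The second record shadows the first; since the original record's nextFetchDate stays in the past, the scheduler keeps selecting the URL for re-fetch.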

Two examples (the domain name in the updated version is the correct one):

sebastian-nagel commented 5 years ago

Opened DigitalPebble/storm-crawler#684. For now, all "hanging" items have been removed from the status index. 350 domains (or hosts previously identified as separate domains) are affected. Luckily, the issue only affects re-fetching of feeds and sitemaps, plus a few re-fetches of URLs first fetched shortly before the last update, which included a significant improvement in domain name extraction (crawler-commons/crawler-commons#183). The biggest group is the domain kp.ru, which was erroneously split into 50 "domains": "www.kp.ru", "saratov.kp.ru", etc.
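The idea behind the fix can be sketched as follows. This is a hedged illustration, not the actual storm-crawler#684 patch: the `routing_key` helper and the record field name are hypothetical. The point is to persist the routing key with the status record on first fetch and reuse it on re-fetch, rather than re-deriving it from the host name via the (changing) public suffix list.

```python
def routing_key(record, host, extract_domain):
    """Return the stored routing key if present; derive it only once."""
    if "routing_key" not in record:
        record["routing_key"] = extract_domain(host)
    return record["routing_key"]

record = {}
# First fetch: the extractor wrongly yields "saratov.kp.ru".
first = routing_key(record, "saratov.kp.ru", lambda h: "saratov.kp.ru")
# Later the extractor improves, but the stored key stays stable,
# so the status record is updated in place instead of duplicated.
second = routing_key(record, "saratov.kp.ru", lambda h: "kp.ru")
print(first == second)  # True: routing is stable across PSL updates
```

With a stable key, a re-fetch updates the existing record and its nextFetchDate, so the endless rescheduling stops.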

sebastian-nagel commented 5 years ago

Fix deployed and status index reindexed, together with the upgrade to StormCrawler 1.14 and Elasticsearch 7.0.