commoncrawl / news-crawl

News crawling with StormCrawler - stores content as WARC
Apache License 2.0

URLs with trailing white space continuously re-fetched #16

Closed sebastian-nagel closed 7 years ago

sebastian-nagel commented 7 years ago

Certain URLs following a quite specific pattern are continuously re-fetched. Here are the counts from a couple of hours of log files:

    669 http://www.tvl.be/,NLD,Leuven
    668 http://www.ltv.ly/,ARA,National
    667 http://www.rajdhani.com.np/,NEP,National
    665 http://ukrainian.voanews.com/,UKR,Nationwide
    665 http://www.ura-inform.com/,RUS,Nationwide
    662 http://lariviera.netweek.it/,ITA,Sanremo
    458 http://www.topix.com/world/burkina-faso/,ENG,Foreign

The status index of these URLs seems correct, e.g.:

      "_source" : {
        "url" : "http://www.topix.com/world/burkina-faso/,ENG,Foreign",
        "status" : "FETCHED",
        "metadata" : {
          "fetch%2EstatusCode" : [ "200" ]
        },
        "hostname" : "topix.com",
        "nextFetchDate" : "2027-01-31T21:18:49.092Z"
      }

or

      "_source" : {
        "url" : "http://www.ura-inform.com/,RUS,Nationwide",
        "status" : "REDIRECTION",
        "metadata" : {
          "_redirTo" : [ "http://ura-inform.com" ],
          "fetch%2EstatusCode" : [ "301" ]
        },
        "hostname" : "ura-inform.com",
        "nextFetchDate" : "2027-01-31T21:22:09.213Z"
      }

john-hewitt commented 7 years ago

I would look at the seed URLs from #12; it looks like each line of the 3-field CSV [URL,Language,Region] was treated as a single URL.
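For illustration, a minimal Python sketch of splitting such a seed line into its three fields before the URL is handed to the crawler. The `Seed` type and `parse_seed_line` helper are hypothetical names, not part of news-crawl or StormCrawler, and the sketch assumes the URL itself contains no unquoted comma.

```python
import csv
from typing import NamedTuple


class Seed(NamedTuple):
    """One seed entry (hypothetical type, for illustration only)."""
    url: str
    language: str
    region: str


def parse_seed_line(line: str) -> Seed:
    """Split one 'URL,Language,Region' line and strip stray whitespace."""
    fields = next(csv.reader([line]))
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
    # Strip each field so a trailing space on the line does not end up in any value.
    url, language, region = (field.strip() for field in fields)
    return Seed(url, language, region)


seed = parse_seed_line("http://www.topix.com/world/burkina-faso/,ENG,Foreign ")
print(seed.url)
```

Only `seed.url` would then be injected as a seed; language and region could be kept as metadata.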

sebastian-nagel commented 7 years ago

Thanks, @john-hewitt! That explains how these URLs went into the crawl. A quick fix would be to just remove them.

However, they are currently re-fetched every 3-5 minutes, which is definitely a bug. Better to analyze it while it's hot. It's an unnecessary waste of resources and, what's worse, some of them are repeatedly written into the WARC files.

sebastian-nagel commented 7 years ago

Ok, got it: it's a record with a trailing space:

      "_source" : {
        "url" : "http://www.topix.com/world/burkina-faso/,ENG,Foreign ",
        "status" : "DISCOVERED",
        "metadata" : { },
        "hostname" : "topix.com",
        "nextFetchDate" : "2016-12-21T16:14:40.784Z"
      }

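The trailing space is lost in the topology when FetcherBolt converts the URL string into a java.net.URL object. As a consequence, the status update is written under the trimmed URL, and the original record (with the trailing space) never gets updated, so it stays DISCOVERED and is scheduled for fetching again and again.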
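The mechanism can be simulated with a toy status index. This is only a sketch in Python, not the StormCrawler code: a plain dict stands in for the Elasticsearch status index, and `str.strip()` stands in for the whitespace loss during the conversion to a URL object. The names `fetch_and_update` and `due_for_fetch` are made up for the example.

```python
def fetch_and_update(status_index: dict, stored_url: str) -> None:
    """Pretend to fetch a URL and write the result back to the index."""
    # Assumption: strip() models the trailing space being dropped
    # when the stored string is turned into a URL object.
    fetched_url = stored_url.strip()
    # The update is keyed by the trimmed URL, not by the stored string.
    status_index[fetched_url] = "FETCHED"


def due_for_fetch(status_index: dict) -> list:
    """Return all URLs whose record is still waiting to be fetched."""
    return [url for url, status in status_index.items() if status == "DISCOVERED"]


index = {"http://www.topix.com/world/burkina-faso/,ENG,Foreign ": "DISCOVERED"}

fetch_count = 0
for _ in range(3):  # three scheduling rounds
    for url in due_for_fetch(index):
        fetch_and_update(index, url)
        fetch_count += 1

print(fetch_count)
for url, status in index.items():
    print(repr(url), status)
```

In every round the record with the trailing space is still due, while a second record without the space holds the FETCHED status. This matches the two kinds of records shown above.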

sebastian-nagel commented 7 years ago

Removed all troublesome URLs from Elasticsearch.
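For anyone hitting the same problem, a small Python sketch of how affected records could be spotted. It is not the procedure used here: it assumes the status records have already been exported as a list of dicts shaped like the `_source` objects above, and `urls_with_stray_whitespace` is a hypothetical helper.

```python
def urls_with_stray_whitespace(records: list) -> list:
    """Return URLs that differ from their whitespace-trimmed form."""
    return [r["url"] for r in records if r["url"] != r["url"].strip()]


records = [
    {"url": "http://www.topix.com/world/burkina-faso/,ENG,Foreign ", "status": "DISCOVERED"},
    {"url": "http://www.topix.com/world/burkina-faso/,ENG,Foreign", "status": "FETCHED"},
]

for url in urls_with_stray_whitespace(records):
    print(repr(url))
```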