commoncrawl / news-crawl

News crawling with StormCrawler - stores content as WARC
Apache License 2.0

URL filter: exclude localhost and private addresses #21

Closed by sebastian-nagel 6 years ago

sebastian-nagel commented 6 years ago

The URL filters should reject URLs pointing to localhost and to private address spaces. The crawler may otherwise pick up links to a private network address, e.g.:

2017-12-23 08:37:22.104 c.d.s.b.FetcherBolt FetcherThread #54 [ERROR] Exception while fetching http://localhost/wordpress/2017/.../
org.apache.http.conn.HttpHostConnectException: Connect to localhost:80 [localhost/127.0.0.1] failed: Connection refused (Connection refused)

This particular example looks more like an error on the remote page. But the crawler should never even try to access pages on localhost or in a private network, to prevent internal information from being leaked and written to the WARC files. That could be, for example, a link to the Storm web interface (http://localhost:8080/), which exposes the cluster configuration.
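
A check along these lines would be one way to classify such URLs before fetching. This is a minimal sketch using java.net.InetAddress; the class and method names are made up for illustration and are not the code from the actual PR:

```java
import java.net.InetAddress;
import java.net.URL;

// Minimal sketch (hypothetical helper, not the filter from the actual PR):
// classify a URL as pointing to localhost or a private network.
public class PrivateAddressCheck {

    // Returns true if the URL's host resolves to a loopback, link-local,
    // site-local (RFC 1918), or wildcard address and should be rejected.
    public static boolean isPrivate(String url) {
        try {
            String host = new URL(url).getHost();
            // getByName() may trigger a DNS lookup for host names; a
            // production filter might restrict itself to literal IPs plus
            // a hostname blocklist to avoid a lookup per outlink.
            InetAddress addr = InetAddress.getByName(host);
            return addr.isLoopbackAddress()   // 127.0.0.0/8, ::1
                || addr.isLinkLocalAddress()  // 169.254.0.0/16, fe80::/10
                || addr.isSiteLocalAddress()  // 10/8, 172.16/12, 192.168/16
                || addr.isAnyLocalAddress();  // 0.0.0.0, ::
        } catch (Exception e) {
            // Unparseable URL or unresolvable host: reject to be safe.
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(isPrivate("http://localhost:8080/"));   // true
        System.out.println(isPrivate("http://192.168.1.5/"));      // true
        System.out.println(isPrivate("https://commoncrawl.org/")); // false (if resolvable)
    }
}
```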

jnioche commented 6 years ago

This would be a good thing to add to the default config for the URL filters generated by the SC archetype.
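
For a regex-based default, rules along the following lines could be added to the filter rules file generated by the archetype. The patterns below are illustrative assumptions, not necessarily what the eventual PR added:

```
# Reject URLs whose host is localhost or a literal address in a
# loopback, RFC 1918 private, or link-local range (illustrative rules)
-^https?://(?:[^/@]*@)?(?:localhost|127\.\d{1,3}\.\d{1,3}\.\d{1,3}|\[::1\])(?::\d+)?(?:[/?#]|$)
-^https?://(?:[^/@]*@)?(?:10\.\d{1,3}|192\.168|169\.254|172\.(?:1[6-9]|2\d|3[01]))\.\d{1,3}\.\d{1,3}(?::\d+)?(?:[/?#]|$)
```

A host-based check (as in the earlier sketch) is more thorough than regexes, since it also catches private addresses hiding behind DNS names, at the cost of a lookup per URL.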

sebastian-nagel commented 6 years ago

The PR is almost ready (I'm testing right now); I'll push it soon. Thanks!