The URL filters should reject localhost and private address spaces. The crawler may detect links pointing to a private network address, e.g.
2017-12-23 08:37:22.104 c.d.s.b.FetcherBolt FetcherThread #54 [ERROR] Exception while fetching http://localhost/wordpress/2017/.../
org.apache.http.conn.HttpHostConnectException: Connect to localhost:80 [localhost/127.0.0.1] failed: Connection refused (Connection refused)
This example looks more like an error on the remote page. But the crawler should never even try to access pages on localhost or a private network, to avoid leaking information that then ends up in the WARC file. Such a link could be, e.g., one pointing to the Storm web interface (http://localhost:8080/), which exposes the cluster configuration.
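A filter along these lines could be sketched with the standard `java.net` API — this is a hypothetical helper, not StormCrawler's actual filter implementation, and note that `InetAddress.getByName` may trigger a DNS lookup for non-literal host names:

```java
import java.net.InetAddress;
import java.net.URI;

public class PrivateAddressFilter {

    /**
     * Returns true if the URL points at localhost or a private/link-local
     * network and should therefore be rejected by the URL filters.
     */
    public static boolean isPrivate(String url) {
        try {
            String host = new URI(url).getHost();
            if (host == null) {
                return true; // no host part: reject
            }
            // For non-literal hosts this resolves via DNS, which the
            // filter would do anyway before deciding to fetch.
            InetAddress addr = InetAddress.getByName(host);
            return addr.isLoopbackAddress()    // 127.0.0.0/8, ::1
                || addr.isSiteLocalAddress()   // 10/8, 172.16/12, 192.168/16
                || addr.isLinkLocalAddress()   // 169.254/16, fe80::/10
                || addr.isAnyLocalAddress();   // 0.0.0.0, ::
        } catch (Exception e) {
            return true; // unparseable URL: reject
        }
    }

    public static void main(String[] args) {
        System.out.println(isPrivate("http://localhost:8080/"));   // true
        System.out.println(isPrivate("http://192.168.1.5/admin")); // true
    }
}
```

The same check would also catch the `http://localhost/wordpress/...` links from the log excerpt above before they reach the fetcher.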