apache / incubator-stormcrawler

A scalable, mature and versatile web crawler based on Apache Storm
https://stormcrawler.apache.org/
Apache License 2.0

Protocol-okhttp: implement IP filter #1107

Open jnioche opened 1 year ago

jnioche commented 1 year ago

See NUTCH-2930

In order to avoid information leakage to a public search index or web archive, it should be possible to configure Nutch so that no content is fetched from localhost, loop-back addresses, or private address spaces.

NUTCH-2527 adds the configuration snippets to exclude URLs pointing to private addresses.

However, filtering URLs isn't enough because a DNS entry for an arbitrary host name may point to a private IP address. Blocking must happen at the protocol level because the IP address is only known in the protocol implementation. I'll add an implementation for protocol-okhttp.
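
A minimal sketch of the kind of address check described above, using only `java.net.InetAddress`. The class `PrivateAddressFilter` and its method names are hypothetical and are not taken from StormCrawler or the Nutch patch. It assumes the filter should reject wildcard, loopback, link-local and private addresses, and that a host that cannot be resolved should be treated as blocked.

```java
import java.net.Inet6Address;
import java.net.InetAddress;
import java.net.UnknownHostException;

/** Hypothetical helper, not part of StormCrawler or Nutch. */
final class PrivateAddressFilter {

    private PrivateAddressFilter() {}

    /** Returns true if the address should not be fetched from. */
    static boolean isBlocked(InetAddress addr) {
        if (addr.isAnyLocalAddress()            // wildcard: 0.0.0.0 or ::
                || addr.isLoopbackAddress()     // 127.0.0.0/8 or ::1
                || addr.isLinkLocalAddress()    // 169.254.0.0/16 or fe80::/10
                || addr.isSiteLocalAddress()) { // 10/8, 172.16/12, 192.168/16, fec0::/10
            return true;
        }
        // IPv6 unique local addresses (fc00::/7) are not covered by the
        // checks above, so test the first byte explicitly.
        if (addr instanceof Inet6Address) {
            return (addr.getAddress()[0] & 0xfe) == 0xfc;
        }
        return false;
    }

    /**
     * Resolves the host and returns true if any resolved address is blocked.
     * Assumption: an unresolvable host is treated as blocked.
     */
    static boolean isBlockedHost(String host) {
        try {
            for (InetAddress addr : InetAddress.getAllByName(host)) {
                if (isBlocked(addr)) {
                    return true;
                }
            }
            return false;
        } catch (UnknownHostException e) {
            return true;
        }
    }
}
```

For example, `PrivateAddressFilter.isBlockedHost("192.168.1.1")` returns `true`, while `isBlockedHost("8.8.8.8")` returns `false`. In a real protocol implementation the check would have to run on the addresses actually used for the connection, since resolving once for the check and again for the fetch leaves a gap; with OkHttp, a custom `okhttp3.Dns` passed to the client builder is one plausible place to call `isBlocked`, though that wiring is an assumption here.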

rzo1 commented 1 year ago

Sounds useful. It might also be useful to add addresses dynamically during a crawl in order to deal with abuse requests, etc.

jnioche commented 1 year ago

NUTCH-2527 -> #543

jnioche commented 1 year ago

https://github.com/apache/nutch/pull/736/files