apache / incubator-stormcrawler

A scalable, mature and versatile web crawler based on Apache Storm
https://stormcrawler.apache.org/
Apache License 2.0

Seed injectors to normalize String > URL > String #421

Closed sebastian-nagel closed 7 years ago

sebastian-nagel commented 7 years ago

If the round-trip conversion String > java.net.URL > String yields a different URL string, the crawl topology fails to properly update the status of fetched items. This happens if injected URLs contain trailing whitespace (cf. commoncrawl/news-crawl#16), but may also affect file:/// URLs (cf. NUTCH-1483).

One solution could be to consistently apply the conversion String > URL > String, especially in the injectors, or to reject all URLs that would otherwise cause trouble.
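
To illustrate the round trip described above, here is a minimal sketch in plain Java. The class and method names (SeedUrlRoundTrip, roundTrip, isStable) are hypothetical and are not part of StormCrawler; the sketch only assumes the standard java.net.URL constructor, which is deprecated in recent JDKs but still available.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper, not StormCrawler code: shows the String > URL > String round trip.
final class SeedUrlRoundTrip {

    // Returns the round-tripped form of the input, or null if java.net.URL cannot parse it.
    static String roundTrip(String urlString) {
        try {
            // The URL constructor trims leading and trailing whitespace while parsing,
            // so the external form can differ from the injected string.
            return new URL(urlString).toExternalForm();
        } catch (MalformedURLException e) {
            return null;
        }
    }

    // True only if the string parses and survives the round trip unchanged.
    static boolean isStable(String urlString) {
        return urlString.equals(roundTrip(urlString));
    }
}
```

With this sketch, roundTrip("http://example.com/page ") drops the trailing space, so isStable returns false for that seed. An injector could either emit the round-tripped form as the key or skip seeds for which isStable is false.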

jnioche commented 7 years ago

I was thinking about having a URLFilteringBolt that could be used in front of a status updater. This would be useful when injecting, but also for people using many instances of SimpleFetcherBolts, where each one of them has its own copy of the URL filters.

sebastian-nagel commented 7 years ago

Good idea. For the injector, a minimal filter that just makes sure the URL string is a valid URL and is preserved as is would be fine. It's hard to figure out what's going wrong if the key is changed on its way through the topology.
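
A minimal filter of the kind suggested above could look roughly like the following. This is a standalone sketch with hypothetical names (MinimalSeedFilter, accept, keepValid). It is not wired into Storm and does not reflect how StormCrawler's URL filters are actually implemented.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

// Hypothetical standalone filter: accepts a seed only if it is a valid URL
// and java.net.URL preserves it exactly, so the key cannot change later.
final class MinimalSeedFilter {

    static boolean accept(String urlString) {
        if (urlString == null) {
            return false;
        }
        try {
            return new URL(urlString).toExternalForm().equals(urlString);
        } catch (MalformedURLException e) {
            return false; // not a valid URL at all
        }
    }

    // Keeps only the seeds that pass accept(), preserving their order.
    static List<String> keepValid(List<String> seeds) {
        List<String> kept = new ArrayList<>();
        for (String seed : seeds) {
            if (accept(seed)) {
                kept.add(seed);
            }
        }
        return kept;
    }
}
```

Rejecting instead of rewriting keeps the injected key identical to what the rest of the topology sees, at the cost of dropping seeds that a normalizing filter would have repaired.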

sebastian-nagel commented 7 years ago

Thanks! Verified that URLs are filtered and normalized during injection.