sebastian-nagel closed this issue 7 years ago
I was thinking about having a URLFilteringBolt that could be used in front of a status updater. This would be useful when injecting, but also for people using many instances of SimpleFetcherBolt, where each instance has its own copy of the URL filters.
Good idea. For the injector, a minimal filter which just makes sure that the URL string is a valid URL and is preserved as is would be fine. It's hard to figure out what's going wrong if the key is changed on its way through the topology.
Thanks! Verified that URLs are filtered and normalized during injection.
If the round trip conversion String <> java.net.URL yields a different URL string, the crawl topology fails to properly update the status of fetched items. This happens if injected URLs contain trailing whitespace (cf. commoncrawl/news-crawl#16), but may also affect file:/// URLs (cf. NUTCH-1483).
One solution could be to consistently apply the conversion String > URL > String, especially in the injectors, or to reject all URLs which would otherwise cause trouble.
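A minimal sketch of the round trip check described here, using only java.net.URL. The class and method names (RoundTripCheck, roundTrip, isStable) are made up for illustration and are not part of StormCrawler. An injector could either store the converted string as the key or reject URLs for which isStable returns false.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper, not part of StormCrawler: checks whether a URL string
// survives the conversion String > java.net.URL > String unchanged.
public class RoundTripCheck {

    // Returns the string form after the round trip, or null if the input
    // cannot be parsed as a URL at all.
    // Note: the URL(String) constructor is deprecated in recent JDKs but is
    // the conversion under discussion here.
    static String roundTrip(String urlString) {
        try {
            return new URL(urlString).toExternalForm();
        } catch (MalformedURLException e) {
            return null;
        }
    }

    // True only if the round trip yields exactly the same string, i.e. the
    // key would stay identical on its way through the topology.
    static boolean isStable(String urlString) {
        return urlString.equals(roundTrip(urlString));
    }

    public static void main(String[] args) {
        String[] candidates = {
            "http://example.com/page",   // plain URL
            "http://example.com/page ",  // trailing whitespace
            "file:///tmp/example.txt",   // file URL, cf. NUTCH-1483
            "not a url"                  // unparsable input
        };
        for (String c : candidates) {
            System.out.println("[" + c + "] -> [" + roundTrip(c)
                    + "] stable=" + isStable(c));
        }
    }
}
```

Running main prints the converted form next to each input, which makes it easy to see which injected URLs would end up with a changed key.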