Closed by niels 8 years ago
By default, GenericLinkExtractor now only handles these URL schemes: http, https, and ftp. This can be overridden, like this:
<extractor class="com.norconex.collector.http.url.impl.GenericLinkExtractor">
<schemes>http, https, somefunnyone</schemes>
</extractor>
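To illustrate the idea behind the scheme restriction, here is a minimal Java sketch of an allow-list check. It is not the actual GenericLinkExtractor code: the class name `SchemeFilter`, the method `isAllowed`, and the constant `DEFAULT_SCHEMES` are hypothetical names chosen for this example, and it only assumes that each link is resolved against the page URL before its scheme is compared with the configured list.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Set;

public class SchemeFilter {

    // Hypothetical default mirroring the schemes named above (http, https, ftp).
    public static final Set<String> DEFAULT_SCHEMES = Set.of("http", "https", "ftp");

    /**
     * Resolves href against the page URL and reports whether the resulting
     * URI uses one of the allowed schemes. This is an illustrative sketch,
     * not the collector's implementation.
     */
    public static boolean isAllowed(String pageUrl, String href, Set<String> schemes) {
        URI resolved;
        try {
            // An href that carries its own scheme (tel:, mailto:, ...) is
            // returned unchanged by resolve(); a relative href inherits the
            // scheme of the page it was found on.
            resolved = URI.create(pageUrl).resolve(href.trim());
        } catch (IllegalArgumentException e) {
            return false; // unparsable link: skip it
        }
        String scheme = resolved.getScheme();
        return scheme != null && schemes.contains(scheme.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String page = "https://herimedia.com/norconex-test/";
        System.out.println(isAllowed(page, "page.html", DEFAULT_SCHEMES)); // relative link
        System.out.println(isAllowed(page, "tel:123", DEFAULT_SCHEMES));   // telephone link
        // Widening the list, as the <schemes> element does in the config above:
        System.out.println(isAllowed(page, "somefunnyone:abc",
                Set.of("http", "https", "somefunnyone")));
    }
}
```

With the default list, the relative link is kept and the tel: link is skipped; adding a scheme to the set lets links with that scheme through.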
I did not upgrade the TikaLinkExtractor with the same ability, since people may want to use the Tika implementation for what it is supposed to do out of the box (which seems to be extracting URIs for all schemes).
This has been added to the latest snapshot.
Because this new logic will extract fewer links (which is what we want), I hope it won't cause regressions for some people.
Confirmed working. Thank you!
Given
A page linking to a tel: URI:
And the following config:
Expected
The collector should not follow this link, or a link with any other scheme it can't actually process.
Actual
The collector tries to follow the tel: link. Note the REJECTED_NOTFOUND: https://herimedia.com/norconex-test/tel:123 message.
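The rejected URL looks as if the tel: link was treated as a relative path and appended to the page URL. That is an assumption based only on the message above, not something the thread confirms. The Java sketch below contrasts that kind of naive joining with `java.net.URI.resolve`, which leaves a link that has its own scheme untouched. The class and method names (`TelLinkResolution`, `naiveJoin`, `rfcResolve`) are hypothetical.

```java
import java.net.URI;

public class TelLinkResolution {

    // Naive approach: treat every href as a path relative to the page.
    public static String naiveJoin(String base, String href) {
        return base + href;
    }

    // Standard URI resolution: an href with its own scheme is absolute,
    // so it is returned as-is instead of being appended to the base.
    public static String rfcResolve(String base, String href) {
        return URI.create(base).resolve(href).toString();
    }

    public static void main(String[] args) {
        String page = "https://herimedia.com/norconex-test/";
        System.out.println(naiveJoin(page, "tel:123"));
        System.out.println(rfcResolve(page, "tel:123"));
    }
}
```

The first call reproduces the URL from the REJECTED_NOTFOUND message; the second yields `tel:123`, whose scheme can then be checked against the allowed list before any fetch is attempted.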