Closed niels closed 8 years ago
This is already supported. You have to enable it in the URL normalization phase. Have a look at GenericURLNormalizer. You can enable it by adding removeDotSegments like this:
<urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
  <normalizations>
    removeFragment, lowerCaseSchemeHost, upperCaseEscapeSequence,
    decodeUnreservedCharacters, removeDefaultPort, encodeNonURICharacters,
    removeDotSegments
  </normalizations>
</urlNormalizer>
The other normalizations in the above example are the ones applied by default.
removeDotSegments is not part of the defaults because there is no guarantee that it preserves URL semantics on every website out there.
Please try with the above config snippet and let me know.
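For reference, here is a minimal sketch of what dot-segment removal does to a URL. It assumes plain java.net.URI from the JDK rather than Norconex's own normalizer, so edge cases may differ from what GenericURLNormalizer produces. DotSegmentDemo and its removeDotSegments method are hypothetical names used only for illustration.

```java
import java.net.URI;

// Illustrative only: relies on the JDK, not on Norconex's GenericURLNormalizer.
final class DotSegmentDemo {

    // Collapses "." and ".." path segments using java.net.URI.normalize().
    static String removeDotSegments(String url) {
        return URI.create(url).normalize().toString();
    }

    public static void main(String[] args) {
        // The /dir2/../ pair cancels out.
        System.out.println(removeDotSegments("http://example.com/some-dir/dir2/../file"));
        // A /./ segment is simply dropped.
        System.out.println(removeDotSegments("http://example.com/some-dir/./file"));
    }
}
```

Both calls should print http://example.com/some-dir/file. A server that treats dot segments literally could still serve different content for the unnormalized form, which is the kind of case that keeps this normalization opt-in.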
Ah yes, of course. I have the normalizations in all my production configs, but I was playing around with minimalistic test configs when I ran into this (non-)issue.
Many thanks for the prompt sanity check!
The crawler does not seem to turn URLs including /../ segments (and probably also those including /./ segments) into absolute, normalized URLs. This leads to duplication in the queue and committer sink.
E.g. I would expect a reference such as http://example.com/some-dir/dir2/../file to consistently be referred to as http://example.com/some-dir/file by the crawler. The latter is the URL that should both be put in the queue and be used as the document reference (ID) when committing.
In actuality, however, I see http://example.com/some-dir/dir2/../file being used throughout.
Example config:
Note that https://herimedia.com/norconex-test/this.is.not.a.file/test contains three links with /../ segments:
As a result, the crawler e.g.
norconex-test/this.is.not.a.file/test
.norconex-test/this.is.not.a.file/../../
repetitions until the crawler gets stopped.
Instead, what should happen is that in (1) the crawler right away turns https://herimedia.com/norconex-test/this.is.not.a.file/../../norconex-test.html into https://herimedia.com/norconex-test.html, even before queueing.
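The expected behaviour can be sketched as a queue that normalizes each reference before checking whether it has already been seen. This is only an illustration under stated assumptions: NormalizingQueue is a hypothetical stand-in, not the crawler's actual queue, and it uses java.net.URI from the JDK for the normalization.

```java
import java.net.URI;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical stand-in for a crawl queue; not Norconex's actual implementation.
final class NormalizingQueue {

    // Normalized references, kept in insertion order.
    private final Set<String> seen = new LinkedHashSet<>();

    // Normalizes the reference first, then queues it only if it is new.
    // Returns true when the reference was actually added.
    boolean offer(String url) {
        return seen.add(URI.create(url).normalize().toString());
    }

    int size() {
        return seen.size();
    }

    public static void main(String[] args) {
        NormalizingQueue queue = new NormalizingQueue();
        queue.offer("https://herimedia.com/norconex-test/this.is.not.a.file/../../norconex-test.html");
        queue.offer("https://herimedia.com/norconex-test.html");
        // Both references collapse to the same entry, so only one is queued.
        System.out.println(queue.size());
    }
}
```

Because the set stores only the normalized form, that same string would also serve as the document reference (ID) at commit time.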
See here for a sample log.