Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Doesn't "absolutify" URLs with /../ segments #224

Closed. niels closed this issue 8 years ago.

niels commented 8 years ago

The crawler does not seem to turn URLs including /../ segments (and probably also including /./ segments) into absolute / normalized URLs. This leads to duplication in the queue and committer sink.

E.g. I would expect a reference such as http://example.com/some-dir/dir2/../file to consistently be referred to as http://example.com/some-dir/file by the crawler. The latter is the URL that should both be put in the queue and be used as the document reference (ID) when committing.

In practice, however, I see http://example.com/some-dir/dir2/../file being used throughout.
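To make the expectation concrete, here is a minimal sketch of the dot-segment removal described above. It uses only the JDK's `java.net.URI.normalize()` and is not taken from the crawler's code; the class and method names (`DotSegmentDemo`, `removeDotSegments`) are made up for illustration.

```java
import java.net.URI;

// Illustrative only: not part of Norconex. Names are hypothetical.
final class DotSegmentDemo {

    // Collapses "." and ".." path segments with the JDK's URI.normalize().
    static String removeDotSegments(String url) {
        return URI.create(url).normalize().toString();
    }

    public static void main(String[] args) {
        // Prints: http://example.com/some-dir/file
        System.out.println(
                removeDotSegments("http://example.com/some-dir/dir2/../file"));
        // Prints: http://example.com/some-dir/file
        System.out.println(
                removeDotSegments("http://example.com/some-dir/./file"));
    }
}
```

Both inputs end up as the same reference, which is what would keep the queue and the committer free of duplicates.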

Example config:

<?xml version="1.0" encoding="UTF-8"?>
<httpcollector id="test-collector">
  <crawlers>
    <crawler id="test-crawler">
      <startURLs>
        <url>https://herimedia.com/norconex-test/this.is.not.a.file/test</url>
      </startURLs>
    </crawler>
  </crawlers>
</httpcollector>

Note that https://herimedia.com/norconex-test/this.is.not.a.file/test contains three links with /../ segments:

<!DOCTYPE html>
<html lang="en" class=" is-copy-enabled">
  <head>
    <title>Norconex test</title>
  </head>

  <body>
    Norconex HTTP Collector Test Site
    <a href="../../norconex-test-large.html">Charset Test Large</a>
    <a href="../../norconex-test.html">Charset Test Small</a>
    <a href="../test">No Extension Test</a>
  </body>
</html>

As a result, the following happens, for example:

  1. queues https://herimedia.com/norconex-test/this.is.not.a.file/../../norconex-test.html.
  2. This file links back to norconex-test/this.is.not.a.file/test.
  3. The crawler then interprets that as https://herimedia.com/norconex-test/this.is.not.a.file/../../norconex-test/this.is.not.a.file/test. This file has already been crawled, but that isn't recognized because the URL is not normalized.
  4. Now it gets even crazier because the crawler processes the links again and now e.g. queues https://herimedia.com/norconex-test/this.is.not.a.file/../../norconex-test/this.is.not.a.file/../../norconex-test.html.
  5. This will repeat indefinitely with more and more norconex-test/this.is.not.a.file/../../ repetitions until the crawler gets stopped.

Instead what should happen is that in (1) the crawler right away turns https://herimedia.com/norconex-test/this.is.not.a.file/../../norconex-test.html into https://herimedia.com/norconex-test.html even before queueing.
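The loop in steps 1 to 5 can be reproduced with a small simulation. This is a sketch under assumptions, not the crawler's actual logic: `resolveNaively` is a hypothetical stand-in that appends a relative link to the base URL's directory without touching dot segments, and `crawl` simply alternates between the two links from the test pages. The normalizing variant relies on `java.net.URI.resolve()` and `normalize()` from the JDK.

```java
import java.net.URI;
import java.util.LinkedHashSet;
import java.util.Set;

// Illustrative only: not part of Norconex. All names are hypothetical.
final class CrawlLoopDemo {

    // Appends the relative link to the base URL's directory and leaves
    // any "../" segments in place (assumed behaviour, for illustration).
    static String resolveNaively(String base, String relative) {
        return base.substring(0, base.lastIndexOf('/') + 1) + relative;
    }

    // Resolves the link and collapses dot segments before it is queued.
    static String resolveNormalized(String base, String relative) {
        return URI.create(base).resolve(relative).normalize().toString();
    }

    // Follows test -> norconex-test.html -> test -> ... for the given
    // number of hops and returns the distinct references seen.
    static Set<String> crawl(String start, int hops, boolean normalize) {
        Set<String> seen = new LinkedHashSet<>();
        String current = start;
        seen.add(current);
        for (int i = 0; i < hops; i++) {
            // Even hops follow the link to the HTML page, odd hops the link back.
            String link = (i % 2 == 0)
                    ? "../../norconex-test.html"
                    : "norconex-test/this.is.not.a.file/test";
            current = normalize
                    ? resolveNormalized(current, link)
                    : resolveNaively(current, link);
            seen.add(current);
        }
        return seen;
    }

    public static void main(String[] args) {
        String start =
                "https://herimedia.com/norconex-test/this.is.not.a.file/test";
        // Without normalization every hop yields a new, longer reference.
        crawl(start, 4, false).forEach(System.out::println);
        // With normalization only two distinct references remain.
        crawl(start, 4, true).forEach(System.out::println);
    }
}
```

In the naive run each hop adds another norconex-test/this.is.not.a.file/../../ repetition, so the set of seen references never stops growing. In the normalized run the same four hops produce only the start URL and https://herimedia.com/norconex-test.html.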

See here for a sample log.

essiembre commented 8 years ago

This is already supported. You have to enable it in the URL normalization phase. Have a look at GenericURLNormalizer. You can enable it by adding removeDotSegments like this:

  <urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
    <normalizations>
        removeFragment, lowerCaseSchemeHost, upperCaseEscapeSequence,
        decodeUnreservedCharacters, removeDefaultPort, encodeNonURICharacters,
        removeDotSegments
    </normalizations>
  </urlNormalizer>

The other normalizations in the above example are the default ones always applied.

removeDotSegments is not part of the defaults because there is no guarantee that it preserves URL semantics for every website out there.

Please try with the above config snippet and let me know.

niels commented 8 years ago

Ah yes, of course. I have the normalizations in all my production configs but was playing around with minimalistic test configs when I ran into this (non-) issue.

Many thanks for the prompt sanity check!