Norconex / collector-core

Collector-related code shared between different collector implementations
http://www.norconex.com/collectors/collector-core/
Apache License 2.0

[Bugfix] When checking references for file extensions, only consider extensions at the end of the reference #4

Closed niels closed 8 years ago

niels commented 8 years ago

Previously, a reference such as http://example.com/some.dir/file would have matched a dir/file extension. Furthermore, extensions were looked for in the entire URL rather than only in its path. For example,

  <referenceFilters>
    <filter
      class="${filterExtension}"
      onMatch="exclude"
      caseSensitive="false"
    >com</filter>
  </referenceFilters>
  […]
  <startUrls>
    <url>http://example.com</url>
  </startUrls>

would have meant that the crawl finishes immediately, because the host name in http://example.com would have matched the .com exclusion pattern.

This patch makes three changes to fix the extension filtering behaviour:

  1. Extensions are only taken into account when they occur within the path of the full reference URL.
  2. Extensions must occur at the end of the path.
  3. Extensions may only contain (ASCII) letters, digits, and dots. (Allowing dots permits example.subtype.xml-style filenames.)
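
The three rules above can be illustrated with a minimal Java sketch. This is not the code from this patch: the class name `ExtensionMatchSketch` and the method `hasExtension` are hypothetical, and it assumes references are URLs that `java.net.URI` can parse (anything unparseable is simply treated as a non-match here).

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical illustration only; not the actual collector-core implementation.
public class ExtensionMatchSketch {

    static boolean hasExtension(String reference, String extension, boolean caseSensitive) {
        // Rule 3: an extension may only contain ASCII letters, digits, and dots.
        if (!extension.matches("[A-Za-z0-9.]+")) {
            return false;
        }
        // Rule 1: only the path of the URL is considered (no host, query, or fragment).
        String path;
        try {
            path = new URI(reference).getPath();
        } catch (URISyntaxException e) {
            return false; // simplification (assumption): unparseable references never match
        }
        if (path == null) {
            return false; // opaque URIs such as mailto: have no path
        }
        // Rule 2: the extension must occur at the end of the path.
        String fileName = path.substring(path.lastIndexOf('/') + 1);
        String suffix = "." + extension;
        if (!caseSensitive) {
            fileName = fileName.toLowerCase(Locale.ROOT);
            suffix = suffix.toLowerCase(Locale.ROOT);
        }
        return fileName.endsWith(suffix);
    }

    public static void main(String[] args) {
        // The host name is not part of the path, so ".com" no longer matches.
        System.out.println(hasExtension("http://example.com", "com", false)); // false
        // "dir/file" contains a slash and is rejected as an extension.
        System.out.println(hasExtension("http://example.com/some.dir/file", "dir/file", false)); // false
        // Dots are allowed, so multi-part extensions can be matched.
        System.out.println(hasExtension("http://example.com/example.subtype.xml", "subtype.xml", false)); // true
        // The query string is ignored.
        System.out.println(hasExtension("http://example.com/page.HTML?x=a.com", "com", false)); // false
    }
}
```

In this sketch a file named example.subtype.xml matches both the xml and the subtype.xml extension; whether the patched filter behaves the same way is an assumption, not something stated in this description.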

This fixes #2.