Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc.). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

HierarchyTagger does not add separator #91

Closed: tendres1980 closed this issue 4 years ago

tendres1980 commented 5 years ago

First of all: thanks for the great product! It is a pleasure to use the Norconex crawler!

I have found an issue with the HierarchyTagger: it does not include the specified separator in the generated field values. For example, applying the HierarchyTagger to the document URL with "/" as the separator and the URL "http://example.com/foo/bar" results in the values: "http:", "http:", "http:example.com", "http:example.comfoo", "http:example.comfoobar".

Looking at the code for the HierarchyTagger, I found that Apache Commons Lang StringUtils is used to tokenize the URL, and that the code depends on the separator being present in the tokens. However, the StringUtils documentation says that the separator is not included in the resulting String array: https://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/StringUtils.html#splitByWholeSeparatorPreserveAllTokens-java.lang.String-java.lang.String- so I believe this to be the problem.
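To illustrate the behaviour described above, here is a minimal sketch in plain Java. It is not the actual HierarchyTagger code: the class `HierarchySketch`, the method `hierarchy`, and the `reinsertSeparator` flag are hypothetical names made up for this example, and `String.split` with a quoted separator and a limit of -1 stands in for the Commons Lang call (both drop the separator and keep empty tokens for this input). It shows how concatenating the tokens without re-adding the separator produces the values reported above, and how re-adding it gives the expected hierarchy.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch for illustration only; not the Norconex implementation.
class HierarchySketch {

    // Builds one cumulative value per token of "value".
    // reinsertSeparator is an assumed flag used here to contrast the two behaviours.
    static List<String> hierarchy(String value, String separator, boolean reinsertSeparator) {
        // The separator is NOT part of the returned tokens.
        // A limit of -1 keeps empty tokens, such as the one between the two slashes of "//".
        String[] tokens = value.split(Pattern.quote(separator), -1);

        List<String> levels = new ArrayList<>();
        StringBuilder path = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            // Without this step the separator is lost from every level.
            if (i > 0 && reinsertSeparator) {
                path.append(separator);
            }
            path.append(tokens[i]);
            levels.add(path.toString());
        }
        return levels;
    }

    public static void main(String[] args) {
        String url = "http://example.com/foo/bar";
        // Reported behaviour: [http:, http:, http:example.com, http:example.comfoo, http:example.comfoobar]
        System.out.println(hierarchy(url, "/", false));
        // Expected behaviour: [http:, http:/, http://example.com, http://example.com/foo, http://example.com/foo/bar]
        System.out.println(hierarchy(url, "/", true));
    }
}
```

Note that the empty token between the two slashes of "http://" is why "http:" appears twice in the reported values.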

essiembre commented 5 years ago

Not sure why it went unreported for so long. Definitely a problem. Will fix.

essiembre commented 5 years ago

A fix is now available in the latest importer snapshot (which also made it into the latest FS and HTTP Collectors). In addition, two boolean flags were added: "keepEmptySegments" and "regex".