Closed tendres1980 closed 4 years ago
Not sure why it went unreported for so long. Definitely a problem. Will fix.
A fix is now available in the latest Importer snapshot (which also made it into the latest FS and HTTP Collector snapshots). In addition, two boolean flags were added: "keepEmptySegments" and "regex".
First of all: thanks for the great product! It is a pleasure to use the Norconex crawler!
I have found an issue with the HierarchyTagger: it does not include the specified separator in the generated field values. For example, applying the HierarchyTagger to the document URL with "/" as the separator and the URL "http://example.com/foo/bar" results in the values "http:", "http:", "http:example.com", "http:example.comfoo", "http:example.comfoobar". Looking at the code for the HierarchyTagger, I found that Apache Commons Lang StringUtils is used to tokenize the URL, and that the code depends on the separator being present in the tokens. However, the StringUtils documentation says that the separator is not included in the resulting String array: https://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/StringUtils.html#splitByWholeSeparatorPreserveAllTokens-java.lang.String-java.lang.String- so I believe this to be the problem.
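To illustrate the behaviour described above, here is a minimal, self-contained sketch. It is not the actual HierarchyTagger code: the class and method names (`HierarchySketch`, `withoutSeparator`, `withSeparator`) are hypothetical, and it assumes that `String.split` with a quoted separator and a limit of `-1` is a fair stand-in for `StringUtils.splitByWholeSeparatorPreserveAllTokens`, since both drop the separator and keep empty tokens for this input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch, not the Norconex implementation.
class HierarchySketch {

    // Splits on the literal separator. The separator itself is NOT part of
    // the returned tokens, and empty tokens are kept (limit -1).
    static String[] tokens(String value, String separator) {
        return value.split(Pattern.quote(separator), -1);
    }

    // Mimics the reported problem: tokens are concatenated as if they
    // still contained the separator.
    static List<String> withoutSeparator(String value, String separator) {
        List<String> levels = new ArrayList<>();
        StringBuilder path = new StringBuilder();
        for (String token : tokens(value, separator)) {
            path.append(token);
            levels.add(path.toString());
        }
        return levels;
    }

    // One possible correction: re-insert the separator between tokens.
    static List<String> withSeparator(String value, String separator) {
        List<String> levels = new ArrayList<>();
        StringBuilder path = new StringBuilder();
        String[] parts = tokens(value, separator);
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                path.append(separator);
            }
            path.append(parts[i]);
            levels.add(path.toString());
        }
        return levels;
    }

    public static void main(String[] args) {
        String url = "http://example.com/foo/bar";
        System.out.println(withoutSeparator(url, "/"));
        System.out.println(withSeparator(url, "/"));
    }
}
```

For the example URL, `withoutSeparator` reproduces the values listed in this report, while `withSeparator` yields "http:", "http:/", "http://example.com", "http://example.com/foo", "http://example.com/foo/bar". Whether the empty segment between the two slashes should produce its own level is a separate design question.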