Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc.). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

SplitTagger Retains original String [Importer 2.9.0] #97

Closed by jshifrin25 4 years ago

jshifrin25 commented 5 years ago

I would like to use the SplitTagger to replace a metadata field's comma-delimited string with multiple values and remove the original string. Currently, the SplitTagger appends the array of split strings to a list that still contains the original string. I would like an option to remove the original string from that list.
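
To make the requested behavior concrete, here is a minimal sketch in plain Java of "split and replace" semantics. It does not use the Importer API: the class `SplitFieldSketch`, the method `splitAndReplace`, and the use of a `Map<String, List<String>>` as a stand-in for document metadata are all assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Norconex Importer.
class SplitFieldSketch {

    // Splits every value of `field` on `separatorRegex` and replaces the
    // field's values with the resulting pieces, so the original delimited
    // string is not retained alongside them.
    static void splitAndReplace(
            Map<String, List<String>> metadata,
            String field,
            String separatorRegex) {
        List<String> originals = metadata.get(field);
        if (originals == null) {
            return; // nothing to split
        }
        Pattern separator = Pattern.compile(separatorRegex);
        List<String> pieces = new ArrayList<>();
        for (String value : originals) {
            for (String piece : separator.split(value.trim())) {
                if (!piece.isEmpty()) {
                    pieces.add(piece);
                }
            }
        }
        // Overwrite rather than append: this is the behavior being requested.
        metadata.put(field, pieces);
    }

    public static void main(String[] args) {
        Map<String, List<String>> metadata = new HashMap<>();
        metadata.put("keywords",
                new ArrayList<>(Arrays.asList("java, parsing ,metadata")));
        splitAndReplace(metadata, "keywords", "\\s*,\\s*");
        System.out.println(metadata.get("keywords")); // [java, parsing, metadata]
    }
}
```

The behavior reported in this issue would correspond to the field ending up with four values (the original string plus the three pieces) instead of three.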

essiembre commented 5 years ago

I cannot reproduce. Which version are you using? Can you send me a sample config that has the minimum settings to reproduce your issue?

jshifrin25 commented 5 years ago

I am using version 2.9.0 of the Importer with the following configuration:

<tagger class="${taggerBase}.impl.DOMTagger">
    <restrictTo field="document.contentType">text/html</restrictTo>
    <dom selector="meta#MetaKeywords" toField="keywords" extract="attr(content)" overwrite="true"/>
</tagger>

<tagger class="${taggerBase}.impl.SplitTagger">
    <split fromField="keywords" regex="true">
        <separator>\s*,\s*</separator>
    </split>
</tagger>

essiembre commented 5 years ago

I tried reproducing again without success. Can you please attach an HTML file that causes the issue for you?