Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

DomTagger does not respect overwrite=true #90

Closed: ohtwadi closed this issue 4 years ago

ohtwadi commented 5 years ago

Hello,

I wanted to set the title from a specific div, but noticed that using DOMTagger with overwrite="true" actually appends the new value to the existing value in the title field instead of replacing it.

<preParseHandlers>
  <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
    <dom selector="div#ArticleDetailTitle" toField="title" overwrite="true"/>
  </tagger>
</preParseHandlers>

I'm working around this by copying to a temp field and then copying that temp field to the actual title field. CopyTagger respects overwrite=true just fine.

<postParseHandlers>
  <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
    <dom selector="div#ArticleDetailTitle" toField="title-temp" overwrite="true"/>
  </tagger>

  <tagger class="com.norconex.importer.handler.tagger.impl.CopyTagger">
    <copy fromField="title-temp" toField="title" overwrite="true"/>
  </tagger>
</postParseHandlers>

essiembre commented 5 years ago

I suspect the overwrite itself works, but when the document is parsed a title is also extracted, and that title gets added to the one you created with DOMTagger. Your workaround is a good one, or you can also look at ForceSingleValueTagger as a post-parse handler.
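
For reference, a minimal sketch of the ForceSingleValueTagger alternative might look like the following. It is not taken from this thread: the singleValue element and its action values (keepFirst, keepLast, mergeWith:<separator>) are written from recollection of the Importer 2.x documentation, so check them against the version you run. It also assumes the DOMTagger value is the first one stored in the title field; if the parser's title turns out to come first, keepLast would be the one to try.

```xml
<preParseHandlers>
  <!-- Same DOMTagger as in the original report: adds the div text to "title". -->
  <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
    <dom selector="div#ArticleDetailTitle" toField="title" overwrite="true"/>
  </tagger>
</preParseHandlers>

<postParseHandlers>
  <!-- Assumed syntax: reduce "title" to a single value after parsing.
       keepFirst assumes the DOMTagger value precedes the parser-extracted one. -->
  <tagger class="com.norconex.importer.handler.tagger.impl.ForceSingleValueTagger">
    <singleValue field="title" action="keepFirst"/>
  </tagger>
</postParseHandlers>
```

Compared with the temp-field workaround, this keeps DOMTagger as a pre-parse handler and only cleans up the duplicate title once parsing is done.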