Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc.). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

FeatureRequest: MergeTagger #32

Closed liar666 closed 8 years ago

liar666 commented 8 years ago

Hi,

For complicated reasons, I need a MergeTagger that would take the arguments fromField1, fromField2, toField and separator, and merge/concatenate, value by value, the contents of fromField1 and fromField2 into toField, optionally separating each pair with separator.

Example:
EXP_FIRST_NAME=Fabien^|~Albert
EXP_LAST_NAME=Coco^|~Rico
-- separator=" " -->
EXP_NAME=Fabien Coco^|~Albert Rico

I'm currently trying to write my own quick-and-dirty version, but Norconex Importer would probably benefit from directly including a clean and generic version!

If the number of values in the two source fields differs, it could raise an exception or accept a default value?

The prototype of the configuration would be:

 <tagger class="com.norconex.importer.handler.tagger.impl.MergeTagger">
     <merge fromField1="sourceFieldName" fromField2="sourceFieldName" toField="targetFieldName" separator="sep">
     </merge>
 </tagger>
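
The value-by-value merge described above could be sketched in plain Java roughly as follows. This is only an illustration of the requested behaviour, not the attached MergeTagger.txt and not Norconex Importer code: the class name MergeSketch, the method mergeValues and the defaultValue parameter are hypothetical, and it assumes that "^|~" in the example simply marks the boundary between the separate values of a multi-valued field. It deliberately avoids the Importer tagger API, so wiring it into a real tagger is left open.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Norconex Importer.
final class MergeSketch {

    // Concatenates first[i] + separator + second[i] for every index i.
    // If the two lists differ in size, the shorter one is padded with
    // defaultValue; when defaultValue is null an exception is thrown instead.
    static List<String> mergeValues(List<String> first, List<String> second,
            String separator, String defaultValue) {
        if (first.size() != second.size() && defaultValue == null) {
            throw new IllegalArgumentException(
                    "Source fields have different numbers of values: "
                    + first.size() + " vs " + second.size());
        }
        String sep = separator == null ? "" : separator;
        int size = Math.max(first.size(), second.size());
        List<String> merged = new ArrayList<>(size);
        for (int i = 0; i < size; i++) {
            String left = i < first.size() ? first.get(i) : defaultValue;
            String right = i < second.size() ? second.get(i) : defaultValue;
            merged.add(left + sep + right);
        }
        return merged;
    }

    public static void main(String[] args) {
        // Values taken from the example in the request above.
        List<String> firstNames = Arrays.asList("Fabien", "Albert");
        List<String> lastNames = Arrays.asList("Coco", "Rico");
        // Prints [Fabien Coco, Albert Rico]
        System.out.println(mergeValues(firstNames, lastNames, " ", null));
    }
}
```

Passing null as defaultValue corresponds to the "raise an exception" option, while any non-null string corresponds to the "accept a default value" option.
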
liar666 commented 8 years ago

Here's a rough draft of my code... Seems to be working, but far from clean/complete :) You're welcome to re-use it!

MergeTagger.txt

essiembre commented 8 years ago

Thanks for sharing!