Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

StripBeforeTransformer not emitting correctly? #111

Open svanschalkwyk opened 4 years ago

svanschalkwyk commented 4 years ago

I have a field such as the one below:

<meta name="p_temp_id">/ip/Bolthouse-Farms-Organics-Premium-Matchstix-Julienne-Carrots-10-oz/44933639</meta>

With the configuration below, I expect "44933639" to be written back to the same field. Instead, the original field value is returned.

<transformer class="com.norconex.importer.handler.transformer.impl.StripBeforeTransformer" inclusive="true" caseSensitive="false">
    <restrictTo caseSensitive="false" field="p_temp_id">.*</restrictTo>
    <stripBeforeRegex>\d{6,}$</stripBeforeRegex>
</transformer>

essiembre commented 4 years ago

Transformers apply to the document content, not to metadata fields. To modify a field value, use a tagger, such as the ReplaceTagger.
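
To check what the regular expression from the question would yield on the field value, independent of any Importer handler, a minimal sketch with plain `java.util.regex` may help. It does not use or represent the ReplaceTagger API; the class name `TrailingIdExample` and the method `stripBeforeTrailingId` are hypothetical names chosen for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TrailingIdExample {

    // Same pattern as in the question: six or more digits at the end of the value.
    private static final Pattern TRAILING_ID = Pattern.compile("\\d{6,}$");

    // Hypothetical helper: returns the trailing digit run, dropping everything
    // before it. When the pattern does not match, the input is returned unchanged.
    static String stripBeforeTrailingId(String value) {
        Matcher m = TRAILING_ID.matcher(value);
        return m.find() ? m.group() : value;
    }

    public static void main(String[] args) {
        String value = "/ip/Bolthouse-Farms-Organics-Premium-Matchstix-"
                + "Julienne-Carrots-10-oz/44933639";
        // Prints the trailing identifier extracted from the field value.
        System.out.println(stripBeforeTrailingId(value));
    }
}
```

The pattern itself isolates the trailing identifier, so the same expression should be reusable once the handler is one that operates on fields rather than on content.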