Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

using ScriptTransformer with a tagged field #34

Closed angelo337 closed 8 years ago

angelo337 commented 8 years ago

Hi there, I am having an issue with a transformation I am trying to put in place.

After I capture some information from the content field like this:

<pattern field="consorcios_UT" >
  ((([Uu](NION|nion)\s[Tt](EMPORAL|emporal))|([Cc](ONSORCIO|onsorcio)))\s)+(([A-Z][\p{L}]*[\p{S}]*)+\s)
</pattern>

I am trying to use this field (consorcios_UT) in a script like this:

<transformer class="com.norconex.importer.handler.transformer.impl.ScriptTransformer">
    <script><![CDATA[
        consorcios_UT = consorcios_UT.replace(/consorcio/g, 'consorcio');
        /*return*/ consorcios_UT;
        ]]></script>
</transformer>

But I get an error saying that there is no such field as "consorcios_UT" defined.

My question: how should I process a field that was just created, in order to normalize all the information and avoid case mismatches, or to strip spaces and punctuation signs?

Thanks a lot, Angelo

essiembre commented 8 years ago

The message is accurate, "consorcios_UT" is not a variable. Refer to the ScriptTransformer documentation to find out what variables are made available. In your case, you have to grab your value from the "metadata" variable. Usage could look like this:

// to obtain a field value:
consorcios_UT = metadata.getString('consorcios_UT');

// do what you need with your variable here

// to store a new field value:
metadata.setString('consorcios_UT', consorcios_UT);

Also, you may be able to use existing tagger classes to do what you want without resorting to scripting. For instance, to do a search and replace on a field, you can use the ReplaceTagger.
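
For illustration, here is a minimal sketch of the kind of normalization asked about (lower-casing, stripping punctuation, collapsing spaces), built on the getString/setString calls shown above. The `metadata` object below is only a hypothetical stand-in so the snippet can run outside the Importer, `normalizeField` is a made-up helper name, and the sample value is invented. The `String(...)` wrapper is a precaution, on the assumption that the script engine might hand back a Java string rather than a JavaScript one.

```javascript
// Hypothetical stand-in for the "metadata" variable that ScriptTransformer
// provides. Only the getString/setString calls from the reply above are
// mimicked; inside the Importer, drop this object and use the real one.
var metadata = {
    fields: { consorcios_UT: '  CONSORCIO  Vial del Norte, ' },
    getString: function (name) {
        return this.fields.hasOwnProperty(name) ? this.fields[name] : null;
    },
    setString: function (name, value) {
        this.fields[name] = value;
    }
};

// Hypothetical helper: lower-case the value, strip common punctuation,
// collapse runs of whitespace into one space, and trim both ends.
function normalizeField(value) {
    return String(value)
        .toLowerCase()
        .replace(/[.,;:!?"'()]/g, '')
        .replace(/\s+/g, ' ')
        .trim();
}

// Read the field, normalize it, and store it back under the same name.
// The null check covers documents where the pattern did not match.
var raw = metadata.getString('consorcios_UT');
if (raw !== null) {
    metadata.setString('consorcios_UT', normalizeField(raw));
}
```

Inside a ScriptTransformer, the script would presumably still need to end with the content expression, as in the original snippet's `/*return*/` line, so that the document body passes through unchanged.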

angelo337 commented 8 years ago

Thanks a lot for your help. It's working now, and I am on the right track to test my corpus. Best regards