Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc.). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

Create a document tagger that extracts both field values and names. #52

Closed: essiembre closed this issue 7 years ago

essiembre commented 7 years ago

Create a tagger that allows extracting both the field names and field values as pairs.

The following is a use case describing the requirement, reported here: https://github.com/Norconex/collector-http/issues/372#issuecomment-322978476.

Assume the following in a page:

<div class="field">First Name</div>
<div class="value">Joe</div>
<div class="field">Last Name</div>
<div class="value">Dalton</div>

The tagger should be able to extract these fields:

First Name = Joe
Last Name = Dalton

essiembre commented 7 years ago

This feature is now available in the latest Importer snapshot release.

The TextPatternTagger now supports an optional fieldGroup attribute that tells which regular expression match group to use for the field name. For instance, the above use case can be addressed like this:

  <preParseHandlers>
    <tagger class="com.norconex.importer.handler.tagger.impl.TextPatternTagger" >
      <pattern fieldGroup="1" valueGroup="2"><![CDATA[
        <div.*?class="field".*?>(.*?)</div>.*?<div.*?class="value">(.*?)</div>
      ]]></pattern>
    </tagger>  
  </preParseHandlers>
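
To see what the two match groups capture, here is a minimal standalone sketch that uses plain java.util.regex rather than the Importer itself. The class name FieldValuePairs and the extract method are hypothetical, and the DOTALL flag is an assumption made so that ".*?" can span the line breaks in the sample markup; it is not a statement about how TextPatternTagger compiles its patterns.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Norconex Importer.
class FieldValuePairs {

    // Same expression as in the configuration above. DOTALL is an assumption
    // made here so ".*?" can match across the line breaks between the divs.
    private static final Pattern PAIR = Pattern.compile(
            "<div.*?class=\"field\".*?>(.*?)</div>.*?"
          + "<div.*?class=\"value\">(.*?)</div>",
            Pattern.DOTALL);

    // Uses match group 1 as the field name and match group 2 as its value.
    static Map<String, String> extract(String html) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(html);
        while (m.find()) {
            fields.put(m.group(1).trim(), m.group(2).trim());
        }
        return fields;
    }

    public static void main(String[] args) {
        String html = "<div class=\"field\">First Name</div>\n"
                + "<div class=\"value\">Joe</div>\n"
                + "<div class=\"field\">Last Name</div>\n"
                + "<div class=\"value\">Dalton</div>";
        // Print each extracted pair as "name = value".
        extract(html).forEach(
                (name, value) -> System.out.println(name + " = " + value));
    }
}
```

Run against the sample markup from the use case, this prints "First Name = Joe" and "Last Name = Dalton". The sketch keeps only one value per field name, which is enough to illustrate the grouping but drops repeated names.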

@akshaybijawe and/or @OkkeKlein, can you please test and confirm (as this new option should address both https://github.com/Norconex/collector-http/issues/372 and #54)?

For command-line usage, I recommend you use the install script since there might be other jars updated in the process.

OkkeKlein commented 7 years ago

I tested it and it works.

essiembre commented 7 years ago

Great, thanks!

fleitonSearch commented 7 years ago

Hi Pascal, I'm trying to run it but I'm getting this error:

ERROR (XML Validation) TextPatternTagger: cvc-complex-type.3.2.2: Attribute 'fieldGroup' is not allowed to appear in element 'pattern'.
ERROR [XMLConfigurationUtil$LogErrorHandler] (XML Validation) TextPatternTagger: cvc-complex-type.3.2.2: Attribute 'fieldGroup' is not allowed to appear in element 'pattern'.
ERROR (XML Validation) TextPatternTagger: cvc-complex-type.3.2.2: Attribute 'valueGroup' is not allowed to appear in element 'pattern'.
ERROR [XMLConfigurationUtil$LogErrorHandler] (XML Validation) TextPatternTagger: cvc-complex-type.3.2.2: Attribute 'valueGroup' is not allowed to appear in element 'pattern'.
ERROR (XML Validation) TextPatternTagger: cvc-complex-type.4: Attribute 'field' must appear on element 'pattern'.
ERROR [XMLConfigurationUtil$LogErrorHandler] (XML Validation) TextPatternTagger: cvc-complex-type.4: Attribute 'field' must appear on element 'pattern'.
There were 3 XML configuration error(s).

And this is my config file (attached).

I'm using the latest snapshot version as you mentioned. I downloaded it this morning.

Any idea?

essiembre commented 7 years ago

I just downloaded it and do not see this problem. Did you download the Importer snapshot (and not a Collector)? Also, did you run the install script? Maybe you have two versions of the Importer in your classpath and the older one gets picked up? Look in the "lib" folder for norconex-importer-*.
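
If checking the folder by hand is inconvenient, the same check can be sketched in a few lines of Java. This is only an illustration: the class name ImporterJarCheck and the default "lib" path are assumptions, and the only thing it does is list files matching norconex-importer-*.jar so that a duplicate stands out.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Norconex Importer.
class ImporterJarCheck {

    // Returns the sorted names of all norconex-importer-*.jar files in libDir.
    static List<String> findImporterJars(Path libDir) {
        List<String> jars = new ArrayList<>();
        try (DirectoryStream<Path> stream =
                Files.newDirectoryStream(libDir, "norconex-importer-*.jar")) {
            for (Path jar : stream) {
                jars.add(jar.getFileName().toString());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Collections.sort(jars);
        return jars;
    }

    public static void main(String[] args) {
        // Assumes the "lib" folder of the installation unless a path is given.
        Path libDir = Paths.get(args.length > 0 ? args[0] : "lib");
        List<String> jars = findImporterJars(libDir);
        jars.forEach(System.out::println);
        if (jars.size() > 1) {
            System.out.println("More than one Importer jar found; "
                    + "an older one may be picked up first.");
        }
    }
}
```

More than one line of output suggests that an older Importer jar is still present next to the snapshot.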

fleitonSearch commented 7 years ago

It worked fine. I was running the install in the wrong path. Sorry about that. Thanks! :)