essiembre closed this issue 7 years ago.
This feature is now available in the latest Importer snapshot release.
The TextPatternTagger now supports an optional fieldGroup attribute that specifies which regular expression match group to use for the field name. For instance, the use case from this issue can be addressed like this:
<preParseHandlers>
<tagger class="com.norconex.importer.handler.tagger.impl.TextPatternTagger" >
<pattern fieldGroup="1" valueGroup="2"><![CDATA[
<div.*?class="field".*?>(.*?)</div>.*?<div.*?class="value">(.*?)</div>
]]></pattern>
</tagger>
</preParseHandlers>
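To illustrate what the two group attributes select, here is a minimal standalone sketch using plain java.util.regex. It is not the TextPatternTagger implementation; the class name FieldGroupSketch, the extract method, the sample HTML, and the DOTALL flag are all assumptions made for illustration. Only the regular expression itself is taken from the configuration above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: shows how match group 1 can supply a field name
// and match group 2 its value. Not the actual tagger code.
public class FieldGroupSketch {

    // Same expression as in the configuration. DOTALL is an assumption
    // so that "." also spans line breaks between the two divs.
    static final Pattern PATTERN = Pattern.compile(
            "<div.*?class=\"field\".*?>(.*?)</div>.*?<div.*?class=\"value\">(.*?)</div>",
            Pattern.DOTALL);

    // Collects one name/value pair per match. A later match with the same
    // name overwrites the earlier one in this simplified version.
    static Map<String, String> extract(String html, int fieldGroup, int valueGroup) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = PATTERN.matcher(html);
        while (m.find()) {
            fields.put(m.group(fieldGroup).trim(), m.group(valueGroup).trim());
        }
        return fields;
    }

    public static void main(String[] args) {
        // Made-up sample markup, only for demonstration.
        String html =
                "<div class=\"field\">First Name</div><div class=\"value\">Joe</div>\n"
              + "<div class=\"field\">Last Name</div><div class=\"value\">Dalton</div>";
        // prints {First Name=Joe, Last Name=Dalton}
        System.out.println(extract(html, 1, 2));
    }
}
```

Swapping the two group numbers in the call (2, 1) would use the second group as the name and the first as the value, which is the kind of flexibility the group attributes are meant to give.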
@akshaybijawe and/or @OkkeKlein, can you please test and confirm (as this new option should address both https://github.com/Norconex/collector-http/issues/372 and #54)?
For command-line usage, I recommend you use the install script since there might be other jars updated in the process.
I tested it and it works.
Great, thanks!
Hi Pascal, I'm trying to run it but I'm getting this error:
ERROR (XML Validation) TextPatternTagger: cvc-complex-type.3.2.2: Attribute 'fieldGroup' is not allowed to appear in element 'pattern'.
ERROR [XMLConfigurationUtil$LogErrorHandler] (XML Validation) TextPatternTagger: cvc-complex-type.3.2.2: Attribute 'fieldGroup' is not allowed to appear in element 'pattern'.
ERROR (XML Validation) TextPatternTagger: cvc-complex-type.3.2.2: Attribute 'valueGroup' is not allowed to appear in element 'pattern'.
ERROR [XMLConfigurationUtil$LogErrorHandler] (XML Validation) TextPatternTagger: cvc-complex-type.3.2.2: Attribute 'valueGroup' is not allowed to appear in element 'pattern'.
ERROR (XML Validation) TextPatternTagger: cvc-complex-type.4: Attribute 'field' must appear on element 'pattern'.
ERROR [XMLConfigurationUtil$LogErrorHandler] (XML Validation) TextPatternTagger: cvc-complex-type.4: Attribute 'field' must appear on element 'pattern'.
There were 3 XML configuration error(s).
And this is my config file (attached).
I'm using the latest snapshot version as you mentioned. I downloaded it this morning.
Any idea?
I just downloaded it and do not see this problem. Did you download the Importer snapshot (and not a Collector)? Also, did you run the install script? Maybe you have two versions of the Importer in your classpath and the older one gets picked up. Look in the "lib" folder for norconex-importer-* files.
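If it helps, the duplicate-jar check can be done with a few lines of Java. This is only a sketch: the class name ImporterJarCheck and the default "lib" path are assumptions, so pass the actual lib folder of your installation as the first argument.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: lists files named norconex-importer-* in a lib folder.
public class ImporterJarCheck {

    // Returns the sorted names of all entries matching the glob.
    static List<String> findImporterJars(Path libDir) throws IOException {
        List<String> names = new ArrayList<>();
        try (DirectoryStream<Path> jars =
                Files.newDirectoryStream(libDir, "norconex-importer-*")) {
            for (Path jar : jars) {
                names.add(jar.getFileName().toString());
            }
        }
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Assumed default; replace with the lib folder you installed into.
        Path libDir = Paths.get(args.length > 0 ? args[0] : "lib");
        List<String> jars = findImporterJars(libDir);
        jars.forEach(System.out::println);
        if (jars.size() > 1) {
            System.out.println(
                    "More than one importer jar found; an older one may be picked up.");
        }
    }
}
```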
It worked fine. I was running the install in the wrong path. Sorry about that. Thanks! :)
Create a tagger that allows extracting both the field names and field values as pairs.
The following is a use case describing the requirement, reported here: https://github.com/Norconex/collector-http/issues/372#issuecomment-322978476.
Assume the following in a page:
The tagger should be able to extract these fields: