Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Attribute 'valueGroup' is not allowed to appear in element 'pattern'. #415

Closed: dhildreth closed this issue 7 years ago

dhildreth commented 7 years ago

I'm getting a strange error when attempting to use TextPatternTagger. I'm hoping it's just something I'm doing wrong. My goal is to extract a thumbnail image URL for each page so it can be displayed in search results. These images are identified by itemprop="image" followed by the src attribute holding the URL. I'd like to save that URL to a field called thumbnail_url. The tagger is in my preParseHandlers section:

<tagger class="com.norconex.importer.handler.tagger.impl.TextPatternTagger" >
    <pattern field="thumbnail_url" valueGroup="1"><![CDATA[itemprop="image" src="(.*?)">]]></pattern>
</tagger>
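For readers unfamiliar with capture groups, here is a minimal standalone sketch of what the regular expression above is meant to capture. It uses plain java.util.regex rather than the Norconex Importer, so it only illustrates the regex behavior and is not how TextPatternTagger is implemented. The class name, method name, and sample HTML are hypothetical and chosen for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Norconex.
public class ThumbnailPatternDemo {

    // Same expression as the CDATA content of the <pattern> element above.
    public static final Pattern IMAGE_SRC =
            Pattern.compile("itemprop=\"image\" src=\"(.*?)\">");

    // Returns capture group 1 (the URL) of the first match,
    // or null when the pattern does not match at all.
    public static String extractThumbnailUrl(String html) {
        Matcher m = IMAGE_SRC.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Made-up sample markup for illustration only.
        String html =
                "<img itemprop=\"image\" src=\"https://example.com/thumb.jpg\">";
        System.out.println(extractThumbnailUrl(html));
    }
}
```

Note that the expression is strict about layout: it only matches when src immediately follows itemprop="image" and the tag closes right after the URL, so markup with a different attribute order yields no match.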

A quick check of the config file with --checkcfg reports an XML validation error for TextPatternTagger:

derek@solr:/var/norconex-collector-http-2.7.1$ ./collector-http.sh --checkcfg  -c myredacteddomain-config.xml
INFO  [AbstractCrawlerConfig] Reference filter loaded: ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=zip,gif,jpg,jpeg,png,caseSensitive=false]
ERROR (XML Validation) TextPatternTagger: cvc-complex-type.3.2.2: Attribute 'valueGroup' is not allowed to appear in element 'pattern'.
ERROR [XMLConfigurationUtil$LogErrorHandler] (XML Validation) TextPatternTagger: cvc-complex-type.3.2.2: Attribute 'valueGroup' is not allowed to appear in element 'pattern'.
INFO  [AbstractCollectorConfig] Configuration loaded: id=MySite Config HTTP Collector; logsDir=./myredacteddomain-output/logs; progressDir=./myredacteddomain-output/progress
There were 1 XML configuration error(s).

I'm referring to the documentation here: http://www.norconex.com/collectors/importer/latest/apidocs/com/norconex/importer/handler/tagger/impl/TextPatternTagger.html. What did I miss?

dhildreth commented 7 years ago

Oh, shucks... it's a version issue. I'm on version 2.7.1, and I need 2.8 in order to use match groups (the valueGroup attribute).