Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
I'm getting a strange error when attempting to use TextPatternTagger. I'm hoping it's just something I'm doing wrong, but it seems like an odd one. My goal is to extract a thumbnail image for each page to display in search results. These images are identified by itemprop="image" followed by the URL source. I'd like to save that URL to a field called thumbnail_url. I have the tagger in my preParseHandlers section.
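As a rough illustration of the extraction described above, here is a plain Java regex sketch. It is not the Norconex configuration and does not use the TextPatternTagger API. The class name, the sample HTML, and the assumption that the URL sits in a content, src, or href attribute after itemprop="image" are all hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ThumbnailRegexSketch {

    // Hypothetical pattern: after itemprop="image", capture the value of the
    // next content/src/href attribute inside the same tag (group 1 = URL).
    static final Pattern IMAGE = Pattern.compile(
            "itemprop=\"image\"[^>]*?(?:content|src|href)=\"([^\"]+)\"",
            Pattern.CASE_INSENSITIVE);

    // Returns the first matching URL, or null when the page has none.
    static String extractThumbnailUrl(String html) {
        Matcher m = IMAGE.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Made-up sample markup, for illustration only.
        String html = "<img itemprop=\"image\" src=\"http://example.com/thumb.jpg\">";
        System.out.println("thumbnail_url = " + extractThumbnailUrl(html));
    }
}
```

The capture group is what matters here: only group 1 (the URL) would be stored in thumbnail_url, not the whole match.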
A quick check of the config file shows me some errors about the TextPatternTagger:
derek@solr:/var/norconex-collector-http-2.7.1$ ./collector-http.sh --checkcfg -c myredacteddomain-config.xml
INFO [AbstractCrawlerConfig] Reference filter loaded: ExtensionReferenceFilter[onMatch=EXCLUDE,extensions=zip,gif,jpg,jpeg,png,caseSensitive=false]
ERROR (XML Validation) TextPatternTagger: cvc-complex-type.3.2.2: Attribute 'valueGroup' is not allowed to appear in element 'pattern'.
ERROR [XMLConfigurationUtil$LogErrorHandler] (XML Validation) TextPatternTagger: cvc-complex-type.3.2.2: Attribute 'valueGroup' is not allowed to appear in element 'pattern'.
INFO [AbstractCollectorConfig] Configuration loaded: id=MySite Config HTTP Collector; logsDir=./myredacteddomain-output/logs; progressDir=./myredacteddomain-output/progress
There were 1 XML configuration error(s).
I'm referring to the documentation here: http://www.norconex.com/collectors/importer/latest/apidocs/com/norconex/importer/handler/tagger/impl/TextPatternTagger.html. What did I miss?