Closed yvesnyc closed 9 years ago
Hi @yvesnyc !
Can you provide some details about your setup, please? For example, what version of the Http Collector and Elasticsearch server are you using? Also, if you could share the configuration files of the Http Collector you have, that would be helpful.
Thanks!
Hi @pascaldimassimo ,
Thanks for working on this.
I am using norconex-collector-http-2.1.0 and norconex-committer-elasticsearch-2.0.1, and I'm running Elasticsearch 1.5.2. The configuration file is later modified in code to set the maxDepth, start URL, and regex filter. Here is the config.xml:
<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright 2010-2014 Norconex Inc.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<httpcollector id="Complex Crawl">

  #set($http = "com.norconex.collector.http")
  #set($core = "com.norconex.collector.core")
  #set($urlNormalizer = "${http}.url.impl.GenericURLNormalizer")
  #set($filterExtension = "${core}.filter.impl.ExtensionReferenceFilter")
  #set($filterRegexRef = "${core}.filter.impl.RegexReferenceFilter")

  <progressDir>${workdir}/progress</progressDir>
  <logsDir>${workdir}/logs</logsDir>

  <crawlerDefaults>
    <urlNormalizer class="$urlNormalizer" />
    <numThreads>4</numThreads>
    <maxDepth>2</maxDepth>
    <workDir>$workdir</workDir>
    <orphansStrategy>DELETE</orphansStrategy>
    <!-- We know we don't want to crawl the entire site, so ignore the sitemap. -->
    <sitemap ignore="true" />
    <committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
      <indexName>RepositoryRequestCrawl</indexName>
      <typeName>webdoc</typeName>
      <clusterName>es-crawler</clusterName>
    </committer>
    <referenceFilters>
      <!-- Note we ignore PowerPoint and other possible repository descriptors. -->
      <filter class="$filterExtension" onMatch="exclude">jpg,gif,png,ico,css,js,sit,eps,wmf,ppt,mpg,xls,rpm,mov,exe,jpeg,bmp</filter>
      <filter class="$filterRegexRef" onMatch="include">http://([a-z0-9]+\.)*redis\.io(/.*)?</filter>
    </referenceFilters>
  </crawlerDefaults>

  <crawlers>
    <crawler id="Norconex Redis Crawl">
      <startURLs>
        <url>http://redis.io</url>
      </startURLs>
    </crawler>
  </crawlers>

</httpcollector>
This is the config.variables file:
workdir = ./crawl-output/mycrawler
I think the issue is caused by a Lucene version mismatch. In norconex-committer-elasticsearch-2.0.1, we include lucene-core-4.10.4.jar, while in norconex-collector-http-2.1.0, we include lucene-core-5.0.0.jar. In Lucene 5, Version.LUCENE_3_6 does not exist anymore, hence the NoSuchFieldError. Try removing Lucene 5 from your lib folder and using only the Lucene jars bundled with the committer to see if that fixes the issue.
@essiembre I see that the importer is now using Lucene 5. Are we using features exclusive to this version?
It is only bundled with the Importer as a dependency of the TitleGeneratorTagger. If that tagger is not being used, one can remove the Lucene jar bundled with the Importer without issues. The TitleGeneratorTagger has quite a few dependencies, so maybe it should be modified to not rely on so many, or maybe they could be made optional.
Thanks @pascaldimassimo and @essiembre. Yes, the inclusion of the Lucene 5.0 jars was the problem.
Specifically, the solution was to remove the lucene-core-5.0.0 and lucene-analyzers-common-5.0.0 jars pulled in by the norconex-collector-http-2.1.0 dependency. This was done with exclusion directives in the dependency declaration. The norconex-committer-elasticsearch-2.0.1 dependency pulls in the 4.10.4 versions.
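For reference, this is roughly what the exclusions look like, assuming a Maven build. It is only a sketch based on the artifact names above; the Norconex groupIds are assumed, so check Maven Central for the exact coordinates:

<dependencies>
  <!-- HTTP Collector, with the Lucene 5 jars excluded so they do not
       clash with the Lucene 4.10.4 jars pulled in by the committer. -->
  <dependency>
    <groupId>com.norconex.collectors</groupId> <!-- assumed groupId -->
    <artifactId>norconex-collector-http</artifactId>
    <version>2.1.0</version>
    <exclusions>
      <exclusion>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-core</artifactId>
      </exclusion>
      <exclusion>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-analyzers-common</artifactId>
      </exclusion>
    </exclusions>
  </dependency>

  <!-- Elasticsearch committer; brings in the Lucene 4.10.4 jars. -->
  <dependency>
    <groupId>com.norconex.collectors</groupId> <!-- assumed groupId -->
    <artifactId>norconex-committer-elasticsearch</artifactId>
    <version>2.0.1</version>
  </dependency>
</dependencies>

Because Maven exclusions apply to the whole transitive tree of a dependency, this also drops the Lucene 5 jars that arrive through the Importer.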
I am having trouble with the Elasticsearch committer. The crawler works fine, but when it tries to send to Elasticsearch it gets a java.lang.NoSuchFieldError: LUCENE_3_6. I've tried looking around for the source of this error but ran out of ideas.
Here is the Exception output of the crawler:
Any ideas?
Thanks